HMM in Python - Part II | Fangfang (“Fiona”) Sheng

A hidden Markov model is a Markov chain for which the state is only partially observable. In other words, observations are related to the state of the system, but they are typically insufficient to precisely determine the state. Several well-known algorithms for hidden Markov models exist. For example, given a sequence of observations, the Viterbi algorithm will compute the most-likely corresponding sequence of states, the forward algorithm will compute the probability of the sequence of observations, and the Baum–Welch algorithm will estimate the starting probabilities, the transition function, and the observation function of a hidden Markov model.

One common use is for speech recognition, where the observed data is the speech audio waveform and the hidden state is the spoken text. In this example, the Viterbi algorithm finds the most likely sequence of spoken words given the speech audio.

#!/usr/bin/env python
# Generating a DNA sequence using a Hidden Markov model implemented in python
# author: Fangfang Sheng

from numpy.random import choice
import numpy as np
import pandas as pd


def generateFirstOrderhmmseq(lengthSeq, nucleotides, initialProb, states, transitionmatrix, emissionmatrix):
    """
    Function to generate a DNA sequence, given a HMM and the length of the sequence to be generated.

    :param lengthSeq: length of the sequence
    :param nucleotides: alphabet that compounds the sequence
    :param initialProb: initial probability distribution
    :param states: states for the HMM
    :param transitionmatrix: matrix that stores the probability distribution of transitions
    :param emissionmatrix: matrix that stores the probability distribution of emissions
    :return outputSeq: the generated DNA sequence
    """
    # Create a list for storing the new sequence
    outputSeq = []
    # Create a list for storing the state that each position in the new sequence
    mystates = []
    # Choose the state for the first position in the sequence:
    firststate = choice(states, 1, p=initialProb)[0]
    # Get the probabilities of the current nucleotide, given that we are in the state "firststate":
    probabilities = list(emissionmatrix.loc[firststate, :].squeeze())
    # Choose the nucleotide for the first position in the sequence:
    firstnucleotide = choice(nucleotides, 1, p=probabilities)[0]

    # Store the nucleotide and state for the first position of the sequence
    outputSeq.append(firstnucleotide)
    mystates.append(firststate)

    for i in range(1, lengthSeq):
        prevstate = mystates[i-1]
        # Get the probabilities of the current state
        # given that the previous nucleotide was generated by state "prevstate"
        stateprobs = list(transitionmatrix.loc[prevstate, : ].squeeze())

        # Choose the state for the ith position in the sequence:
        state = choice(states, 1, p=stateprobs)[0]
        # Get the probabilities of the current nucleotide, given that we are in the state "state":
        probabilities = list(emissionmatrix.loc[state, :].squeeze())
        # Choose the nucleotide for the ith position in the sequence:
        nucleotide = choice(nucleotides, 1, p=probabilities)[0]
        # Store the nucleotide and state for the current position of the sequence
        outputSeq.append(nucleotide)
        mystates.append(state)

    for i in range(lengthSeq):
        print(f'Position {i}, State {mystates[i]}, Nucleotide = {outputSeq[i]} ')

    return outputSeq


def main():
    lengthSeq = 15
    nucleotides = ["A", "C", "G", "T"]
    initialProb = [0.5, 0.5]
    states = ["AT-rich", "GC-rich"]
    ATrichprobs = [0.7, 0.3]
    GCrichprobs = [0.1, 0.9]
    theTransitionMatrix = pd.DataFrame([ATrichprobs, GCrichprobs], columns=states, index=states)
    ATrichstateprobs = [0.39, 0.1, 0.1, 0.41]
    GCrichstateprobs = [0.1, 0.41, 0.39, 0.1]
    theEmissionMatrix = pd.DataFrame([ATrichstateprobs, GCrichstateprobs], columns=nucleotides, index=states)
    hmmfirstOrderSeq = generateFirstOrderhmmseq(lengthSeq, nucleotides, initialProb, states, \
                                                theTransitionMatrix, theEmissionMatrix)
    print(hmmfirstOrderSeq)


if __name__ == '__main__':
    main()