HMM in Python - Part I | Fangfang (“Fiona”) Sheng

This blog is about how to generate a DNA sequence using a Markov model and implemented the process in python.

First thing first, a Markov model is a stochastic model used to model randomly changing systems.[1] It is assumed that future states depend only on the current state, not on the events that occurred before it (that is, it assumes the Markov property).

The simplest Markov model is the Markov chain. It models the state of a system with a random variable that changes through time.[2] In this context, the Markov property suggests that the distribution for this variable depends only on the distribution of a previous state. Following is how I simulated the way to generate a first order Markov DNA sequence.


#!/usr/bin/env python
# Generating a DNA sequence using a Markov model implemented in python
# author: Fangfang Sheng


from numpy.random import choice
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from collections import Counter
import operator

# Create a function to plot the first order Markov data
def plotData(seq, nucleotides, title):
    """
    This function can plot the first order Markov data.

    :param seq: sequence generated
    :param nucleotides: alphabet that compounds the sequence
    :param title: title of the plot
    """
    # Count the frequency of each base
    base, base_counts = np.unique(seq, return_counts=True)
    base_percentages = (base_counts / base_counts.sum())

    # Count the frequency of dinucleotides
    dinucleotides = []
    for i in range(len(seq)-1):
        dinucleotides.append(''.join(seq[i:i+2]))
    dinucleotides_counter = Counter(dinucleotides)
    sorted(dinucleotides_counter.items(), key=operator.itemgetter(0))


    # Count the frequency of trinucleotides
    trinucleotides = []
    for i in range(len(seq)-2):
        trinucleotides.append(''.join(seq[i:i+3]))
    trinucleotides_counter = Counter(trinucleotides)
    sorted_trinucleotides = dict(sorted(trinucleotides_counter.items(), key=operator.itemgetter(0)))

    # Now, plot the results in the same plot:
    fig, axs = plt.subplots(3)
    axs[0].bar(base, base_percentages)
    axs[0].set_title(title)

    axs[1].bar(dinucleotides_counter.keys(), dinucleotides_counter.values())

    axs[2].bar(sorted_trinucleotides.keys(), sorted_trinucleotides.values())
    xticklabels = sorted_trinucleotides.keys()
    axs[2].set_xticklabels(xticklabels, rotation=90, fontsize=8)

    for ax in axs.flat:
        ax.set(xlabel='Base', ylabel='Base_Proportion')

    fig.tight_layout()
    plt.show()

def transitionmatrix(afterAprobs):
    """

    :param afterAprobs: A dictionary given the probabilities pA, pC, pG, and pT for the cases where the previous
                        nucleotide was “A”, “C”, “G”, or “T”, respectively.
    :return transitionmatrix_df: dataframe contains the element in a particular row and column of the transition matrix
                                (e.g. the row for “A”, column for “C”) holds the probability (pAC) of choosing a particular
                                nucleotide (“C”) at the current position in the sequence,
                                given that was a particular nucleotide (“A”) at the previous position in the sequence.
    """
    transitionmatrix_df = pd.DataFrame(afterAprobs)
    transitionmatrix_df.index = afterAprobs.keys()
    return transitionmatrix_df


def generateFirstOrderMarkovSeq(lengthSeq, nucleotides, initialProb, firstOrderMatrix):
    """
    This function can generate a sequence using a particular Markov model given inputs.

    :param lengthSeq: length of the sequence
    :param nucleotides: the alphabet for the sequence
    :param initialProb: initial probability distribution
    :param firstOrderMatrix: matrix that stores the probability distribution of a first order Markov Chain
    :return outputSeq: the sequence generated by the certain MM
    """
    # Create a list for storing the new sequence
    outputSeq = []
    firstnucleotide = choice(nucleotides, 1, p=initialProb)[0]
    # Store the nucleotide for the first position of the sequence
    outputSeq.append(firstnucleotide)

    # Let the computer decide:
    for i in range(1, lengthSeq):
        prevNuc = outputSeq[0]
        currentProb = list(firstOrderMatrix.loc[:, prevNuc].squeeze())
        outputSeq.append(choice(nucleotides, 1, p=currentProb)[0])
    return outputSeq

def main():
    inProb = [0.4, 0.1, 0.1, 0.4]
    nucleotides = ["A", "C", "G", "T"]

    zeroOrderSeq = choice(nucleotides, 1000, p=inProb)
    # plotData(zeroOrderSeq, nucleotides, 'Compositional bias of each nucleotide')
    transitionprob = {'A': [0.2, 0.3, 0.3, 0.2],
                      'C': [0.1, 0.41, 0.39, 0.1],
                      'G': [0.25, 0.25, 0.25, 0.25],
                      'T': [0.5, 0.17, 0.17, 0.16]
                    }
    # Got the transition matrix
    mytransitionmatrix = transitionmatrix(transitionprob)

    # Use the generateFirstOrderSeq function to generate a sequence of 30 bases long
    firstOrderSeq = generateFirstOrderMarkovSeq(30, nucleotides, inProb, mytransitionmatrix)
    # plot the data
    plotData(firstOrderSeq, nucleotides, "Markov Chain of first order")


if __name__ == '__main__':
    main()