Yuliang Zhang

隐马尔科夫模型与PloyA尾长度检测

张宇亮 / 2017-09-21


前几天一同学遇到了一个问题:即在mRNA的3’UTR的ployA尾部加上接头,通过二代测序的方法检测polyA尾部的长度,但由于二代测序的技术问题,会存在以下问题,即最后的PloyA由于是一连串的A会出现测不准的现象,这样就给后续分析带来了极大的困难,之前和大家一起讨论时候听到了这个问题,当时提了几个解决方案,可行度不是太强,前几天在看隐马尔科夫模型时突然想到了一个点子,如下: 先重新描述一下问题,假设存在这样一个序列,3’UTR序列接了一串PolyA

TCGCGTGACGTGTCATGCTAGAAAAAAAAAAAAAAAAAAAAAAAAAA

现在由于测序的错误变成了以下序列

TCGCGTGACGTGTCATGCTAGAAAAAAAAAGAACAAAAAGAAAAAAA

那么如何修正这些错误的碱基并得到polyA-tail的实际长度呢?假设存在两个隐状态,’ployA-tail’和’UTR’,显状态为A、T、G、C,隐状态之间的转移概率为P(‘UTR’–>‘ployA-tail’)=0.21, P(‘ployA-tail’–>‘UTR’)=0.001几乎不发生。在PolyA隐状态向显状态A、T、G、C的转移(或发散)概率为0.925、0.025、0.025、0.0252,建立以上模型之后就可以使用veterbi算法进行解码了,Python代码如下:

# -*- coding: utf-8 -*-
def viterbi(obs, states, start_p, trans_p, emit_p):
       V = [{}]
       for st in states:
           V[0][st] = {"prob": start_p[st] * emit_p[st][obs[0]], "prev": None}
       # Run Viterbi when t > 0
       for t in range(1, len(obs)):
           V.append({})
           for st in states:
               max_tr_prob = max(V[t-1][prev_st]["prob"]*trans_p[prev_st][st] for prev_st in states)
               for prev_st in states:
                   if V[t-1][prev_st]["prob"] * trans_p[prev_st][st] == max_tr_prob:
                       max_prob = max_tr_prob * emit_p[st][obs[t]]
                       V[t][st] = {"prob": max_prob, "prev": prev_st}
                       break
       for line in dptable(V):
           print(line)
       opt = []
       # The highest probability
       max_prob = max(value["prob"] for value in V[-1].values())
       previous = None
       # Get most probable state and its backtrack
       for st, data in V[-1].items():
           if data["prob"] == max_prob:
               opt.append(st)
               previous = st
               break
       # Follow the backtrack till the first observation
       for t in range(len(V) - 2, -1, -1):
           opt.insert(0, V[t + 1][previous]["prev"])
           previous = V[t + 1][previous]["prev"]

       print('The steps of states are ' + ' '.join(opt) + ' with highest probability of %s' % max_prob)

def dptable(V):
       # Print a table of steps from dictionary
       yield( " ".join(("%12d" % i) for i in range(len(V))))
       for state in V[0]:
           yield ("%.7s: " % state + " ".join("%.7s" % ("%f" % v[state]["prob"]) for v in V))


str='TCGCGTGACGTGTCATGCTAGAAAAAAAAAGAACAAAAAGAAAAAAA'
obs=list(str)
#obs = ('A', 'T', 'G','C')
states = ('T', 'U')
start_p = {'U': 0.99, 'T': 0.01}
trans_p = {
   'U' : {'U': 0.8, 'T': 0.2},
   'T' : {'U': 0.001, 'T': 0.999}
   }
emit_p = {
   'U' : {'A': 0.25, 'T': 0.25, 'G': 0.25,'C':0.25},
   'T' : {'A': 0.925, 'T': 0.025, 'G': 0.025,'C':0.025}
   }
#decode
viterbi(obs,states,start_p,trans_p,emit_p)

让我们来看一下最后的解码结果

TCGCGTGACGTGTCATGCTAG21AAAAAAAAAAAAAAAAAAAAAAAAAA TCGCGTGACGTGTCATGCTAG21AAAAAAAAAGAACAAAAAGAAAAAAA UUUUUUUUUUUUUUUUUUUUU21TTTTTTTTTTTTTTTTTTTTTTTTTT3

可以看出模型很好的还原了read中3’UTR与PloyA-tail的关系,耶(^-^)V!为了鼓励自己下次还能想出好的方法,决定给自己晚上加个鸡腿😋。晚上去晚了鸡腿卖完了QAQ,晚饭后补


  1. 这个是假设的不影响后续的结果,也可以用B-W算法去学习这个值,或者直接估计为read中3’UTR序列的长度分之一
  2. 考虑了测序的错误,错误率为0.075
  3. U 3’UTR,T ployA-tail