隐马尔科夫模型与PloyA尾长度检测
张宇亮 / 2017-09-21
前几天一同学遇到了一个问题:即在mRNA的3’UTR的ployA尾部加上接头,通过二代测序的方法检测polyA尾部的长度,但由于二代测序的技术问题,会存在以下问题,即最后的PloyA由于是一连串的A会出现测不准的现象,这样就给后续分析带来了极大的困难,之前和大家一起讨论时候听到了这个问题,当时提了几个解决方案,可行度不是太强,前几天在看隐马尔科夫模型时突然想到了一个点子,如下: 先重新描述一下问题,假设存在这样一个序列,3’UTR序列接了一串PolyA
TCGCGTGACGTGTCATGCTAGAAAAAAAAAAAAAAAAAAAAAAAAAA
现在由于测序的错误变成了以下序列
TCGCGTGACGTGTCATGCTAGAAAAAAAAAGAACAAAAAGAAAAAAA
那么如何修正这些错误的碱基并得到polyA-tail的实际长度呢?假设存在两个隐状态,’ployA-tail’和’UTR’,显状态为A、T、G、C,隐状态之间的转移概率为P(‘UTR’–>‘ployA-tail’)=0.21, P(‘ployA-tail’–>‘UTR’)=0.001几乎不发生。在PolyA隐状态向显状态A、T、G、C的转移(或发散)概率为0.925、0.025、0.025、0.0252,建立以上模型之后就可以使用veterbi算法进行解码了,Python代码如下:
# -*- coding: utf-8 -*-
def viterbi(obs, states, start_p, trans_p, emit_p):
V = [{}]
for st in states:
V[0][st] = {"prob": start_p[st] * emit_p[st][obs[0]], "prev": None}
# Run Viterbi when t > 0
for t in range(1, len(obs)):
V.append({})
for st in states:
max_tr_prob = max(V[t-1][prev_st]["prob"]*trans_p[prev_st][st] for prev_st in states)
for prev_st in states:
if V[t-1][prev_st]["prob"] * trans_p[prev_st][st] == max_tr_prob:
max_prob = max_tr_prob * emit_p[st][obs[t]]
V[t][st] = {"prob": max_prob, "prev": prev_st}
break
for line in dptable(V):
print(line)
opt = []
# The highest probability
max_prob = max(value["prob"] for value in V[-1].values())
previous = None
# Get most probable state and its backtrack
for st, data in V[-1].items():
if data["prob"] == max_prob:
opt.append(st)
previous = st
break
# Follow the backtrack till the first observation
for t in range(len(V) - 2, -1, -1):
opt.insert(0, V[t + 1][previous]["prev"])
previous = V[t + 1][previous]["prev"]
print('The steps of states are ' + ' '.join(opt) + ' with highest probability of %s' % max_prob)
def dptable(V):
# Print a table of steps from dictionary
yield( " ".join(("%12d" % i) for i in range(len(V))))
for state in V[0]:
yield ("%.7s: " % state + " ".join("%.7s" % ("%f" % v[state]["prob"]) for v in V))
str='TCGCGTGACGTGTCATGCTAGAAAAAAAAAGAACAAAAAGAAAAAAA'
obs=list(str)
#obs = ('A', 'T', 'G','C')
states = ('T', 'U')
start_p = {'U': 0.99, 'T': 0.01}
trans_p = {
'U' : {'U': 0.8, 'T': 0.2},
'T' : {'U': 0.001, 'T': 0.999}
}
emit_p = {
'U' : {'A': 0.25, 'T': 0.25, 'G': 0.25,'C':0.25},
'T' : {'A': 0.925, 'T': 0.025, 'G': 0.025,'C':0.025}
}
#decode
viterbi(obs,states,start_p,trans_p,emit_p)
让我们来看一下最后的解码结果
TCGCGTGACGTGTCATGCTAG21AAAAAAAAAAAAAAAAAAAAAAAAAA TCGCGTGACGTGTCATGCTAG21AAAAAAAAAGAACAAAAAGAAAAAAA UUUUUUUUUUUUUUUUUUUUU21TTTTTTTTTTTTTTTTTTTTTTTTTT3
可以看出模型很好的还原了read中3’UTR与PloyA-tail的关系,耶(^-^)V!为了鼓励自己下次还能想出好的方法,决定给自己晚上加个鸡腿😋。晚上去晚了鸡腿卖完了QAQ,晚饭后补。
