Читать книгу Informatics and Machine Learning. From Martingales to Metaheuristics онлайн

90 страница из 101


ORF encoding structure is revealed in the V. cholera genome by gaps between stop codons in the genomic sequence

3.3 ORF Discovery from Long‐Tail Distribution Anomaly

Once codon grouping is revealed, where a frequency analysis on codons on the stop codons (TAA, TAG, TGA) shows they are rare. Focusing on the stop codons it is easily found that the gaps between stop codons can be quite anomalous compared to the gaps between other codons (see prog2.py addendum 6):

------------------- prog2.py addendum 6 --------------------- # need gap stats between codons, the stop codon group (orf_finder), and # the 'common' reference group (corf_finder): def orf_finder ( seq, frame ): gapcounts = {} edgecounts = {} pattern = '[acgtACGT]' result = re.findall(pattern, seq) seqlen = len(seq) output_fh = open("orf_output", 'w') output_fh.close() oldindex=0 oldcodon="" for index in range(frame,seqlen-2): rem = (index+3-frame)%3 if rem!=0: continue codon = result[index]+result[index+1]+result[index+2] if (codon!="TAA" and codon!="TAG" and codon!="TGA"): continue else: gap = index - oldindex if gap%3!=0: print "gap=", gap, "index=", index break quant = 100 bin = gap/quant if oldindex!=0: if bin in gapcounts: gapcounts[bin]+=1 else: gapcounts[bin]=1 if oldcodon!="": edge=oldcodon + codon if edge in edgecounts: edgecounts[edge]+=1 else: edgecounts[edge]=1 slice = result[oldindex: index+2+1] output_fh = open("orf_output", 'a') slicejoin = "" slicejoin = slicejoin.join(slice) orfline = slicejoin + '\n' output_fh.write(orfline) oldindex=index oldcodon=codon npcounts = np.empty((0)) for i in sorted(gapcounts): npcounts = np.append(npcounts,gapcounts[i]+0.0) print "gapbin", i, "count =", gapcounts[i] ecounts = np.empty((0)) for i in sorted(edgecounts): ecounts = np.append(ecounts,edgecounts[i]+0.0) print "edgecodon", i, "count =", edgecounts[i] probs = count_to_freq(npcounts) #usage: orf_finder(EC_sequence,0) # def corf_finder ( seq, frame ): # same except not 'stop' boundariy condition but 'common': # if (codon!="AAA" and codon!="GAA" and codon!="GAT") #usage: #corf_finder(EC_sequence,0) --------------- prog2.py addendum 6 end ---------------------

Правообладателям