Читать книгу Informatics and Machine Learning. From Martingales to Metaheuristics онлайн

55 страница из 101

Note on the informatics result: notice how the Shannon entropy for the frequencies of {a,c,g,t} in the genomic data differs only slightly from the Shannon entropy that would be found for a perfectly random, uniform, probability distribution (e.g. log(4)). This shows that although the genomic data is not random, i.e. the genomic data holds “information” of some sort, but we are only weakly seeing it at the level of single base usage.

In order to see clearer signs of nonrandomness, let us try evaluating frequencies at the base‐pair, or dinucleotide, level. There are 16 (4 × 4) dinucleotides that we must now get counts on:

------------------ prog1.py addendum 6 ----------------------- di_uniform = [1.0/16]*16 stats = {} for i in result: if i in stats: stats[i]+=1 else: stats[i]=1 #for i in sorted(stats, key=stats.get): for i in sorted(stats): print("%dx'%s'" % (stats[i],i)) stats = {} for index in range(1,seqlen): dinucleotide = result[index-1] + result[index] if dinucleotide in stats: stats[dinucleotide]+=1 else: stats[dinucleotide]=1 for i in sorted(stats): print("%dx'%s'" % (stats[i],i)) -------------------- end prog1.py addendum 6 -----------------

Правообладателям