Informatics and Machine Learning. From Martingales to Metaheuristics
Figure: The start of the E. coli genome file, FASTA format.
---------------------- prog1.py addendum 5 -------------------
pattern = '[acgt]'
result = re.findall(pattern, str)
seqlen = len(result)
# sequence = ""
# sequence = sequence.join(result)
# print(sequence)
print("The sequence length of the Norwalk genome is: ")
print(seqlen)
a_count=0
c_count=0
g_count=0
t_count=0
for index in range(0,seqlen):
    if result[index] == 'a':
        a_count+=1.0
    elif result[index] == 'c':
        c_count+=1.0
    elif result[index] == 'g':
        g_count+=1.0
    elif result[index] == 't':
        t_count+=1.0
    else:
        print("bad char\n")
norwalk_counts = np.array([a_count, c_count, g_count, t_count])
print(norwalk_counts)
norwalk_probs = np.array([0.0,0,0,0])
norwalk_probs = count_to_freq(norwalk_counts)
value = shannon(norwalk_probs)
print(value)
-------------------- end prog1.py addendum 5 -----------------
We now traverse the array of single-base characters (a, c, g, t) extracted from the raw genome data file and increment the counter for each base as it appears. When the traversal finishes we have the four counts we need, and can pass them to our subroutines (count_to_freq, then shannon) to compute the Shannon entropy of the base frequencies.
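The counting and entropy steps can also be sketched as a standalone script. This is a minimal illustration, not the book's prog1.py: it uses only the standard library instead of NumPy, and the names base_counts, counts_to_freqs, and shannon_entropy are hypothetical stand-ins for the earlier subroutines. The choice of log base 2 (entropy in bits) is likewise an assumption.

```python
import math
import re
from collections import Counter


def base_counts(text):
    """Return the counts of a, c, g, t (in that order) found in text."""
    # Same pattern as the listing: keep only lowercase a/c/g/t characters.
    bases = re.findall("[acgt]", text)
    tally = Counter(bases)
    return [tally[b] for b in "acgt"]


def counts_to_freqs(counts):
    """Normalize raw counts to relative frequencies (hypothetical helper)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no bases found")
    return [c / total for c in counts]


def shannon_entropy(probs):
    """Shannon entropy in bits (log base 2 is an assumption here)."""
    # Terms with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Toy sequence standing in for the genome text read from file.
toy = "acgtacgtaacc"
counts = base_counts(toy)
probs = counts_to_freqs(counts)
print(counts)
print(shannon_entropy(probs))
```

With four symbols the entropy in bits is at most 2, reached when all four bases are equally frequent. One caveat that applies to any '[acgt]' scan: if the input text still contains a FASTA header line, lowercase a, c, g, or t characters in that header are counted as bases too.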