Informatics and Machine Learning. From Martingales to Metaheuristics
3.1.5 Information Measures Recap
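Shannon entropy: $\sigma = -\sum_{x} p(x)\,\log p(x)$
Relative entropy: $\rho = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)}$
Mutual information: $\mu = \sum_{x}\sum_{y} p(xy)\,\log\frac{p(xy)}{p(x)\,p(y)}$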
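As a quick numerical check of the three measures recapped above, the following is a minimal sketch in Python. The function names (`shannon_entropy`, `relative_entropy`, `mutual_information`) and the toy distributions are illustrative assumptions, not code from the book; the natural logarithm is used, and terms with zero probability are taken to contribute zero.

```python
import math

def shannon_entropy(p):
    """sigma = -sum_x p(x) log p(x); zero-probability terms are skipped."""
    return -sum(px * math.log(px) for px in p if px > 0)

def relative_entropy(p, q):
    """rho = sum_x p(x) log(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def mutual_information(joint):
    """mu = sum_x sum_y p(xy) log(p(xy) / (p(x) p(y))), with joint[x][y] = p(xy)."""
    px = [sum(row) for row in joint]          # marginal over y
    py = [sum(col) for col in zip(*joint)]    # marginal over x
    total = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                total += pxy * math.log(pxy / (px[i] * py[j]))
    return total

# Toy distributions (illustrative only)
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.4, 0.3, 0.2, 0.1]
independent = [[0.25, 0.25], [0.25, 0.25]]   # p(xy) = p(x) p(y)
correlated = [[0.5, 0.0], [0.0, 0.5]]        # x determines y

print("entropy, uniform:", shannon_entropy(uniform))
print("entropy, skewed:", shannon_entropy(skewed))
print("relative entropy, skewed vs uniform:", relative_entropy(skewed, uniform))
print("mutual information, independent:", mutual_information(independent))
print("mutual information, correlated:", mutual_information(correlated))
```

The uniform four-letter distribution attains the maximum entropy log(4), the relative entropy of a distribution with itself is zero, and the mutual information vanishes when the joint distribution factorizes.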
The next program, cleverly named prog2.py, builds on the code devised previously. The file I/O operation is now "lifted" into a subroutine, both for safer encapsulation (avoiding scope errors and the like) and to avoid the clutter of repeatedly copying and pasting a large block of code. By now this has hopefully made a convincing case for why subroutines are a big deal in the evolution of software engineering constructs (and of the computer languages that implement them). Further discussion is given in the comments in the code.
------------------------ prog2.py ---------------------------
#!/usr/bin/env python3
import numpy as np
import math
import re

# From prior code we carry over the subroutines:
#   shannon, count_to_freq, shannon_order
# (their definitions must appear here, before the calls below)
# with prototypes:
#   def shannon( probs ) with usage:
#       value = shannon(probs)
#       print(value)
#
#   def count_to_freq( counts ) with usage:
#       probs = count_to_freq(rolls)
#       print(probs)
#
#   def shannon_order( seq, order ) with usage:
#       order = 8
#       maxcounts = 4**order
#       print("max counts at order", order, "is =", maxcounts)
#       val = math.log(maxcounts)
#       shannon_order(sequence, order)   # shannon_order prints entropy
#       print("The max entropy would be log(4**order) = ", val)

# New code is now created to have subroutines for text handling.
# There are two types of text-read, one for genome data in "fasta"
# format (gen_fasta_read) and one for generic format (gen_txt_read):

def gen_txt_read( text ):
    # An empty file name selects the default genome file
    if (text == ""):
        text = "Norwalk_Virus.txt"
    fo = open(text, "r")
    data = fo.read()
    fo.close()
    # Keep only the lower-case nucleotide characters
    pattern = '[acgt]'
    result = re.findall(pattern, data)
    # seqlen = len(result)
    return result

# usage
null = ""
gen_array = gen_txt_read(null)   # defaults, uses Norwalk_Virus.txt genome
sequence = ""
sequence = sequence.join(gen_array)
Norwalk_sequence = sequence
seqlen = len(gen_array)
print("The sequence length of the Norwalk genome is:", seqlen)

def gen_fasta_read( text ):
    # An empty file name selects the default genome file
    if (text == ""):
        text = "EC_Chr1.fasta.txt"
    slurp = open(text, 'r')
    lines = slurp.readlines()
    sequence = ""
    for line in lines:
        # Lines containing '>' are fasta comment (header) lines
        pattern = '>'
        test = re.findall(pattern, line)
        testlen = len(test)
        if testlen > 0:
            print("fasta comment:", line.strip())
        else:
            sequence = sequence + line.strip()
    slurp.close()
    # Keep nucleotide characters of either case
    pattern = '[acgtACGT]'
    result = re.findall(pattern, sequence)
    return result

# usage
null = ""
fasta_array = gen_fasta_read(null)   # defaults, uses e. coli genome
sequence = ""
sequence = sequence.join(fasta_array)
EC_sequence = sequence
seqlen = len(fasta_array)
print("The sequence length of the e. coli Chr1 genome is:", seqlen)

order = 6
print("Doing order", order, "shannon analysis on e-coli:")
shannon_order(EC_sequence, order)
--------------------- prog2.py end --------------------------
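The listing relies on three subroutines carried over from earlier code (shannon, count_to_freq, shannon_order) whose bodies are not reproduced here. For readers who want to try the call pattern in isolation, the following is a hypothetical stand-in written only from the prototypes in the comments. The overlapping sliding window, the use of `collections.Counter`, the returned value, and the toy sequence are assumptions and may differ from the book's own definitions.

```python
import math
from collections import Counter

def shannon(probs):
    """Shannon entropy (natural log) of a list of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def count_to_freq(counts):
    """Normalize a list of counts into frequencies that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def shannon_order(seq, order):
    """Print (and return) the entropy of the substrings of length `order`.

    Assumption: substrings are counted with an overlapping sliding window.
    """
    counts = Counter(seq[i:i + order] for i in range(len(seq) - order + 1))
    probs = count_to_freq(list(counts.values()))
    value = shannon(probs)
    print("Shannon entropy at order", order, "is", value)
    return value

# Toy usage on a made-up sequence (not genome data)
toy_sequence = "acgt" * 50
order = 1
shannon_order(toy_sequence, order)
print("The max entropy would be log(4**order) = ", math.log(4 ** order))
```

On the toy sequence each base occurs equally often, so the order-1 entropy reaches the maximum log(4); real genome data would fall below the corresponding maximum at each order.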