As mentioned previously, when comparing two probability distributions on the same set of outcomes, it is natural to ask whether they can be compared in terms of the difference of their scalar-valued Shannon entropies. Similarly, there is the standard manner of comparing multicomponent features by treating them as points in a manifold and performing the usual Euclidean distance calculation, generalized to whatever dimensionality the feature data has. Both of these approaches are wrong, especially the latter, when comparing discrete probability distributions (of the same dimensionality). The reason is that the probabilities satisfy additional constraints (nonnegative components that sum to 1), and the provably optimal difference measure under these circumstances, as described previously, is relative entropy. This will be explored in Exercise 3.5, so some related subroutines are included in the first addendum to prog2.py:
--------------------- prog2.py addendum 1 --------------------
import math

# We can use the shannon_order subroutine to return a probability array
# for a given sequence. Here are the probability arrays on 3-mers
# (ordered alphabetically):
Prob_Norwalk_3mer = shannon_order(Norwalk_sequence, 3)
Prob_EC_3mer = shannon_order(EC_sequence, 3)

# The standard Euclidean distance (squared) and relative entropy are given next.
def eucl_dist_sq(P, Q):
    Pnum = len(P)
    Qnum = len(Q)
    if Pnum != Qnum:
        print("error: Pnum != Qnum")
        return -1
    euclidean_distance_squared = 0
    for index in range(0, Pnum):
        euclidean_distance_squared += (P[index] - Q[index])**2
    return euclidean_distance_squared

# usage
value = eucl_dist_sq(Prob_Norwalk_3mer, Prob_EC_3mer)
print("The euclidean distance squared between EC and Nor 3mer probs is", value)

# P and Q are probability arrays, meaning components are positive definite
# and sum to one. If P and Q are probability arrays, we can compare them in
# terms of relative entropy (not Euclidean distance). Note: as written, this
# assumes every component of P and Q is nonzero, otherwise the log or the
# division below is undefined.
def relative_entropy(P, Q):
    Pnum = len(P)
    Qnum = len(Q)
    if Pnum != Qnum:
        print("error: Pnum != Qnum")
        return -1
    rel_entropy = 0
    for index in range(0, Pnum):
        rel_entropy += P[index] * math.log(P[index] / Q[index])
    return rel_entropy

# usage
value1 = relative_entropy(Prob_Norwalk_3mer, Prob_EC_3mer)
print("The relative entropy between Nor and EC 3mer probs is", value1)
value2 = relative_entropy(Prob_EC_3mer, Prob_Norwalk_3mer)
print("The relative entropy between EC and Nor 3mer probs is", value2)
sym = (value1 + value2) / 2
print("The symmetrized relative entropy between EC and Nor 3mer probs is", sym)
-------------------- prog2.py addendum 1 end ------------------
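Since relative entropy is not symmetric in its arguments, the addendum reports both orderings and their symmetrized average. The following is a minimal sketch of that behavior; it reuses the eucl_dist_sq and relative_entropy subroutines defined above with two hypothetical probability arrays (P_toy and Q_toy are made-up illustrative values, not taken from the Norwalk or E. coli sequences):

--------------------- toy illustration (not part of prog2.py) --------------------
# Hypothetical 3-component probability arrays (illustrative values only).
P_toy = [0.7, 0.2, 0.1]
Q_toy = [0.4, 0.4, 0.2]

# Relative entropy depends on the order of its arguments (asymmetric),
# and is zero when a distribution is compared with itself.
print("D(P||Q) =", relative_entropy(P_toy, Q_toy))   # approx 0.18
print("D(Q||P) =", relative_entropy(Q_toy, P_toy))   # approx 0.19, differs from D(P||Q)
print("D(P||P) =", relative_entropy(P_toy, P_toy))   # 0.0

# Symmetrized relative entropy, as in the addendum's usage example.
sym_toy = (relative_entropy(P_toy, Q_toy) + relative_entropy(Q_toy, P_toy)) / 2
print("symmetrized relative entropy =", sym_toy)
print("euclidean distance squared   =", eucl_dist_sq(P_toy, Q_toy))
--------------------- toy illustration end --------------------

Note that relative_entropy as written requires every component of both arrays to be nonzero; with k-mer frequencies from sufficiently long sequences this typically holds, but a zero component would make the log or the division undefined.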