Informatics and Machine Learning. From Martingales to Metaheuristics
In the example in the previous section we left off with counts on all 4096 (4^6) hexamers seen in a given genome. If we go from counts on substrings of length 6 to substrings of length 30, we run into a problem: there are now about a million million million (4^30 ≈ 10^18) possible substrings to get counts on. No genome is even remotely this large, so most substring counts will necessarily be zero. Due to the large number of possible substrings, this is often referred to as “the enumeration problem.” Since only nonzero counts need to be maintained, however, the number of counts to store is bounded by genome size, and at that scale there is no enumeration problem. The main mechanism for capturing count information on substrings without dedicated (array) memory is the use of associative memory constructs, such as the hash variable, and this technique is employed in the code examples.
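As a minimal sketch of the associative-memory idea described above, the following Python fragment counts only the substrings that actually occur, using a dictionary-backed `Counter` in place of a preallocated array. The function name `count_substrings` and the toy sequence are illustrative assumptions, not the book's own code.

```python
from collections import Counter


def count_substrings(sequence, k):
    """Count every length-k substring that actually occurs in `sequence`.

    Only observed substrings get an entry, so memory grows with the
    sequence length, not with the 4**k possible substrings.
    """
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        counts[sequence[i:i + k]] += 1
    return counts


# Toy stand-in for a genome (assumption: a real run would read a FASTA file).
genome = "ACGTACGTACGTTTGACA"
hexamer_counts = count_substrings(genome, 6)

# The number of stored entries can never exceed the number of windows.
num_windows = len(genome) - 6 + 1
print(len(hexamer_counts), "distinct hexamers in", num_windows, "windows")
```

The same function works unchanged for k = 30: the dictionary then holds at most one entry per window in the genome, and the roughly 10^18 unobserved substrings cost nothing.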
2.3 From Counts to Frequencies to Probabilities
The conventional relations on probabilities say nothing as to their interpretation. According to the Frequentist (frequency‐based) interpretation, probabilities are defined in terms of fractions of a set of observations, as the number of observations tends to infinity (where the LLN works to advantage). In practice, an infinite number of observations is never available, and often there is only a single observation (predicting the winner of a marathon, for example). Even in the case of one race, however, it seems intuitive that prior information would still be beneficial in predicting winners. With the formal introduction of prior probabilities we arrive at the Bayesian interpretation. From the Bayesian perspective, prior probabilities can be encoded as “pseudocounts” in the frequentist framework (i.e. observation counts do not necessarily initialize from zero). In the computer implementations used here there are typically tuned/selected pseudocounts and minimum/maximum probability cutoffs, so the implementations can be formally described on a Bayesian footing [1, 3].
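The pseudocount idea above can be illustrated with a short sketch: each symbol's count starts from a prior value instead of zero, and the adjusted counts are then normalized into probabilities. The function name `counts_to_probabilities`, the uniform pseudocount of 1.0, and the example counts are assumptions chosen for illustration, not values taken from the text.

```python
def counts_to_probabilities(counts, alphabet, pseudocount=1.0):
    """Turn raw counts into probabilities, adding a pseudocount per symbol.

    A pseudocount acts as a prior observation, so a symbol that was
    never seen still receives a small nonzero probability.
    """
    adjusted = {s: counts.get(s, 0) + pseudocount for s in alphabet}
    total = sum(adjusted.values())
    return {s: adjusted[s] / total for s in alphabet}


# Hypothetical observation counts: C and T were never observed.
observed = {"A": 3, "G": 1}
probs = counts_to_probabilities(observed, "ACGT", pseudocount=1.0)

for symbol in "ACGT":
    print(symbol, probs[symbol])
```

With a pseudocount of zero this reduces to the plain frequency estimate, under which any unobserved symbol would be assigned probability zero.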