Informatics and Machine Learning. From Martingales to Metaheuristics

In this chapter, we start with a description of information entropy and statistical measures (ssss1). Using these measures we then examine "raw" genomic data. No biology or biochemistry knowledge is needed for this analysis, and yet we almost trivially rediscover a three-element encoding scheme that is famous in biology, known as the codon. Analysis of information encoding in the four-element {a, c, g, t} genomic sequence alphabet is about as simple as you can get (without working with binary data), so it provides some of the introductory examples that are implemented. A few (simple) statistical queries to get the details of the codon encoding scheme are then straightforward (ssss1).

Once the encoding scheme is known to exist, further structure is revealed via the anomalous placement of "stop" codons; e.g. anomalously large open reading frames (ORFs) are discovered. A few more (simple) statistical queries from there, and the relation of ORFs to gene structure is revealed (ssss1).

Once you have a clear structure in the sequential data that can be referenced positionally, it is then possible to gather statistical information for a Markov model. One example of this is to look at the positional base statistics at various positions "upstream" from the start codon. We thereby identify binding sites for critical molecular interactions in both transcription and translation. Since the Markov model is needed in the analysis of sequential processes in general for what is discussed in later chapters (ssss1 and ssss1 in particular), a review of Markov models, and some of their specializations, is given in ssss1 (ssss1 and ssss1 cover Hidden Markov models, or HMMs).
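The first step above, computing information entropy over the sequence alphabet, can be sketched as follows. This is a minimal illustration (the function name and example sequence are this sketch's own, not the book's): Shannon entropy of a symbol sequence, which for a uniformly used four-letter alphabet reaches its maximum of 2 bits per symbol.

```python
from collections import Counter
from math import log2

def shannon_entropy(seq):
    """Shannon entropy in bits/symbol, estimated from symbol frequencies."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A sequence using all four bases equally often attains the maximum:
print(shannon_entropy("acgt" * 10))  # → 2.0
```

Deviations of the observed entropy from this 2-bit maximum (or from entropies computed on windows, or on codon-sized blocks) are exactly the kind of statistical signal used to detect structure such as the codon encoding.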
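The anomalous placement of stop codons can be probed with a simple in-frame scan. The sketch below (function name, minimum-length threshold, and test sequence are illustrative assumptions) walks each of the three reading frames, opens a candidate ORF at an "atg" start codon, and closes it at the first in-frame stop codon ("taa", "tag", or "tga"); anomalously long ORFs are the gene candidates discussed above.

```python
def find_orfs(seq, min_len=60):
    """Scan all three reading frames for ORFs: in-frame atg ... stop.

    Returns (start, end) index pairs; min_len is an illustrative
    nucleotide-length cutoff for "anomalously large" ORFs.
    """
    stops = {"taa", "tag", "tga"}
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "atg":
                start = i          # open a candidate reading frame
            elif start is not None and codon in stops:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3))
                start = None       # close it at the first in-frame stop
    return orfs

# Toy example: one short ORF in frame 0.
print(find_orfs("atgaaaaaaaaataa", min_len=12))  # → [(0, 15)]
```

In a random sequence a stop codon appears roughly every 21 codons on average (3 stops out of 64 codons), so ORFs running for hundreds of codons without a stop are statistically anomalous, which is what makes this simple query informative.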
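Gathering statistics for a Markov model, once positions can be referenced, reduces to counting. A minimal sketch of the first-order case (the function name and the toy sequence are this example's own assumptions) estimates the transition probabilities P(next base | current base) from observed adjacent-base counts:

```python
from collections import defaultdict

def markov_transitions(seq, alphabet="acgt"):
    """Estimate first-order Markov transition probabilities from counts."""
    counts = {a: defaultdict(int) for a in alphabet}
    for cur, nxt in zip(seq, seq[1:]):
        counts[cur][nxt] += 1
    probs = {}
    for a in alphabet:
        total = sum(counts[a].values())
        # Rows with no observations are left as all-zero.
        probs[a] = {b: counts[a][b] / total if total else 0.0
                    for b in alphabet}
    return probs

# In "acac" every 'a' is followed by 'c' and every 'c' by 'a':
print(markov_transitions("acac")["a"]["c"])  # → 1.0
```

The same counting idea, applied position-by-position in windows upstream of start codons rather than genome-wide, yields the positional base statistics used to spot binding sites.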