Informatics and Machine Learning. From Martingales to Metaheuristics
2.1.1 Sample Size Complications
The 6‐nucleotide statistics analyzed in prog1.py above are typically called a hexamer statistical analysis: the window size used to extract the substrings has “‐mer” appended, giving six‐mer or hexamer. The term “‐mer” comes from oligomer, a polymer containing a small number of monomers. For hexamers we saw that there are 4^6 = 4096 possible length‐six substrings when the “alphabet” of monomer types consists of four elements: a, c, g, and t. In the Norwalk virus analysis, this large number of distinct things to count, relative to the overall sample size, raises sampling questions. The Norwalk virus genome is only 7654 nucleotides long, so sweeping the six‐base window over that string yields only 7654 − 5 = 7649 hexamer samples. Even if the hexamers were uniformly distributed, the expected count would be 7649/4096 ≈ 1.87, that is, barely two counts per hexamer type. Limitations due to sample size play a critical role in these types of analysis.
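The arithmetic above can be checked with a short sketch. This is not the book's prog1.py; the function names `kmer_counts` and `expected_count_uniform` and the toy sequence are hypothetical, introduced only for illustration. It assumes the window slides one base at a time, so a sequence of length N yields N − k + 1 windows.

```python
from collections import Counter


def kmer_counts(seq, k=6):
    """Count every length-k substring seen by a window sliding one base at a time."""
    seq = seq.lower()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count_uniform(seq_len, k=6, alphabet_size=4):
    """Expected count per k-mer if all alphabet_size**k k-mers were equally likely."""
    n_windows = seq_len - k + 1          # e.g. 7654 - 5 = 7649 for the Norwalk genome
    n_kmers = alphabet_size ** k         # e.g. 4**6 = 4096 possible hexamers
    return n_windows / n_kmers


# Toy sequence (illustrative only, not real genome data).
toy = "acgtacgtacgt"
counts = kmer_counts(toy)
print(sum(counts.values()), "windows,", len(counts), "distinct hexamers")

# Expected count per hexamer for a 7654-base genome: roughly 1.87.
print(expected_count_uniform(7654))
```

With fewer than two expected observations per hexamer, many of the 4096 bins will hold zero or one count purely by chance, which is why frequency estimates from a genome this short are noisy.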