Informatics and Machine Learning. From Martingales to Metaheuristics
2.13 You have a genomic sequence of length L. (DNA genomes are approximately 10**4 bases long for viruses, 10**6 for bacteria, and 10**9 for mammals.) A typical analysis is to count the subsequences of length N within the full sequence: sliding a window of width N across the length-L sequence and sampling at each position yields L − N + 1 subsequences of length N. The number of possible subsequences of length N grows exponentially with N. For DNA subsequences of length six bases, the 6mers, with four base possibilities, {a,c,g,t}, there are 4**6 = 4096 possible 6mers. If the 6mers were equally probable, then in the approximately 10 000 bases of a virus each 6mer would be seen a couple of times (10 000/4096 ≈ 2.4), while in a mammalian genome each 6mer would be seen hundreds of thousands of times (10**9/4096 ≈ 244 000). Sounds fine so far, but now consider an analysis of 25mers. The possible 25mers number 4**25 = 2**50 = (2**10)**5 = 1024**5 ≈ 10**15, so a million billion possibilities. It turns out that DNA subsequences do not have approximately equal counts (equal probabilities); instead, DNA is highly structured, with a variety of overlapping encoding schemes, so its subsequences have very unequal statistics. In fact, the vast majority of the possible 25mers will have zero counts (a sequence of length L contains at most L − N + 1 distinct 25mers, far fewer than 10**15), so enumerating all the possibilities ahead of time in an array data structure is not useful, or even possible in some cases. This leads to the use of associative arrays in this context, as shown in the sample code. Do a 25mer analysis on a bacterial genome (obtain one from GenBank, e.g. E. coli). What is the highest-count 25mer subsequence?
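One way to approach the exercise is sketched below in Python. It is not the book's sample code. It is a minimal illustration of the sliding-window count stored in an associative array (here `collections.Counter`, a dictionary subclass), so that only subsequences actually observed take up memory. The function names `count_kmers` and `read_fasta` are hypothetical, and the sketch assumes the genome has been downloaded as a single-record FASTA file; skipping windows that contain ambiguous bases (such as `n`) is also an assumption, not something the exercise specifies.

```python
from collections import Counter

VALID_BASES = frozenset("acgt")


def count_kmers(sequence, n):
    """Count every length-n window of `sequence` in an associative array.

    A sequence of length L has L - n + 1 windows; windows containing
    characters other than a, c, g, t are skipped (an assumption here).
    """
    seq = sequence.lower()
    counts = Counter()
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        if VALID_BASES.issuperset(window):
            counts[window] += 1
    return counts


def read_fasta(path):
    """Return the concatenated sequence lines of a single-record FASTA file.

    Header lines (starting with '>') are dropped. Hypothetical helper:
    it does not handle multi-record files separately.
    """
    parts = []
    with open(path) as handle:
        for line in handle:
            if not line.startswith(">"):
                parts.append(line.strip())
    return "".join(parts)
```

With a genome saved locally (the file name is a placeholder), `count_kmers(read_fasta("genome.fasta"), 25).most_common(1)` would return the highest-count 25mer together with its count. The counter holds at most L − N + 1 keys, so for a bacterial genome of a few million bases it stays far below the roughly 10**15 entries a pre-enumerated array would need. Note that this sketch counts one strand only; whether to also count the reverse complement is a choice the exercise leaves open.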