Informatics and Machine Learning. From Martingales to Metaheuristics
ORFs are “open reading frames”: stretches of the genome in which no stop codon is encountered when the sequence is traversed with a particular codon framing. In other words, an ORF is a region devoid of stop codons with respect to its own reading frame. In most of the analysis, “ORF” refers to ORFs of length 300 bases or greater. The restriction to longer ORFs is made because their occurrence is highly anomalous and they are likely to have a biological encoding origin (see ssss1); that is, a long ORF gives a strong indication of containing the coding region of a gene. Restricting to transcripts with ORFs of at least 300 bases therefore yields a pool of transcripts that are mostly true coding transcripts.
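The ORF definition above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the function name `find_orfs`, the half-open coordinate convention, and the choice to scan only the forward strand are assumptions made here, and the stop set is the standard genetic code written in the DNA alphabet.

```python
# Stop codons of the standard genetic code (DNA alphabet); an assumption of this sketch.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=300):
    """Return (frame, start, end) for each maximal stop-free stretch.

    Coordinates are 0-based and half-open; `end` excludes the stop codon.
    Only the forward strand is scanned, in each of its three framings.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        run_start = frame
        # Step codon by codon in this framing.
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if i - run_start >= min_len:
                    orfs.append((frame, run_start, i))
                run_start = i + 3  # next stretch begins after the stop
        # Close the stretch that runs to the last complete codon.
        end = frame + 3 * ((len(seq) - frame) // 3)
        if end - run_start >= min_len:
            orfs.append((frame, run_start, end))
    return orfs
```

With the default `min_len=300`, only the long stop-free stretches described in the text are reported; a smaller threshold is useful for trying the function on toy sequences.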
The above example shows a bootstrap finite state automaton (FSA) process on genomic data. First, scan through the genomic data base‐by‐base and obtain counts on nucleotide pairs at different gap sizes between the two nucleotides [1, 3]. These counts allow a mutual information analysis of the nucleotide pairs at each gap size. What is found for prokaryotic genomes (with their highly dense gene placement) is a clear signal indicating anomalous statistical linkage between bases three apart [1, 3]. What is thereby discovered is codon structure: the coding information comes in groups of three bases. Knowing this, a bootstrap analysis of the 64 possible 3‐base groupings can be done, at which point anomalously low counts on the “stop” codons are observed. Once the stop codons are identified, their placement (topology) in the genome can be examined, and it is found that their counts are anomalously low because there are long stretches with no stop codon (i.e. stop codon “voids,” known as “ORFs”). The codon void topologies are examined in a comparative genomic analysis in [1, 3]. As noted previously, stop codons, which should occur about once every 21 codons on average if DNA sequence data were random (3 of the 64 codons are stops, and 64/3 ≈ 21), are sometimes not seen for stretches of several hundred codons (see ssss1).
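The first step of this bootstrap, counting nucleotide pairs at a given gap and computing their mutual information, might look like the following. It is a sketch under stated assumptions: the function name `gap_mutual_information` is hypothetical, probabilities are plain frequency estimates with no small-sample correction, and the result is in bits.

```python
from collections import Counter
from math import log2

def gap_mutual_information(seq, gap):
    """Mutual information (bits) between the bases at positions i and i + gap."""
    # Joint counts of (base at i, base at i + gap) over the whole sequence.
    pairs = Counter((seq[i], seq[i + gap]) for i in range(len(seq) - gap))
    total = sum(pairs.values())
    if total == 0:
        return 0.0  # sequence shorter than the gap

    # Marginal counts for the first and second member of each pair.
    left, right = Counter(), Counter()
    for (a, b), n in pairs.items():
        left[a] += n
        right[b] += n

    # MI = sum over pairs of p(a,b) * log2( p(a,b) / (p(a) * p(b)) ).
    mi = 0.0
    for (a, b), n in pairs.items():
        mi += (n / total) * log2(n * total / (left[a] * right[b]))
    return mi
```

Evaluating this for a range of gaps, for example `[gap_mutual_information(genome, g) for g in range(1, 13)]` with `genome` a hypothetical sequence string, gives the profile in which the text reports an anomalous linkage at a gap of three for prokaryotic genomes.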