Informatics and Machine Learning. From Martingales to Metaheuristics
An example of a bootstrap FSA from genomic analysis is to first scan through a genome base‐by‐base and obtain counts on nucleotide pairs at different gap sizes between the two nucleotides [1, 3]. This allows a mutual information analysis of the nucleotide pairs at each gap size (shown in Chapters 3 and 4). For prokaryotic genomes, which are densely packed with genes and mostly protein coding (i.e. there is little “junk” deoxyribonucleic acid (DNA) and there are no introns), the result is a clear signal indicating anomalous statistical linkage between bases three apart [1, 3, 60]. What is discovered thereby is codon structure, where the coding information comes in groups of three bases. Knowing this, a repeated pass (bootstrap) with a frequency analysis of the 64 possible 3‐base groupings can be done, at which point anomalously low counts on the “stop” codons are observed. Once the stop codons are identified, their placement (topology) in the genome can be examined, and it is found that their counts are anomalously low because there are long stretches with no stop codon at all (i.e. there are stop codon “voids,” known as open reading frames, or “ORFs”). The codon void topologies are examined in a comparative genomic analysis in [60] (and shown in ssss1). Stop codons, which would occur about once every 21 codons on average if the DNA sequence were random (3 of the 64 codons are stops), are sometimes not seen for stretches of several hundred codons. In the genomic data this identifies the longer genes, whose anomalous, non‐random DNA sequence is more distinctive the longer the gene‐coding region. This basic analysis provides a gene‐finder for prokaryotic genomes that comprises a one‐page Python script and performs with 90–99% accuracy, depending on the prokaryotic genome (shown in ssss1).
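The first pass described above (pair counts at a given gap, followed by a mutual information estimate) can be sketched in a few lines of Python. This is a minimal illustration and not the book's script: the function name `gap_mutual_information` is hypothetical, the sequence is assumed to be a plain string of base symbols, and the estimate is the simple plug‐in estimate from observed frequencies.

```python
import math
import random
from collections import Counter


def gap_mutual_information(seq, gap):
    """Plug-in mutual information (in bits) between the bases at
    positions i and i + gap, estimated from pair counts over seq."""
    # Count every (base at i, base at i + gap) pair in one pass.
    pairs = Counter(zip(seq, seq[gap:]))
    total = sum(pairs.values())
    if total == 0:
        return 0.0  # gap is at least as long as the sequence

    # Marginal counts for the left and right member of each pair.
    left, right = Counter(), Counter()
    for (a, b), n in pairs.items():
        left[a] += n
        right[b] += n

    # MI = sum over pairs of p(a,b) * log2( p(a,b) / (p(a) * p(b)) ).
    mi = 0.0
    for (a, b), n in pairs.items():
        mi += (n / total) * math.log2(n * total / (left[a] * right[b]))
    return mi


if __name__ == "__main__":
    # Toy input only: uniformly random bases, so every gap should give
    # a value close to zero. A real genome string would be loaded here.
    random.seed(0)
    toy = "".join(random.choice("ACGT") for _ in range(20000))
    for gap in range(1, 10):
        print(gap, round(gap_mutual_information(toy, gap), 5))
```

Running the same loop over a real prokaryotic genome string is where, according to the text, the linkage on bases three apart would stand out. The toy random sequence used here has no such structure and serves only to show the call pattern.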
A second page of Python code introducing a “filter,” along the lines of the bootstrap learning process mentioned above, leads to an ab initio prokaryotic gene‐predictor with 98.0–99.9% accuracy. Python code to accomplish this is shown in ssss1. In this bootstrap acquisition process all that is used is the raw genomic data (with its highly structured intrinsic statistics) and methods for identifying statistical anomalies and informatics structural anomalies: (i) anomalously high mutual information is identified (revealing codon structure); (ii) anomalously high (or low) statistics on an attribute or event are then identified (low stop codon counts, lengthy stop codon voids); (iii) sub‐sequences with anomalously high counts (binding site motifs) are then found in the neighborhood of the identified ORFs (used in the filter).
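Step (ii) of the process above, counting codons and locating stop codon voids, can be sketched as follows. This is an illustrative sketch under stated assumptions and not the gene‐finder referred to in the text: the function names and the `min_codons` threshold are hypothetical, the three stop codons are those of the standard genetic code, and only one reading frame on one strand is scanned (a practical gene‐finder would scan all six frames).

```python
from collections import Counter

# Stop codons of the standard genetic code (assumed here; a few
# organisms use variant codes).
STOP_CODONS = {"TAA", "TAG", "TGA"}

# If codons were uniformly random, a stop would appear on average
# once every 64 / 3 codons, i.e. roughly every 21 codons.
EXPECTED_STOP_SPACING = 64 / len(STOP_CODONS)


def codon_counts(seq, frame=0):
    """Count the non-overlapping 3-base groupings in one reading frame."""
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))


def stop_codon_voids(seq, frame=0, min_codons=100):
    """Return (start, end, n_codons) for each stop-free stretch of at
    least min_codons codons in the given frame; end is exclusive."""
    voids = []
    run_start = frame
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            n = (i - run_start) // 3
            if n >= min_codons:
                voids.append((run_start, i, n))
            run_start = i + 3  # next run begins after the stop codon
    # Handle the stretch after the last stop codon, up to the last
    # complete codon in this frame.
    end = frame + max(0, (len(seq) - frame) // 3) * 3
    n = (end - run_start) // 3
    if n >= min_codons:
        voids.append((run_start, end, n))
    return voids


if __name__ == "__main__":
    toy = "ATG" + "GCT" * 5 + "TAA" + "GCA" * 2 + "TGA" + "GC"
    print(codon_counts(toy).most_common(3))
    print(stop_codon_voids(toy, min_codons=3))
```

The `min_codons` threshold is the knob that separates ordinary gaps between stops from the anomalously long voids. Given the expected spacing of roughly 21 codons for random sequence, a threshold of a hundred or more codons is a natural starting point, though the right value is an assumption to be tuned per genome.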