Informatics and Machine Learning. From Martingales to Metaheuristics
Not surprisingly, longer genes stand out clearly in this process: their anomalous, clearly nonrandom DNA sequence is maintained as such rather than randomized by mutation, since randomization would be selected against in an organism whose survival depends on the gene revealed.
The preceding basic analysis yields a gene‐finder for prokaryotic genomes: a one‐page Python script that performs with 90–99% accuracy, depending on the genome. A second page of Python code that introduces a “filter,” along the lines of the bootstrap learning process mentioned above, leads to an ab initio prokaryotic gene‐predictor with 98.0–99.9% accuracy. Python code to accomplish this is shown in what follows. The process uses only the raw genomic data (with its highly structured intrinsic statistics) and methods for identifying statistical anomalies and informatics structural anomalies: (i) anomalously high mutual information is identified (revealing codon structure); (ii) anomalously high (or low) statistics on an attribute or event are then identified (low stop codon counts, lengthy stop codon voids); (iii) anomalously frequent sub‐sequences (binding site motifs) are then found in the neighborhood of the identified ORFs (used in the filter).
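The first two anomaly measures can be illustrated with a minimal sketch. This is not the book's script. The function names `mutual_information` and `stop_codon_voids`, the `min_codons` threshold, and the restriction to the forward strand with uppercase A/C/G/T input and the standard stop codons TAA, TAG, TGA are all assumptions made here for illustration.

```python
import math
from collections import Counter

# Assumption: the standard stop codons; some genomes use a different code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def mutual_information(seq, gap):
    """Mutual information (in bits) between bases `gap` positions apart."""
    pairs = [(seq[i], seq[i + gap]) for i in range(len(seq) - gap)]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (left[a] * right[b]))
    return mi


def stop_codon_voids(seq, min_codons):
    """Stop-free codon runs of at least `min_codons` codons (forward strand).

    Returns a list of (frame, start, end) with `end` exclusive.
    Hypothetical helper: reverse-strand frames are not scanned.
    """
    voids = []
    for frame in range(3):
        run_start = frame
        last = frame  # index just past the last complete codon seen
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if (i - run_start) // 3 >= min_codons:
                    voids.append((frame, run_start, i))
                run_start = i + 3
            last = i + 3
        # A run that reaches the end of the sequence without a stop codon
        if (last - run_start) // 3 >= min_codons:
            voids.append((frame, run_start, last))
    return voids
```

On a real genome one would compare `mutual_information(genome, gap)` across a range of gaps to see whether a period‐3 pattern appears, and then call `stop_codon_voids(genome, min_codons)` with a threshold chosen from the observed distribution of void lengths. Both the gap range and the threshold are choices the sketch leaves open.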