Informatics and Machine Learning. From Martingales to Metaheuristics
Before moving on to more sophisticated gene structure identification (ssss1), let us first consider the multi‐frame and two‐strand aspect of the genomic information and what this might mean for the “topology,” or overlap placement, of coding regions. To recap, smORF reports on ORFs and tallies information about other such stop‐codon void regions (an ORF is a region void of the three stop codons: TAA, TAG, and TGA). This allows for a more informed selection process when sampling from a genome, such that non‐overlapping gene starts can be cleanly and unambiguously sampled. Furthermore, overlapping ORF coding regions can be identified and enumerated (see ssss1 and ssss1).
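The notion of an ORF as a stop‐codon void region, considered over three frames and two strands, can be illustrated with a minimal Python sketch. This is not the smORF implementation: the function names (`reverse_complement`, `stop_free_regions`, `six_frame_regions`) and the `min_codons` parameter are hypothetical, and coordinates are assumed to be 0‐based, half‐open, and reported relative to the strand being scanned.

```python
# Illustrative sketch only; names and coordinate conventions are assumptions.
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (non-ACGT symbols are left as-is)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def stop_free_regions(seq, min_codons=1):
    """Yield (frame, start, end) for each maximal in-frame run of codons
    that contains no stop codon, for frames 0, 1, 2 of the given strand."""
    seq = seq.upper()
    for frame in range(3):
        region_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                # A stop codon closes the current void region.
                if (pos - region_start) // 3 >= min_codons:
                    yield frame, region_start, pos
                region_start = pos + 3
            pos += 3
        # Region left open at the end of the sequence.
        if (pos - region_start) // 3 >= min_codons:
            yield frame, region_start, pos


def six_frame_regions(seq, min_codons=1):
    """Collect stop-free regions on both strands as (strand, frame, start, end)."""
    regions = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame, start, end in stop_free_regions(s, min_codons):
            regions.append((strand, frame, start, end))
    return regions
```

Raising `min_codons` plays the role of a minimum ORF length cutoff: only the longer void regions survive, which is one simple way to restrict sampling to higher‐confidence candidates. Detecting overlaps between regions in different frames or on opposite strands would require mapping the reverse‐strand coordinates back to the forward strand, which this sketch does not do.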
The goal with smORF was, initially, to identify key gene structures (e.g. stop codons) and to use only the highest‐confidence examples to train profilers. Once this was done, Markov models (MMs) were constructed (by bootstrap) on the suspected start/stop regions and coding/noncoding regions. The algorithm then iterated again, informed by the MM information, with the high‐fidelity sampling restrictions partly relaxed (essentially, the minimum allowed ORF length was made smaller). A crude gene finder was then constructed on the high‐fidelity ORFs using a very simple heuristic: scan from the start of an ORF and stop at the first in‐frame “atg” (to be implemented in ssss1). This analysis was applied to the Vibrio cholerae genome (Chr. I), where 1253 high‐fidelity ORFs were identified out of 2775 known genes. The first‐“atg” heuristic gave a gene prediction accuracy of 1154/1253 (92.1% of predicted gene regions were exactly correct). If small shifts are allowed in the predicted position of the start codon relative to the first “atg” (within 25 bases on either side), prediction accuracy improves to 1250/1253 (99.8%). This reveals a key piece of information needed to improve such a prokaryotic gene finder: something must help identify the correct start codon within a 50‐base window centered on the first ATG. Such information exists in the form of DNA motifs corresponding to the binding footprints of regulatory biomolecules (which play a role in transcriptional or translational control). Further bootstrap refinements along these lines are done in ssss1 to produce an ab initio prokaryotic gene finder with 99.9% or better accuracy.
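The first‐“atg” heuristic, and the scoring with and without a positional tolerance, can be sketched as follows. This is a minimal illustration under stated assumptions, not the book's implementation: the names `first_inframe_atg` and `start_accuracy` are hypothetical, the ORF is assumed to be given as 0‐based, half‐open coordinates on the strand being read, and annotated start positions are assumed to be available for comparison.

```python
# Illustrative sketch only; names and coordinate conventions are assumptions.

def first_inframe_atg(seq, orf_start, orf_end):
    """Scan codon by codon from orf_start and return the position of the
    first in-frame ATG, or None if the ORF contains no ATG in that frame."""
    seq = seq.upper()
    for pos in range(orf_start, orf_end - 2, 3):
        if seq[pos:pos + 3] == "ATG":
            return pos
    return None


def start_accuracy(predicted, annotated, tolerance=0):
    """Fraction of annotated starts whose prediction lies within `tolerance`
    bases; tolerance=0 counts exact matches only."""
    hits = sum(
        1
        for p, a in zip(predicted, annotated)
        if p is not None and abs(p - a) <= tolerance
    )
    return hits / len(annotated)


# Toy example: codons CCC GGG ATG AAA ATG TTT, followed by the stop TAA.
toy = "CCCGGGATGAAAATGTTTTAA"
predicted_start = first_inframe_atg(toy, 0, 18)
```

Calling `start_accuracy` once with `tolerance=0` and once with `tolerance=25` corresponds to the two ways of scoring described above: exact start agreement, and agreement within 25 bases on either side of the first in‐frame ATG.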