Informatics and Machine Learning. From Martingales to Metaheuristics
3.3.1 Ab initio Learning with smORF’s, Holistic Modeling, and Bootstrap Learning
In work on prokaryotic gene prediction (V. cholerae in what follows), a program (smORF) was developed for extended ORF characterization (to characterize “some more ORFs”, using trinucleotide delimiters other than the stop codons). Using that software with a simple start‐of‐coding heuristic, it was possible to establish good gene prediction for ORFs longer than 500 nucleotides. The smORF gene identification was used in a bootstrap gene‐annotation process (one where no initial training data was provided). Part of the smORF functionality is covered by the prog2.py program described thus far. The gene identification was then strengthened by use of a gap‐interpolating Markov model (gIMM, to be described in ssss1). When applied to the identified coding regions (most of the ORFs longer than 500 nucleotides), six gIMMs were used (one for each codon frame, in both the forward and backward read senses). Rejecting coding regions that scored poorly under the gIMMs improved performance, with results slightly better than those of the early Glimmer gene‐prediction software [125], where an interpolating Markov model was used (but not generalized to permit gaps). More recent versions of Glimmer incorporate start‐codon modeling in order to strengthen predictions. One benefit of the gap‐interpolating generalization is that it permits regulatory motifs to be identified, particularly those that share a common positional alignment with the start‐of‐coding region. Using the bootstrap‐identified genes from the smORF‐based gene prediction (including mis‐calls) as a training set permitted an unsupervised search for upstream regulatory structure. The classic Shine‐Dalgarno sequence (the ribosome binding site) was found to be the strongest signal in the 30‐base window upstream of the start codon. Similar results will be found with the full gene‐finder example in ssss1.
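The ORF-scanning step described above can be illustrated with a minimal sketch. It is not the smORF or prog2.py code. The function names, the default delimiter set, and the choice of "first in-frame ATG" as the start-of-coding heuristic are all assumptions made for illustration, since the text does not specify which heuristic was used. The `delimiters` parameter stands in for the idea of characterizing regions bounded by trinucleotides other than stops.

```python
# Hypothetical sketch of ORF scanning with a length cutoff (not the smORF code).
STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement, for scanning the backward read sense."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_length=500, delimiters=STOP_CODONS):
    """Find delimiter-bounded regions in the three forward frames.

    Returns (frame, start, end) tuples, 0-based and half-open, covering the
    bases strictly between two consecutive in-frame delimiter codons.
    Regions not bounded by a delimiter on both sides are ignored.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        region_start = None  # set once the first delimiter in this frame is seen
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in delimiters:
                if region_start is not None and pos - region_start >= min_length:
                    orfs.append((frame, region_start, pos))
                region_start = pos + 3
    return orfs


def first_start_codon(seq, start, end, start_codon="ATG"):
    """Assumed heuristic: position of the first in-frame start codon, or None."""
    seq = seq.upper()
    for pos in range(start, end - 2, 3):
        if seq[pos:pos + 3] == start_codon:
            return pos
    return None


def find_orfs_both_strands(seq, min_length=500):
    """Scan both read senses; '-' coordinates refer to the reverse complement."""
    forward = [("+",) + orf for orf in find_orfs(seq, min_length)]
    backward = [("-",) + orf for orf in find_orfs(reverse_complement(seq), min_length)]
    return forward + backward
```

For example, `find_orfs(genome, min_length=500)` would return the candidate regions to which a start-of-coding heuristic such as `first_start_codon` could then be applied. Passing a different `delimiters` set changes which trinucleotides bound a region.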