(Figure source: Based on Winters‐Hilt [1–3].)
In adopting any model with "more parameters," such as an HMMBD over an HMM, there is a potential problem of having sufficient data to support the additional modeling. This is generally not an issue for HMM applications that already require thousands of samples of non‐self transitions for sensor modeling, such as the gene finding described in what follows: knowing the boundary positions allows the regions of self‐transitions (the durations) to be extracted in similar numbers, which is typically sufficient for effective modeling of the duration distributions in an HMMD.
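To make the duration‐extraction idea concrete, the following is a minimal sketch (not the book's implementation) of how per‐state duration distributions can be read off from a boundary‐annotated state sequence: each run of identical labels is one region of self‐transitions, and its length is one duration sample. The function name and labels are illustrative only.

```python
from collections import defaultdict

def empirical_duration_distributions(state_sequence):
    """Estimate per-state duration (run-length) distributions from a
    boundary-annotated state sequence; boundaries are implicit wherever
    the label changes. Returns {state: {duration: probability}}."""
    run_lengths = defaultdict(list)
    # Scan the labeled sequence and record the length of each run of
    # identical states (i.e. each region of self-transitions).
    current_state, run = state_sequence[0], 1
    for s in state_sequence[1:]:
        if s == current_state:
            run += 1
        else:
            run_lengths[current_state].append(run)
            current_state, run = s, 1
    run_lengths[current_state].append(run)

    # Normalize run-length counts into empirical duration distributions.
    distributions = {}
    for state, runs in run_lengths.items():
        counts = defaultdict(int)
        for r in runs:
            counts[r] += 1
        total = sum(counts.values())
        distributions[state] = {d: c / total for d, c in counts.items()}
    return distributions

# Example: exon (E) / intron (I) labels per base position.
labels = ["E", "E", "E", "I", "I", "I", "I", "E", "E"]
print(empirical_duration_distributions(labels))
```

With thousands of annotated boundaries, each state accumulates a comparable number of duration samples, which is what makes the HMMD's duration modeling well supported by the same data used for the transition sensors.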
Improvement to overall HMM application rests not only on the aforementioned improvements to the HMM/HMMBD, but also on improvements to the hidden‐state model and the emission model. Standard HMMs are of low Markov order in their transitions (first order) and emissions (zeroth order), and their transitions are decoupled from their emissions, which can miss critical structure in the model, such as state‐transition probabilities that are sequence dependent. This weakness is eliminated by generalizing to the largest state‐emission clique possible, fully interpolated on the data set, as is done with the generalized‐clique HMM, where gene finding is performed on the Caenorhabditis elegans genome. The clique generalization improves the modeling of the critical signal information at the transitions between exon regions and noncoding regions, e.g. intron and junk regions. In doing this we arrive at an HMM structure‐identification platform that is novel and performs robustly in a number of ways.
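The sketch below illustrates only the interpolation ingredient of such a generalization: emission statistics are gathered at several Markov orders and blended, so contexts with sparse counts back off toward lower‐order estimates. It is a simplified stand‐in for the fully interpolated clique model, with illustrative function names and weights, not the generalized‐clique HMM itself.

```python
from collections import defaultdict

def train_emission_counts(sequences, max_order=2):
    """Count (context, next-base) statistics for every order 0..max_order."""
    counts = [defaultdict(int) for _ in range(max_order + 1)]
    totals = [defaultdict(int) for _ in range(max_order + 1)]
    for seq in sequences:
        for i, base in enumerate(seq):
            for k in range(max_order + 1):
                if i >= k:
                    ctx = seq[i - k:i]          # length-k preceding context
                    counts[k][(ctx, base)] += 1
                    totals[k][ctx] += 1
    return counts, totals

def interpolated_emission_prob(counts, totals, context, base, lambdas):
    """Blend emission estimates across Markov orders; orders whose context
    was never observed contribute nothing, giving graceful back-off."""
    prob, weight_sum = 0.0, 0.0
    for k, lam in enumerate(lambdas):
        ctx = context[len(context) - k:] if k else ""
        if totals[k].get(ctx, 0) > 0:
            prob += lam * counts[k].get((ctx, base), 0) / totals[k][ctx]
            weight_sum += lam
    return prob / weight_sum if weight_sum else 0.25  # uniform over ACGT

# Example: estimate P(base | preceding context) from coding-region training data.
coding = ["ATGGCGT", "ATGAAAG", "ATGGGTC"]
c, t = train_emission_counts(coding, max_order=2)
print(interpolated_emission_prob(c, t, "TG", "G", lambdas=[0.2, 0.3, 0.5]))
```

In the generalized‐clique setting the same blending idea is applied to the joint state‐emission clique rather than to emissions alone, which is how sequence‐dependent transition structure at exon/noncoding boundaries is captured.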