Читать книгу Informatics and Machine Learning. From Martingales to Metaheuristics онлайн
53 страница из 101
The E. coli genome file is shown only for the first part (it is 4.6 Mb) in ssss1, where the key feature of the FASTA file is apparent on line 1, where a “>” symbol should be present, indicating a label (or comment – information that will almost always be present):
Python has a powerful regular expression module (named “re” and imported in the first code sample of prog1.py). Regular expression processing of strings of characters is a mini programming language in its own right, thus a complete list of the functionalities of the re module will not be given here. Focusing on the “findall” function, it does as its name suggests, it finds all entries matching the search string characters in an array in the order they are encountered in the string. We begin with the string comprising the data file read in the previous example. We now traverse the string searching for characters matching those specified in the pattern field. The array of a,c,g,t's that results has conveniently stripped any numbers, spaces, or line returns in this process, and a straightforward count can be done: