GeneTack: tools for frameshift prediction


Tools for frameshift prediction

The GeneTack server contains a number of tools for frameshift identification in nucleotide sequences. There are four main programs -- GeneTack-GM, GeneTack-Prok, GeneTack-Euk and MetaGeneTack.

GeneTack-GM combines the frameshift prediction program GeneTack with the self-training gene prediction program GeneMarkS. It can be used to predict frameshifts in long prokaryotic sequences (longer than 300 kb). The model parameters are generated automatically by GeneMarkS. GeneTack-GM also includes a number of filters that remove false positive predictions.

GeneTack-Prok and GeneTack-Euk can be used to analyze shorter prokaryotic and eukaryotic sequences whose length is insufficient for self-training. Eukaryotic sequences must be intronless; for example, mRNAs or ESTs can be used. Both programs provide a number of pre-built species-specific models, and the user should choose the one that corresponds to the input sequence. No filters are applied to the frameshifts predicted by these two programs.

MetaGeneTack is an ab initio frameshift finder designed for metagenomic data. It uses a heuristic method to infer model parameters for short sequences. MetaGeneTack also incorporates gene prediction results from MetaGeneMark, so it can provide a full list of genes, with or without frameshifts, for metagenomic sequence annotation.

Prokaryotic frameshift database

GeneTack was applied to 1,106 prokaryotic genome sequences. Overall, 206,991 frameshifts were predicted. Since GeneTack, as assessed earlier, achieves 85.8% sensitivity and 68.2% specificity in frameshift detection, we expect almost one third of the predictions to be false positives.
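
The "almost one third" estimate follows from simple arithmetic if "specificity" is read as the fraction of predicted frameshifts that are real. That reading is an assumption; the text does not define the term. A short Python check under that assumption:

```python
# Figures quoted in the text above.
total_predictions = 206_991
specificity = 0.682  # ASSUMED to mean the fraction of predictions that are real

# Under that reading, the rest of the predictions are false positives.
false_positive_fraction = 1 - specificity
expected_false_positives = round(total_predictions * false_positive_fraction)

print(f"false positive fraction: {false_positive_fraction:.3f}")
print(f"expected false positives: {expected_false_positives}")
```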

Our goals were (i) to filter out false positive predictions and (ii) to determine the nature of the true positive predictions, i.e. to classify each one as either caused by a random sequencing error or representing a true sequence feature that could inactivate a gene (pseudogene) or be involved in gene regulation (programmed frameshift).

All genes containing predicted frameshifts (fs-genes) were conceptually translated into proteins (fs-proteins), and two types of validation were performed. First, the fs-proteins were used as queries in a BLASTp search against the NCBI nr database to detect homologous proteins that combine translations of both sides of the broken frame. Second, to exclude cases of gene fusion, we located Pfam domains in the fs-proteins and looked for cases where a domain spans the junction of the two ORFs. The fs-proteins validated by both BLASTp and Pfam are likely to be GeneTack true positive predictions.
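
The Pfam criterion can be illustrated with a simple interval test. The sketch below is not the pipeline's actual code. The coordinate convention (1-based residue positions, with `junction` being the last residue translated from the upstream ORF) and the `min_flank` parameter are assumptions made for illustration.

```python
def domain_spans_junction(domain_start, domain_end, junction, min_flank=1):
    """Return True if a domain covers both sides of the frameshift junction.

    Coordinates are 1-based, inclusive residue positions in the fs-protein.
    `junction` is the last residue encoded by the upstream ORF (assumed
    convention). `min_flank` is a hypothetical threshold: the minimum number
    of domain residues required on each side of the junction.
    """
    upstream_residues = junction - domain_start + 1   # domain residues at or before the junction
    downstream_residues = domain_end - junction       # domain residues after the junction
    return upstream_residues >= min_flank and downstream_residues >= min_flank


# A domain at residues 40-120 crosses a junction at residue 80;
# a domain at residues 90-150 lies entirely downstream of it.
print(domain_spans_junction(40, 120, 80))
print(domain_spans_junction(90, 150, 80))
```

A larger `min_flank` would demand that the domain extend well into both ORFs, which is a stricter guard against a domain that merely touches the junction.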

Next, a database of all fs-proteins was built and an "all-against-all" BLASTp search was run to group homologous frameshift events together; as a result, 102,731 fs-proteins (about 50%) were grouped into 19,666 clusters. The 104,260 frameshifts that did not form clusters could be false positive artefacts, or they could correspond to authentic indel mutations or sequencing errors. Notably, 6,042 fs-proteins were validated by both BLASTp and Pfam and are likely to be true positive predictions. To reveal the true nature of these frameshifts, the corresponding genome regions should be resequenced.
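
The text does not say how BLASTp hits were turned into clusters. One common choice is single-linkage clustering, where any two fs-proteins connected by a chain of hits end up in the same cluster. The sketch below shows that approach with a union-find structure. It is an assumed stand-in, not the method used to build the database, and the function and argument names are hypothetical.

```python
from collections import defaultdict


def cluster_hits(protein_ids, hit_pairs):
    """Group proteins into connected components of the hit graph.

    protein_ids: iterable of all fs-protein identifiers.
    hit_pairs:   iterable of (query_id, subject_id) pairs that passed some
                 significance filter (the filter itself is not modelled here).
    Returns (clusters, singletons): clusters have two or more members.
    """
    protein_ids = list(protein_ids)
    parent = {p: p for p in protein_ids}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hit_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # self-hits and repeated pairs are no-ops
            parent[root_b] = root_a

    groups = defaultdict(list)
    for p in protein_ids:
        groups[find(p)].append(p)

    clusters = [sorted(g) for g in groups.values() if len(g) > 1]
    singletons = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, singletons


clusters, singletons = cluster_hits(
    ["p1", "p2", "p3", "p4", "p5"],
    [("p1", "p2"), ("p2", "p3"), ("p4", "p4")],
)
print(clusters)
print(singletons)
```

Single linkage is permissive: one spurious hit can merge two unrelated groups, so a real pipeline would likely apply stricter hit filters before clustering.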

Clustered fs-genes show conservation in homologous genes and correspond to a conserved frameshift mutation in the lineage. Still, false positive frameshift predictions related to adjacent/overlapping gene pairs with conserved co-location could also form clusters.

The presence of a specific conserved motif close to the frameshift site is an important feature of programmed frameshift clusters; it distinguishes them from clusters of fs-proteins derived from pseudogenes and from clusters of false positive fs-proteins. Not only the motif itself but also its phasing with respect to the reading frame is crucial for a programmed frameshift to function properly. The phasing was taken into account in a new algorithm that is better suited to identifying programmed frameshift motifs than standard motif-searching algorithms such as MEME. Using this algorithm, we found that the most common programmed frameshift motifs are TA_AAA_A, A_AAA_AA, AAA_AAA_, TTA_AAA_ and CTT_TGA (underscores indicate the frame of the upstream ORF). These motifs were previously reported to cause frameshifting. In total, 239 clusters of programmed frameshifts (containing 5,632 fs-genes) were identified. We plan to incorporate the predicted programmed frameshifts into the Recode database.
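
To make the role of phasing concrete, the sketch below looks for a known motif written in the underscore notation and keeps only occurrences whose underscores fall on codon boundaries of the upstream ORF. It illustrates the phasing constraint only. It is not the motif discovery algorithm mentioned above, and the function names and the `orf_start` parameter are hypothetical.

```python
def parse_motif(motif):
    """Split e.g. 'A_AAA_AA' into its letters and the offset of the first
    codon boundary (the number of letters before the first underscore)."""
    letters = motif.replace("_", "")
    first_boundary = motif.index("_")
    return letters, first_boundary


def find_phased_motif(seq, motif, orf_start=0):
    """Return 0-based positions where `motif` occurs in `seq` in phase.

    `orf_start` is the 0-based index of the first base of any codon of the
    upstream ORF. A hit is kept only if the motif's first underscore lands
    on a codon boundary of that frame.
    """
    letters, first_boundary = parse_motif(motif)
    hits = []
    pos = seq.find(letters)
    while pos != -1:
        if (pos + first_boundary - orf_start) % 3 == 0:
            hits.append(pos)
        pos = seq.find(letters, pos + 1)  # step by one so overlapping matches are seen
    return hits


# Codons in frame 0: ATG GCA AAA AAG CTT
seq = "ATGGCAAAAAAGCTT"
print(find_phased_motif(seq, "A_AAA_AA"))               # in phase with frame 0
print(find_phased_motif(seq, "AAA_AAA_"))               # same letters, wrong phase
print(find_phased_motif(seq, "A_AAA_AA", orf_start=1))  # wrong frame
```

The same run of six A's matches or fails depending only on where the codon boundaries fall, which is the distinction a phase-unaware motif search cannot make.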

Another 4,010 (2,810 + 1,200) clusters are likely to be clusters of pseudogenes with indel mutations. In total, these clusters contain 13,812 fs-genes, of which only 5,484 are annotated as pseudogenes in GenBank.

We did not specify the nature of many predicted frameshifts because we did not want GeneTack false positives to appear in our final results. We prefer to leave a frameshift unclassified if we are not sure whether it is a true positive prediction.

Eukaryotic frameshift database

Eukaryotic genes with frameshifts were identified in mature mRNA sequences. Several HMM models were generated for each eukaryotic genus. Each model was generated by a self-training algorithm, a version of GeneMarkS, from a set of mRNAs with similar GC content.

Currently the database contains fs-genes from 100 eukaryotic species.

HMM file format specification

A file with the ".hmm_def" extension consists of sections that contain definitions or
other nested sections.
Sections have an XML-like format: there is an opening tag <SECTION_NAME> at the beginning of
each section and there must be a closing tag </SECTION_NAME> at the end of the section.
Definitions have the format <name> = <value>.

A line whose first non-space character is '#' is treated as a comment line and is
ignored. Empty lines and lines containing only spaces are also ignored.
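
The lexical rules above are enough to sketch a small reader for this format. The Python
below is an illustration written from this description only; it is not the parser used by
GeneTack. The function name, the dictionary layout and the handling of table rows (any
line without an '=' sign is split on whitespace) are assumptions.

```python
import re

OPEN_TAG = re.compile(r"^<([A-Za-z_][A-Za-z0-9_]*)>$")
CLOSE_TAG = re.compile(r"^</([A-Za-z_][A-Za-z0-9_]*)>$")


def new_section(name):
    # Definitions are kept as an ordered list because order matters in this format.
    return {"name": name, "definitions": [], "rows": [], "sections": []}


def parse_hmm_def(text):
    """Parse .hmm_def-style text into nested dictionaries (illustrative sketch)."""
    root = new_section(None)
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines are ignored
        close = CLOSE_TAG.match(line)
        if close:
            if len(stack) == 1 or stack[-1]["name"] != close.group(1):
                raise ValueError(f"unexpected closing tag: {line}")
            stack.pop()
            continue
        opened = OPEN_TAG.match(line)
        if opened:
            section = new_section(opened.group(1))
            stack[-1]["sections"].append(section)
            stack.append(section)
            continue
        if "=" in line:
            # Only the first '=' separates name from value.
            name, _, value = line.partition("=")
            stack[-1]["definitions"].append((name.strip(), value.strip()))
        else:
            # ASSUMPTION: lines without '=' are table rows (e.g. STRING FREQUENCY).
            stack[-1]["rows"].append(line.split())
    if len(stack) != 1:
        raise ValueError(f"section <{stack[-1]['name']}> is never closed")
    return root


# Made-up fragment; the values are not taken from a real model file.
example = """
# comment lines are skipped
<HMM_MODEL>
    name = demo model
</HMM_MODEL>
"""
model = parse_hmm_def(example)
print(model["sections"][0]["definitions"])
```

Keeping definitions in a list of pairs rather than a dictionary preserves their order in the
file, which the format treats as significant.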

The file consists of 5 main sections:

 1. FILE_INFO       -- contains technical information about the file.
                       DO NOT MODIFY DATA IN THIS SECTION!

 2. HMM_MODEL       -- here you can specify name and description of your model.
                       This information will appear in the output file so you will
                       know which HMM you used.

 3. EMISSION_LIST   -- list of emissions that is used in the HMM. 
                       Each ITEM of the list contains the following elements:
                           * id            -- unique number (among all id-s in this section) of
                                              the item that will be used to identify it below
                           * name          -- name of this ITEM
                           * PROBABILITIES -- two column table: STRING <spaces> FREQUENCY

 4. STATE_LIST      -- list of all states that the HMM consists of.
                       Each ITEM of the list contains the following elements:
                           * id           -- unique number (among all id-s in this section) of
                                             the item that will be used to identify it below
                           * name         -- name of this ITEM
                           * periodicity  -- state periodicity - positive integer
                           * emission_set -- comma-delimited list of set ids; the number of
                                             ids here MUST be equal to the periodicity value above

 5. TRANSITION_LIST -- list of all transitions between states in the HMM
                       Each ITEM of the list contains the following elements:
                           * from_state              -- from state ID
                           * to_state                -- to state ID
                           * probability             -- probability of the transition
                           * EMISSION_STR_EXCEPTIONS -- [optional] exceptions for transition
                                                        probability for particular emission strings.
                                                        Format: STRING <spaces> PROBABILITY

Format limitations
 * You can NOT change the order of sections and definitions anywhere in the file
 * All sections and definitions are REQUIRED. You can't delete any section or 
   definition in it.
 * You can NOT write comments on the definition line. So the following 
   definition is WRONG:

     version = 0.01   # File version

   The comment will be interpreted as part of <value>, which will probably
   cause the program to crash.

 * First '=' sign in each definition is interpreted as delimiter between
   <name> and <value>. Because of this <name> must NOT contain '=' sign,
   but <value> can contain any number of '=' signs.

 * All ids are consecutive non-negative integers starting with 0
 * All emission probabilities MUST have the same order, i.e. the same number of rows
   in each PROBABILITIES section

 * Probabilities in the PROBABILITIES section are observed probabilities, not
   conditional probabilities. Sum of all probabilities in each section must be equal to 1.

 * Sum of all outgoing transition probabilities for each state must be equal to 1
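
Several of the limitations above are numeric and can be checked before a model file is
used. The helpers below are a minimal sketch of such checks, written from this description
only. They operate on plain Python values that are assumed to have been read from the file
already; the function names and the tolerance `tol` are hypothetical.

```python
import math


def ids_are_consecutive(ids):
    """True if ids are 0, 1, 2, ... in order of appearance.

    Requiring the order of appearance is one reading of the rule (assumption).
    """
    return list(ids) == list(range(len(ids)))


def sums_to_one(probabilities, tol=1e-6):
    """True if the values sum to 1 within an assumed tolerance."""
    return math.isclose(sum(probabilities), 1.0, abs_tol=tol)


def emission_set_matches_periodicity(periodicity, emission_set):
    """True if the comma-delimited emission_set holds `periodicity` ids."""
    ids = [part for part in emission_set.split(",") if part.strip()]
    return periodicity > 0 and len(ids) == periodicity


def outgoing_sums(transitions):
    """Sum transition probabilities per source state.

    transitions: iterable of (from_state, to_state, probability) tuples.
    """
    totals = {}
    for from_state, _to_state, probability in transitions:
        totals[from_state] = totals.get(from_state, 0.0) + probability
    return totals


# Made-up two-state example.
transitions = [(0, 0, 0.9), (0, 1, 0.1), (1, 0, 1.0)]
for state, total in outgoing_sums(transitions).items():
    print(state, sums_to_one([total]))
```

Comparing with a tolerance rather than exact equality avoids rejecting files whose
probabilities were rounded when written.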