
mGene at nGASP

mGene is a gene finding system that we developed for the nGASP competition; it was originally called G3A. Below are a few details on how the gene finder works. A publication is in preparation; see the ISMB poster abstract for some more information.


We tackle the gene prediction problem with a two-layered approach. In the first step (layer 1), state-of-the-art kernel machines are employed to detect signal sequences in genomic DNA. In the second step (layer 2), their outputs are combined by a Hidden Semi Markov SVM learning algorithm [1], [5] to predict whole gene structures. The major algorithms are implemented in the SHOGUN toolbox [6] and are combined and complemented with Matlab scripts.


Layer 1

Signal Sensors: We employ support vector machines (SVMs) as independent detectors of signal sites, i.e. transition sites between segments. Sensors for the following signals are incorporated:

  • transcription starts (promoter)
  • trans splice sites (start of genes in operons and trans-spliced genes)
  • translation initiation sites (around start codon, ATG)
  • splice donor sites (around GT or GC)
  • splice acceptor sites (around AG)
  • translation termination sites (around stop codon: TAA, TAG, and TGA)
  • polyadenylation signals (around AATAAA and similar 6-mers)
  • cleavage sites (i.e., transcript end) downstream of polyA

All signal sensor SVMs use string kernels such as the Weighted Degree kernel [2], which operates directly on DNA sequences; some additionally use Spectrum kernels [3].
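As an illustration of the idea behind the Weighted Degree kernel, here is a minimal sketch: it sums weighted counts of substrings that match at the same position in both sequences. The weighting scheme below follows the standard formulation, but details such as normalization are omitted; this is not the SHOGUN implementation.

```python
def weighted_degree_kernel(x, y, max_degree=3):
    """Toy Weighted Degree kernel: sum weighted counts of substrings of
    length 1..max_degree that match at the same position in x and y.
    beta_d = 2*(K - d + 1) / (K*(K + 1)) weights shorter matches higher."""
    assert len(x) == len(y), "WD kernel compares equal-length sequences"
    value = 0.0
    for d in range(1, max_degree + 1):
        beta = 2.0 * (max_degree - d + 1) / (max_degree * (max_degree + 1))
        for i in range(len(x) - d + 1):
            if x[i:i + d] == y[i:i + d]:
                value += beta
    return value
```

Identical sequences obtain the maximal kernel value; a single mismatch removes all substring matches overlapping that position.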

Content sensors: We also use SVMs with spectrum kernels as content sensors to distinguish different segments by their oligo-nucleotide composition (we consider 3- to 6-mers). There are content sensors for

  • intergenic segments,
  • intercistronic segments (between genes in operons),
  • UTR segments,
  • introns, and
  • coding exons.

Additionally, we train a sensor that discriminates in-frame coding 3-mers and 6-mers from shifted (out-of-frame) subsequences.
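The Spectrum kernel underlying the content sensors can be sketched as follows: each sequence is represented by its k-mer counts, and the kernel is the dot product of these count vectors, summed here over k = 3..6 as used above. This is an illustrative re-implementation, not the SHOGUN code.

```python
from collections import Counter

def spectrum_features(seq, k):
    """Feature map of the Spectrum kernel: counts of all k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, kmin=3, kmax=6):
    """Sum of k-mer count dot products for k = kmin..kmax, mirroring
    the 3- to 6-mer oligo-nucleotide composition described above."""
    total = 0
    for k in range(kmin, kmax + 1):
        fx, fy = spectrum_features(x, k), spectrum_features(y, k)
        total += sum(fx[m] * fy[m] for m in fx)
    return total
```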


The states (nodes) of our graphical model roughly correspond to the signals just described (for some signals several states exist, e.g. acceptor (ACC) and donor (DON) states to model coding exons in different phases). Transitions (edges) correspond to segments starting and ending with the corresponding signals; e.g., exons start at an ACC state and end in a DON state. The set of transitions captures the valid gene structures (valid paths through the model); the polyA and trans states are optional and can be bypassed.
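The transition structure can be sketched as a small state graph. The states and allowed transitions below are a simplified, hypothetical subset (e.g., the phase-specific ACC/DON copies and the reverse strand are omitted):

```python
# Simplified, illustrative state graph for one forward-strand gene.
TRANSITIONS = {
    "intergenic": {"tss"},
    "tss": {"trans", "tis"},     # trans-splice site is optional
    "trans": {"tis"},
    "tis": {"don", "stop"},      # single- or multi-exon coding region
    "don": {"acc"},              # intron: donor -> acceptor
    "acc": {"don", "stop"},
    "stop": {"polya", "cleave"}, # polyA signal is optional
    "polya": {"cleave"},
    "cleave": {"intergenic"},
}

def is_valid_path(states):
    """Check that a sequence of states is a valid path through the model."""
    return all(b in TRANSITIONS.get(a, set())
               for a, b in zip(states, states[1:]))
```

For example, bypassing the optional polyA state (stop directly to cleavage) is a valid path, while jumping from a translation start directly to an acceptor is not.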

Layer 2

In the second step, the outputs of all layer 1 sensors are combined in order to predict gene structures (segmentations). To this end, we extended the Hidden Semi Markov SVM framework [1], [5].

Our algorithm learns transformations (piecewise linear functions), which can be seen as a weighting of the contributions of all layer 1 outputs as well as segment length contributions in order to obtain a global score.
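Such a piecewise linear transformation can be sketched as follows. The knot placement and the constant continuation outside the knot range are illustrative assumptions; the learned parameters are the function values at the knots.

```python
def plif(x, knots, values):
    """Piecewise linear function: interpolate linearly between fixed,
    increasing knot positions; continue as a constant outside the range.
    The values at the knots are the parameters learned in layer 2."""
    if x <= knots[0]:
        return values[0]
    if x >= knots[-1]:
        return values[-1]
    for k0, k1, v0, v1 in zip(knots, knots[1:], values, values[1:]):
        if k0 <= x <= k1:
            t = (x - k0) / (k1 - k0)
            return (1 - t) * v0 + t * v1
```

Because the output is linear in the knot values, a global score built from such transformations remains linear in the parameters, which keeps the training problem tractable.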

The learning algorithm follows the large margin paradigm, maximizing the difference between the score of the true segmentation and any other (wrong) segmentation.
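The margin condition can be sketched with a structured hinge loss: the true segmentation must outscore every competing segmentation by a margin. The fixed margin below is a simplifying assumption; in structured-output training the margin typically scales with how different the wrong segmentation is from the true one.

```python
def structured_hinge(true_score, wrong_scores, margin=1.0):
    """Structured hinge loss for one example: positive when the true
    segmentation fails to beat the best-scoring wrong segmentation
    by the required margin; zero when the margin constraint holds."""
    worst = max(wrong_scores)
    return max(0.0, margin - (true_score - worst))
```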

This approach led to the predictions V0 and V1, where V0 is a preliminary version of V1.

Alternative splicing events

Alternative splicing events are predicted following [4]. For each class of events (exon skipping, intron retention, alternative 3' and 5' splice sites), the top 1% are included in the gene predictions.


Conservation information

Conservation information is incorporated in layer 1 as additional features for some signal sensor SVMs (TIS, STOP) and for all content sensor SVMs. Each genome position is assigned a conservation score according to the aligned nucleotides. Conservation scores are mapped to a discrete alphabet and additional string kernels are trained on these conservation sequences.
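The mapping of conservation scores to a discrete alphabet can be sketched as simple binning. The uniform bin boundaries, the number of bins, and the letters used are illustrative assumptions; the source does not specify them.

```python
def discretize_conservation(scores, n_bins=4):
    """Map per-position conservation scores in [0, 1] to a small discrete
    alphabet, producing a 'conservation sequence' on which string kernels
    can be trained. Uniform bins; boundaries are illustrative only."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"[:n_bins]
    out = []
    for s in scores:
        b = min(int(s * n_bins), n_bins - 1)  # clamp s == 1.0 into last bin
        out.append(alphabet[b])
    return "".join(out)
```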

This approach led to a separate set of gene predictions.


EST information

ESTs are aligned to genomic regions using BLAT [7]. Output scores of layer 1 splice site sensors are set to the maximum value where there is an EST confirming that splice site. In contrast, scores of splice sites that disagree with EST alignments are set to the minimal value. Afterwards, gene predictions with the best category I model are re-computed without any adjustment of layer 2 parameters.
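The EST-based adjustment can be sketched as clamping the layer 1 splice site scores. The score range and the dictionary-based representation are assumptions made for illustration; only the clamping rule comes from the text above.

```python
def adjust_splice_scores(scores, est_confirmed, est_contradicted,
                         smin=-5.0, smax=5.0):
    """Clamp layer-1 splice-site scores with EST evidence: EST-confirmed
    positions get the maximal score, EST-contradicted positions the
    minimal score; all other positions keep their SVM output.
    The score range [smin, smax] is an illustrative assumption."""
    out = dict(scores)  # position -> SVM output score
    for pos in est_confirmed:
        out[pos] = smax
    for pos in est_contradicted:
        out[pos] = smin
    return out
```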

This approach led to two sets of gene predictions, where the second version is made more sensitive (it finds several hundred more genes). A further version additionally includes the alt-splice predictions.


For comments, problems, questions etc., feel free to contact Gunnar Rätsch. The following people have contributed (the first four equally):


[1] Gunnar Rätsch and Sören Sonnenburg. Large Scale Hidden Semi-Markov SVMs. In Advances in Neural Information Processing Systems 19, Cambridge, MA, 2006. MIT Press.
[2] Sören Sonnenburg, Gunnar Rätsch, and Bernhard Schölkopf. Large Scale Genomic Sequence SVM Classifiers. In Proceedings of the 22nd International Conference on Machine Learning. ACM Press, 2005.
[3] Sören Sonnenburg, Alexander Zien, and Gunnar Rätsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472-e480, 2006.
[4] Gunnar Rätsch, Sören Sonnenburg, and Bernhard Schölkopf. RASE: Recognition of Alternatively Spliced Exons in C. elegans. Bioinformatics, Proc. ISMB, 2005.
[5] Gunnar Rätsch, Sören Sonnenburg, Jagan Srinivasan, Hanh Witte, Klaus-Robert Müller, Ralf J. Sommer, and Bernhard Schölkopf. Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning. PLoS Computational Biology, 2006 (in press).
[6] Sören Sonnenburg, Gunnar Rätsch, and Konrad Rieck. Large Scale Learning with String Kernels. In: Bottou L, Chapelle O, DeCoste D, Weston J, editors, Large Scale Kernel Machines. MIT Press, 2007, pp. 73-104 (in press).
[7] W. James Kent. BLAT -- The BLAST-Like Alignment Tool. Genome Research, 12:656-664, 2002.