Methods
We tackle the gene prediction problem using a two-layered
approach. In a first step, state-of-the-art kernel machines
are employed to detect signal sequences in genomic DNA (such as splice sites or transcription start sites) and to discriminate the content of different DNA sequences (such as coding exons, introns, etc.). In a second
step, their outputs are combined to predict whole gene
structures. For this step we use a discriminative training approach
based on hidden semi-Markov support vector machines.
Signal sensors: We employ support vector machines (SVMs) as independent
detectors of signal sites, i.e. transition sites between segments. These signal detectors cover transcription start sites, translation initiation sites, donor and acceptor splice sites, translation termination and cleavage sites, and optionally trans-splice sites and polyadenylation signals. All signal sensor SVMs use string kernels such as the Weighted Degree
kernel, which operates directly on DNA sequences as input; some
additionally use Spectrum kernels.
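To illustrate, a minimal sketch of a Weighted Degree kernel between two equal-length DNA sequences; the function name and the particular degree weighting are our own assumptions, not the authors' implementation:

```python
def weighted_degree_kernel(x, y, max_degree=3):
    """Weighted Degree kernel sketch: sums weights for substrings of
    length 1..max_degree that match at the same position in x and y."""
    assert len(x) == len(y)
    L, D = len(x), max_degree
    score = 0.0
    for d in range(1, D + 1):
        # A commonly used weighting that emphasizes shorter matches.
        beta = 2.0 * (D - d + 1) / (D * (D + 1))
        for i in range(L - d + 1):
            if x[i:i + d] == y[i:i + d]:
                score += beta
    return score
```

Because matches are counted per position, the kernel is sensitive to where a motif occurs, which is what makes it suitable for fixed-position signals such as splice sites.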
Content sensors: We also use SVMs with Spectrum kernels as content
sensors to distinguish different segments by their oligonucleotide
composition (we consider 3- to 6-mers). There are content sensors for intergenic segments, UTR segments, introns, coding exons, and optionally intercistronic segments.
Additionally, we train a sensor that discriminates in-frame coding
3-mers and 6-mers from shifted (out-of-frame) subsequences.
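The Spectrum kernel underlying these content sensors compares sequences by their k-mer count vectors, independent of position. A minimal sketch (names are illustrative, not from the original system):

```python
from collections import Counter

def spectrum_kernel(x, y, k=3):
    """Spectrum kernel sketch: inner product of the k-mer count
    vectors of the two sequences."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(cx[kmer] * cy[kmer] for kmer in cx)
```

Ignoring position makes this kernel a natural fit for content sensors, where overall oligonucleotide composition, not motif location, distinguishes e.g. coding exons from introns.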
In the second step, the outputs of all layer 1 sensors are combined in
order to predict gene structures (segmentations). To this end, we
extended the hidden semi-Markov SVM framework. The states (nodes) of our graphical model roughly correspond to the
signals just described (for some signals several states exist,
e.g. for ACC and DON to model coding exons in different phases).
Transitions (edges) correspond to segments starting and ending with
corresponding signals, e.g. exons start from an ACC state and end in a
DON state. The set of transitions captures valid gene structures
(valid paths through the model); polyA and trans are optional states
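The set of valid transitions can be thought of as an adjacency structure over signal states. The following is a hypothetical, heavily simplified graph for illustration only; the actual model contains more states (e.g. phase-specific copies of ACC and DON):

```python
# Simplified, hypothetical state graph; state names follow the text.
# polyA is optional and can be bypassed (STOP may go straight to cleavage).
transitions = {
    "TSS": ["TIS"],            # transcription start -> translation init
    "TIS": ["DON", "STOP"],    # single- or multi-exon genes
    "DON": ["ACC"],            # donor splice site opens an intron
    "ACC": ["DON", "STOP"],    # acceptor closes it; more exons or stop
    "STOP": ["polyA", "cleavage"],
    "polyA": ["cleavage"],
}

def is_valid_path(path):
    """True if each consecutive state pair is an allowed transition."""
    return all(b in transitions[a] for a, b in zip(path, path[1:]))
```

A path through this graph corresponds to one candidate gene structure, and the semi-Markov model scores the segments spanned by each transition.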
and can be bypassed. Our algorithm learns transformations (piecewise linear functions),
which can be seen as a weighting of the contributions of all layer 1
outputs, as well as of segment length contributions, in order to obtain a
global score. The learning algorithm follows the large margin paradigm, maximizing the
difference between the score of the true segmentation and that of any
other (wrong) segmentation.
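A piecewise linear function of this kind can be sketched as interpolation between learned support points; in training, the values at the supports are the learned parameters, whereas here they are fixed for illustration:

```python
import bisect

def plf(x, supports, values):
    """Piecewise linear transformation of a layer 1 sensor output x.
    supports: increasing support points; values: learned function
    values at those points (fixed here purely for illustration)."""
    if x <= supports[0]:
        return values[0]
    if x >= supports[-1]:
        return values[-1]
    j = bisect.bisect_right(supports, x)
    x0, x1 = supports[j - 1], supports[j]
    y0, y1 = values[j - 1], values[j]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
```

The global score of a segmentation is then a sum of such transformed sensor outputs and length contributions along the path, and the large-margin training adjusts the support values so that the true segmentation outscores every wrong one.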