Methods
We tackle the gene prediction problem using a two-layered
approach. In a first step, state-of-the-art kernel machines
are employed to detect signal sequences in genomic DNA (such as splice sites or transcription start sites) and to discriminate the content of different DNA sequences (such as coding exons, introns, etc.). In a second
step, their outputs are combined to predict whole gene
structures. For this step we use a discriminative training approach
based on hidden semi-Markov support vector machines.
Signal sensors: We employ support vector machines (SVMs) as independent
detectors of signal sites, i.e. transition sites between segments. These signal detectors cover transcription start sites, translation initiation sites, donor and acceptor splice sites, translation termination and cleavage sites, and optionally trans-splice sites and polyadenylation signals. All signal sensor SVMs use string kernels such as the Weighted Degree
kernel, which operates directly on DNA sequences as input; some
additionally use Spectrum kernels.
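To illustrate, a minimal sketch of a Weighted Degree kernel between two equal-length DNA sequences; the function name and the particular degree weighting are our own assumptions, not the authors' implementation:

```python
def weighted_degree_kernel(x, y, max_degree=3):
    """Weighted Degree kernel sketch: sums weights for substrings of
    length 1..max_degree that match at the same position in x and y."""
    assert len(x) == len(y)
    L, D = len(x), max_degree
    score = 0.0
    for d in range(1, D + 1):
        # A commonly used weighting that emphasizes shorter matches.
        beta = 2.0 * (D - d + 1) / (D * (D + 1))
        for i in range(L - d + 1):
            if x[i:i + d] == y[i:i + d]:
                score += beta
    return score
```

Because matches are counted per position, the kernel is sensitive to where a motif occurs, which is what makes it suitable for fixed-position signals such as splice sites.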
Content sensors: We also use SVMs with Spectrum kernels as content
sensors to distinguish different segments by their oligonucleotide
composition (we consider 3- to 6-mers). There are content sensors for intergenic segments, UTR segments, introns, coding exons, and optionally intercistronic segments.
Additionally, we train a sensor that discriminates in-frame coding
3-mers and 6-mers from shifted (out-of-frame) subsequences.
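The Spectrum kernel underlying these content sensors compares sequences by their k-mer count vectors, independent of position. A minimal sketch (names are illustrative, not from the original system):

```python
from collections import Counter

def spectrum_kernel(x, y, k=3):
    """Spectrum kernel sketch: inner product of the k-mer count
    vectors of the two sequences."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(cx[kmer] * cy[kmer] for kmer in cx)
```

Ignoring position makes this kernel a natural fit for content sensors, where overall oligonucleotide composition, not motif location, distinguishes e.g. coding exons from introns.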
In the second step, the outputs of all layer 1 sensors are combined in
order to predict gene structures (segmentations). To this end, we
extended the hidden semi-Markov SVM framework. The states (nodes) of our graphical model roughly correspond to the
signals just described (for some signals several states exist,
e.g. for ACC and DON to model coding exons in different phases).
Transitions (edges) correspond to segments starting and ending with
corresponding signals, e.g. exons start from an ACC state and end in a
DON state. The set of transitions captures valid gene structures
(valid paths through the model); polyA and trans are optional states
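The set of valid transitions can be thought of as an adjacency structure over signal states. The following is a hypothetical, heavily simplified graph for illustration only; the actual model contains more states (e.g. phase-specific copies of ACC and DON):

```python
# Simplified, hypothetical state graph; state names follow the text.
# polyA is optional and can be bypassed (STOP may go straight to cleavage).
transitions = {
    "TSS": ["TIS"],            # transcription start -> translation init
    "TIS": ["DON", "STOP"],    # single- or multi-exon genes
    "DON": ["ACC"],            # donor splice site opens an intron
    "ACC": ["DON", "STOP"],    # acceptor closes it; more exons or stop
    "STOP": ["polyA", "cleavage"],
    "polyA": ["cleavage"],
}

def is_valid_path(path):
    """True if each consecutive state pair is an allowed transition."""
    return all(b in transitions[a] for a, b in zip(path, path[1:]))
```

A path through this graph corresponds to one candidate gene structure, and the semi-Markov model scores the segments spanned by each transition.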
and can be bypassed. Our algorithm learns transformations (piecewise linear functions),
which can be seen as a weighting of the contributions of all layer 1
outputs, as well as of segment length contributions, in order to obtain a
global score. The learning algorithm follows the large margin paradigm, maximizing the
difference between the score of the true segmentation and that of any
other (wrong) segmentation.
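A piecewise linear function of this kind can be sketched as interpolation between learned support points; in training, the values at the supports are the learned parameters, whereas here they are fixed for illustration:

```python
import bisect

def plf(x, supports, values):
    """Piecewise linear transformation of a layer 1 sensor output x.
    supports: increasing support points; values: learned function
    values at those points (fixed here purely for illustration)."""
    if x <= supports[0]:
        return values[0]
    if x >= supports[-1]:
        return values[-1]
    j = bisect.bisect_right(supports, x)
    x0, x1 = supports[j - 1], supports[j]
    y0, y1 = values[j - 1], values[j]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
```

The global score of a segmentation is then a sum of such transformed sensor outputs and length contributions along the path, and the large-margin training adjusts the support values so that the true segmentation outscores every wrong one.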