Abstract:
Because the quantity of available DNA sequence data has exploded over the past decades, developing new methods that detect genes accurately is vital. The success of these methods depends strongly on precise identification of splice sites.
In eukaryotic genomes, each gene is composed of exons and introns. During gene expression, the introns are removed by splicing, so only the exons, which code for proteins, are retained in the mature mRNA. The term splice site refers to the boundary between an exon and an intron. The intron-exon junction, with consensus dinucleotide AG, is called the acceptor splice site, while the exon-intron junction, with consensus dinucleotide GT, is called the donor splice site. In a DNA sequence, splice site prediction is therefore a search problem: finding the donor and acceptor boundaries.
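As a minimal illustration of why this is a search problem, the sketch below lists every GT and AG dinucleotide in a sequence as a candidate donor or acceptor site. The function name and the example sequence are hypothetical and are not taken from this work. Most such candidates are not real splice sites, which is why a classifier is needed to separate true sites from decoys.

```python
def candidate_splice_sites(seq):
    """Return 0-based start positions of GT (donor) and AG (acceptor) dinucleotides.

    Hypothetical helper for illustration only: it finds candidates by the
    consensus dinucleotide alone and makes no attempt to classify them.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


# Made-up example sequence, not real genomic data.
donors, acceptors = candidate_splice_sites("CAGGTAAGTCTTTCAGGA")
print(donors)     # candidate donor positions
print(acceptors)  # candidate acceptor positions
```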
Numerous machine learning methods have been used for splice site identification. Their performance depends heavily on the DNA encoding approach, which extracts informative features from DNA sequences.
Using the AdaBoost classifier, we proposed three new DNA encoding methods for feature extraction that combine several approaches already proven successful in capturing patterns around splice sites. The proposed approaches performed significantly better than eleven current state-of-the-art algorithms on several performance criteria.
We have also developed an online prediction server (HSSAda) based on the proposed approach, which is freely available at https://pashaei.shinyapps.io/hssada. On the independent test set, the HSSAda tool achieved higher accuracy than existing tools such as NNSplice, WMM, MM1, and MEM. Given their high prediction accuracy and simplicity, we believe the proposed methods can help in discovering the location and structure of eukaryotic genes.
We also assessed the performance of random forest (RF) as a classification and feature selection method for splice site prediction. The investigation asked whether, with Markovian encoding methods, RF outperforms the support vector machine (SVM), the most prominent classification approach in splice site detection.
Finally, we proposed another DNA encoding method for splice site detection that uses SVM and a second-order Markov model.
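The abstract does not spell out the encoding, so the following is only a generic sketch of how a second-order Markov model can turn a DNA window into numeric features. It assumes position-independent transition probabilities P(x_i | x_{i-2}, x_{i-1}) estimated with additive smoothing, and sequences restricted to A, C, G, and T. All names and the toy training data are hypothetical. The resulting log-probability vectors are the kind of input that could then be passed to an SVM.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"  # assumption: sequences contain only these four bases


def fit_second_order(seqs, pseudocount=1.0):
    """Estimate P(x_i | x_{i-2}, x_{i-1}) from sequences with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(2, len(s)):
            counts[s[i - 2:i]][s[i]] += 1
    probs = {}
    for ctx in map("".join, product(ALPHABET, repeat=2)):
        total = sum(counts[ctx].values()) + pseudocount * len(ALPHABET)
        probs[ctx] = {b: (counts[ctx][b] + pseudocount) / total for b in ALPHABET}
    return probs


def encode(seq, probs):
    """Map a sequence to one log conditional probability per position from index 2 on."""
    return [math.log(probs[seq[i - 2:i]][seq[i]]) for i in range(2, len(seq))]


# Toy training data, made up for illustration.
model = fit_second_order(["AAGT", "AAGT"])
features = encode("AAGT", model)
print(features)
```

In practice one such model would typically be fitted on true sites and another on decoy sites, so that the features reflect how well a window matches each class; that design choice is an assumption here, not a description of the proposed method.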