计算机学报

	Chinese Journal of Computers Full Text
Title	Prediction of Polyadenylation in Human Gene Sequences
Authors	LIAO Kun, DUAN Jiang-Bo, ZHOU Yan-Hong
Address	(Hubei Bioinformatics and Molecular Imaging Key Laboratory, Huazhong University of Science and Technology, Wuhan 430074)
Year	2008
Issue	No. 6 (927-933)
Abstract & Background	Abstract: Polyadenylation (PolyA), which occurs at the 3' end of mRNA, is one of the three main steps of eukaryotic pre-mRNA processing. Predicting polyadenylation sites in human DNA and mRNA sequences is important both for understanding pre-mRNA processing and for predicting gene structure. This paper presents a machine learning method to predict polyadenylation signals (PASes) in human DNA and mRNA sequences. The method consists of three feature-manipulation steps: generation, selection and integration of features. In the first step, new features are generated using k-gram nucleotide patterns. In the second step, a number of important features are selected by an entropy-based algorithm. In the third step, support vector machines are employed to recognize true PASes among a large number of candidates. Together, these steps yield a mathematical model. At a sensitivity of 60%, the corresponding specificity is 71.67% at the intron level and 80.77% at the exon level.
Keywords: Polyadenylation signals; machine learning; entropy; support vector machines
Background: Polyadenylation (PolyA), which occurs at the 3' end of mRNA, is one of the three main steps of eukaryotic pre-mRNA processing. Predicting polyadenylation sites in human DNA and mRNA sequences is important both for understanding pre-mRNA processing and for predicting gene structure. When a 3' UTR contains more than one potential PolyA site, alternative polyadenylation can determine gene expression in a tissue-dependent or disease-dependent manner. For gene structure prediction, accurately identifying PolyA sites helps to determine the 3' end of a gene. Recent studies use two main approaches to find PolyA sites: EST (Expressed Sequence Tag) based methods and statistics-based methods. The EST-based method analyzes EST and genomic sequences to characterize potential PolyA sites. One such program is PASS, developed by Zhengyan Kan.
The statistics-based method analyzes the upstream and downstream elements of a site and extracts useful features to build a mathematical model for PolyA site prediction. Polyadq and ERPIN follow this approach. This paper presents a machine learning method to predict polyadenylation signals (PASes) in human DNA and mRNA sequences. At a sensitivity of 60%, the corresponding specificity is 71.67% at the intron level and 80.77% at the exon level. This study is supported by the National Natural Science Foundation of China (Main Program, Grant No. 90608020), the Specialized Research Fund for the Doctoral Program of Higher Education (Grant No. 20050487037), the Program for New Century Excellent Talents in University, and the National Program for Sci-Tech Basic Conditions Platform Construction of the Ministry of Science and Technology of China.
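To make the first two steps of the method more concrete, here is a minimal Python sketch of k-gram feature generation followed by an entropy-based ranking of the features. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which entropy-based algorithm was used, so the sketch substitutes information gain from the best single-threshold split. The function names, the choice of `k_max=2`, and the toy sequences are all hypothetical. The third step (training a support vector machine on the selected features) is not shown.

```python
import math
from collections import Counter
from itertools import product


def kgram_features(seq, k_max=2, alphabet="ACGT"):
    """Count every k-gram (k = 1..k_max) in seq, including overlapping ones."""
    seq = seq.upper().replace("U", "T")  # assumption: treat mRNA and DNA alike
    feats = {}
    for k in range(1, k_max + 1):
        for gram in product(alphabet, repeat=k):
            feats["".join(gram)] = 0
        for i in range(len(seq) - k + 1):
            gram = seq[i:i + k]
            if gram in feats:  # skips windows containing ambiguous bases such as N
                feats[gram] += 1
    return feats


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Largest entropy reduction from splitting one feature at one threshold."""
    base = entropy(labels)
    best = 0.0
    for t in sorted(set(values))[:-1]:  # candidate cuts: value <= t vs. value > t
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        split = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        best = max(best, base - split)
    return best


def select_features(feature_dicts, labels, top_n=5):
    """Rank features by information gain and return the top_n feature names."""
    names = sorted(feature_dicts[0])
    gains = {
        name: information_gain([f[name] for f in feature_dicts], labels)
        for name in names
    }
    # Stable sort: features with equal gain stay in alphabetical order.
    return sorted(names, key=lambda name: gains[name], reverse=True)[:top_n]


# Toy sequence windows, invented for illustration (not data from the paper).
windows = ["AATAAAGCTT", "TTAATAAAGG", "GCGCGGCCGC", "CCGGCGCGTA"]
labels = [1, 1, 0, 0]  # 1 = true PAS context, 0 = false candidate
features = [kgram_features(w) for w in windows]
selected = select_features(features, labels, top_n=3)
```

In a full pipeline along the lines described in the abstract, the counts of the selected k-grams would form the input vectors for the support vector machine classifier.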