计算机学报

	Chinese Journal of Computers Full Text
Title	An Algorithm for Identification of Repeats with Accurate Boundaries
Authors	HUO Hong-Wei BAI Fan
Address	(School of Computer Science and Technology, Xidian University, Xi’an 710071)
Year	2008
Issue	No.2(214—219)
Abstract & Background	Abstract Most existing methods of repeat identification either rely on annotated repeat databases or limit repeats to pairs of similar sequences that are maximal in length. And there is no an exact definition to correctly balances the importance of the length and the frequency. For these shortages, a fast method for repeats identification of repeat families via extension of consensus seed is proposed in this paper, which enables a rigorous definition of repeat boundaries and is based on the variant of BLAST algorithm. The known C.briggsae is used for testing the RepeatSearcher. RepeatSearcher is compared with RECON, the most popular repeats identification algorithm, and the newly developed RepeatScout. The experimental results indicate that RepeatSearcher has more accurate boundaries for each repeats, and the time of RepeatSearcher is reduced as compared with other methods with guaranteed accuracy. keywords consensus sequence; repeat identification; accurate boundaries; BLAST; RepeatSearcher background Repetitive DNA comprises a significant fraction of eukaryotic genomes, e.g. about 20% of Caenorhabdits elegans and Caenorhabditis briggsae genomes and about 50% of the human genome have been identified as repetitive DNA. Repeat identification is a critical part of the analysis of a newly sequenced genome, both because repeats drive genome evolution in diverse ways and because of a pragmatic need for thorough repeat masking prior to performing homology searches. Multiple genome projects have generated large volumes of biological data. Novel techniques are called for to process these data and extract useful information. Several fundamental questions remains unanswered in Genomics. Some of the most intriguing questions are the roles of non-coding DNA, and in particular the role of repetitive sequences. Most existing algorithms for constructing repeat families start with a set of pairwise similarities to construct a set of repeats, such as WU_BLAST or REPuter. They only report all pairs of repeated segments. The early single linkage clustering approach first merges overlapping substrings occurring in the set of pairwise similarities, and then uses the pairwise similarities to combine the merged substrings into repeat families. The algorithms, RepeatFinder and RECON, have significantly extended and improved the single linage clustering approach. The RepeatGluter algorithm is the distinct method to glue pairwise similarities into repeat families, in which A-Bruijn graph is proposed to explore the block structure of the repeat sequences. The PILER algorithm is also the deserved algorithm, which achieves high specificity in identifying different types of repeats at the cost of some sensitivity. There are two drawbacks using a set of pairwise similarities as the basic to construct the set of repeat families. First it is a tough work to define the true element boundary. Second, exact constructing a set of pairwise similarities involves computationally intensive tasks for the large repeat-rich genomes. For example, there are over 106 copies Alu repeats in human genome and these copies can lead to about 1012 pairwise alignments, thus constructing such a set of pairwise similarities makes it computationally infeasible. A fast method for repeats identification of repeat families via extension of consensus seed is proposed in this paper, which enables a rigorous definition of repeat boundaries and is based on the variant of BLAST algorithm. the scoring function is given and the exact boundary function is defined. The experimental results indicate the feasibility of the given algorithm.