计算机学报

	Chinese Journal of Computers Full Text
Title	ADE-Tri-training: Tri-training with Adaptive Data Editing
Authors	DENG Chao GUO Mao-Zu
Address	(School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001)
Year	2007
Issue	No.8 (1213-1226)
Abstract & Background	Abstract Tri-training, a Co-training style semi-supervised learning algorithm, can effectively exploit unlabeled examples to improve generalization ability. However, Tri-training may suffer more from a problem common in semi-supervised learning: performance is often unstable because unlabeled examples may be wrongly labeled, and these errors accumulate during the iterative learning process. This paper proposes a new Tri-training style algorithm named ADE-Tri-training (Tri-training with Adaptive Data Editing). ADE-Tri-training not only employs a specific data editing technique to identify and discard possibly mislabeled examples during the iterations in which the three classifiers label examples for one another, but also adopts an adaptive strategy that triggers or inhibits the editing operation according to the situation. The adaptive strategy combines five precondition theorems, all of which ensure, under PAC theory, that the classification error is reduced and the size of the new training set increases from iteration to iteration. The paper also proves all of these precondition theorems. Experiments on UCI datasets show that ADE-Tri-training exploits unlabeled examples to improve classification generalization more effectively and more stably than Tri-training and DE-Tri-training (Tri-training with data editing but without the adaptive strategy). Keywords semi-supervised learning; data editing; adaptive strategy; PAC learning; Tri-training Background Traditional supervised learning uses only labeled examples for training. However, labeled instances are often difficult, expensive, or time-consuming to obtain, because they require the effort of experienced human annotators. Meanwhile, unlabeled data may be relatively easy to collect, especially in bioinformatics.
Therefore, bioinformatics and many other data mining tasks turn to a machine learning paradigm called semi-supervised learning, which exploits a large number of unlabeled examples together with a small number of labeled examples to improve a classifier’s generalization ability. Co-training is a well-known semi-supervised classification model, and Tri-training is a revised Co-training style semi-supervised algorithm. Compared with standard Co-training and other revised versions, Tri-training has many advantages because it employs three base classifiers. However, Tri-training still suffers from a disadvantage common to Co-training style algorithms: performance is unstable because unlabeled examples may be wrongly labeled, and these errors accumulate during the learning process. Few current research works attempt to solve this problem. The objective of this work is to detect and remove wrongly labeled examples while unlabeled examples are being labeled, so as to improve the stability and generalization ability of Tri-training. The approach yields a revised form of Tri-training, ADE-Tri-training. Compared with Tri-training, ADE-Tri-training employs the RemoveOnly data editing operation to clean newly labeled examples and adopts an adaptive strategy to control when RemoveOnly is triggered or inhibited. Thus the labeling process can adaptively reduce the number of wrongly labeled examples according to the situation and effectively avoid the negative effects of RemoveOnly. The algorithm can be used in many tasks, such as bioinformatics, text mining, and web page and image classification. This work is mainly supported by the National Natural Science Foundation of China under grant No.60671011 (Research on Class-Driven RNA Secondary Structure Prediction Algorithms) and the Science Fund for Distinguished Young Scholars of Heilongjiang Province in China under grant No.JC200611 (Research on Machine Learning Algorithms for Computational Biology).
These projects focus on developing effective machine learning approaches for biological data processing and modeling. The research group has previously proposed, among other work, a practical permutation- and GA-based approach to RNA secondary structure prediction and a PSO- and EM-based approach to phylogenetic tree construction. Prior to the work in this paper, the group proposed semi-supervised clustering algorithms based on Tri-training and data editing (DE-Tri-training based K-means), in which they observed the disadvantage of triggering the data editing operation by rote. The work in this paper addresses that disadvantage.
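The labeling-plus-editing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the plain k-nearest-neighbour majority vote used as the editing rule, and the boolean `apply_editing` flag are all assumptions made for illustration; in the paper, the decision to trigger or inhibit editing comes from the five precondition theorems, which the abstract does not spell out.

```python
from collections import Counter
from math import dist


def agree_label(h_j, h_k, unlabeled):
    """Pseudo-label the examples on which the two other classifiers agree."""
    return [(x, h_j(x)) for x in unlabeled if h_j(x) == h_k(x)]


def remove_only(candidates, labeled, k=3):
    """Discard candidates whose pseudo-label loses a k-nearest-neighbour vote.

    Assumption: a plain majority vote over labeled and newly labeled
    examples; the exact RemoveOnly rule is not given in the abstract.
    """
    pool = labeled + candidates
    offset = len(labeled)
    kept = []
    for i, (x, y) in enumerate(candidates):
        # Rank every other example in the pool by distance to x.
        neighbours = sorted(
            (item for j, item in enumerate(pool) if j != offset + i),
            key=lambda item: dist(x, item[0]),
        )[:k]
        if not neighbours:
            kept.append((x, y))  # nothing to vote with, keep the example
            continue
        votes = Counter(label for _, label in neighbours)
        if votes.most_common(1)[0][0] == y:
            kept.append((x, y))
    return kept


def labeling_round(h_j, h_k, labeled, unlabeled, apply_editing, k=3):
    """Build the newly labeled set offered to the third classifier.

    `apply_editing` is a hypothetical stand-in for the adaptive strategy.
    """
    candidates = agree_label(h_j, h_k, unlabeled)
    if apply_editing:
        candidates = remove_only(candidates, labeled, k)
    return candidates
```

Here `labeled` is a list of `(feature_tuple, label)` pairs, `unlabeled` is a list of feature tuples, and `h_j`, `h_k` are any callables that map a feature tuple to a label. Passing `apply_editing=False` gives the plain agreement-based labeling step, so the effect of editing on the newly labeled set can be compared directly.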