Chinese Journal of Computers   Full Text
  Title: Improved Growing Learning Vector Quantification for Text Classification
  Authors: WANG Xiu-Jun, SHEN Hong
  Address: Department of Computer Science and Technology, University of Science and Technology of China, Hefei 230039
  Year: 2007
  Issue: No. 8 (1277-1285)
  Abstract & Background
Abstract As a simple classification method, KNN has been widely applied in text classification. There are two problems in KNN-based text classification: the large computational load, and the deterioration of classification accuracy caused by the non-uniform distribution of training samples. To solve these problems, the authors propose a new growing LVQ method based on minimizing the increment of learning errors and combining LVQ with GNG, and apply it to text classification. The method can generate an effective representative sample set after one phase of selective training on the training sample set, and hence has a strong learning ability. Experimental results show that this method can not only reduce the testing time of KNN, but also maintain or even improve the accuracy of classification.
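The LVQ component the abstract refers to can be illustrated with the standard LVQ1 update rule, in which the prototype nearest to a training sample moves toward it when their class labels agree and away from it otherwise. This is a minimal sketch of plain LVQ1 only; the paper's variant additionally minimizes the increment of learning error, which is not modeled here, and the function and parameter names are illustrative.

```python
import numpy as np

def lvq1_update(prototypes, proto_labels, x, y, lr=0.1):
    """One standard LVQ1 step (a sketch, not the authors' exact rule):
    find the prototype nearest to sample x, then move it toward x if
    its label matches y, or away from x if it does not."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    j = int(np.argmin(dists))                      # best-matching prototype
    sign = 1.0 if proto_labels[j] == y else -1.0   # attract or repel
    prototypes[j] += sign * lr * (x - prototypes[j])
    return j

# Tiny demo: two prototypes, one training sample of class "A".
protos = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = ["A", "B"]
j = lvq1_update(protos, labels, np.array([1.0, 0.0]), "A", lr=0.5)
# The nearest prototype (index 0) moves halfway toward the sample.
```

Iterating such updates over the training set yields a small prototype set that can stand in for the full sample set at KNN test time, which is where the speedup in the abstract comes from.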

Keywords: learning vector quantification; growing neural gas; learning error; inter-class distance; learning probability

Background This paper addresses the problem of effective text classification in a large set of non-uniformly distributed documents. When the document set is large and the documents are non-uniformly distributed, the classical KNN method suffers from low classification accuracy and excessively long classification time. Many solutions have been proposed in the literature to deal with this problem, but they focus only on quickly searching for the nearest neighbor or on cutting samples they consider redundant from the original sample set. Most of them can reduce the classification time but do not significantly improve the classification accuracy. To tackle the problem, the authors develop a new growing LVQ algorithm in which LVQ is modified by incorporating a learning error and the growing mechanism of GNG. The algorithm generates new sample sets from the original sample set, and then conducts classification on these new sample sets. Experimental results show that the authors' method can improve both classification speed and accuracy. The classification accuracy of the proposed method is approximately the same as SVM on common categories of documents and better than SVM on rare categories.
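The growing mechanism borrowed from GNG can be sketched as follows: each prototype accumulates an error counter, and a new prototype is periodically inserted near the unit with the largest accumulated error. This is a simplified version of the standard GNG insertion step (here using the nearest neighbor instead of GNG's topology edges); the paper's actual growth rule also involves inter-class distance and learning probability, which are not modeled, and all names are illustrative.

```python
import numpy as np

def grow(prototypes, errors):
    """Simplified GNG-style insertion (a sketch of the standard rule,
    not necessarily the paper's exact variant): place a new prototype
    midway between the highest-error unit and its nearest neighbour,
    then halve the error counters of both parents."""
    q = int(np.argmax(errors))                     # highest-error unit
    d = np.linalg.norm(prototypes - prototypes[q], axis=1)
    d[q] = np.inf                                  # exclude q itself
    f = int(np.argmin(d))                          # q's nearest neighbour
    new = 0.5 * (prototypes[q] + prototypes[f])    # midpoint insertion
    errors[q] *= 0.5
    errors[f] *= 0.5
    prototypes = np.vstack([prototypes, new])
    errors = np.append(errors, errors[q])          # new unit inherits q's error
    return prototypes, errors

# Demo: the middle unit has the largest error, so growth happens there.
P = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
E = np.array([1.0, 5.0, 0.5])
P, E = grow(P, E)   # a fourth prototype appears in the high-error region
```

Repeating this step lets the representative set grow only where the data is dense or hard to classify, which is how a growing LVQ can cope with non-uniformly distributed documents.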
Text classification is a very important problem, as the number of documents grows rapidly with the growth of the WWW. KNN, although an efficient classification method, fails to work satisfactorily on large sets of non-uniformly distributed documents. The authors' work refines the original sample set and improves classification in both speed and accuracy.
This work belongs to a main research stream of the Laboratory of Service Computing and Applications at the University of Science and Technology of China, and is supported by the 100 Talents Program of the Chinese Academy of Sciences.