计算机学报

	Chinese Journal of Computers Full Text
Title	Text Classification Based on Labeled-LDA Model
Authors	LI Wen-Bo 1),2), SUN Le 1), ZHANG Da-Kun 1)
Address	1)(Institute of Software, Chinese Academy of Sciences, Beijing 100080) 2)(Graduate University of the Chinese Academy of Sciences, Beijing 100049)
Year	2008
Issue	No. 4 (620-627)
Abstract	Latent Dirichlet Allocation (LDA) is a recently proposed model that extracts latent topics from text data. This paper proposes Labeled-LDA, which extends traditional LDA to integrate class information. Based on Labeled-LDA, a new algorithm is introduced to determine the number of latent topics for each class jointly. In this way, the Labeled-LDA model avoids the forced allocation behavior that traditional LDA shows when it is used as a component in a classification framework. Experiments on the Fudan corpus and the comp subset of the 20 Newsgroups corpus show that the new method improves text classification effectiveness: on the micro_F1 measure, the improvement approaches 5.7% on the Fudan corpus and 3% on the comp subset of the 20 Newsgroups corpus.
Keywords	text classification; graphical model; Latent Dirichlet Allocation (LDA); variational inference
Background	This paper focuses on new text representation methods and their application to text classification. Classical text representation methods mainly include the vector space model, n-grams, HMMs, etc. These methods have been widely used in natural language processing. Recently, a new type of statistical language model, the topic model, has become an active research direction in text representation. The fundamental goal of topic models is to explore the latent structure of a document through content analysis. Topic models differ mainly in the assumptions they make about topic structure, such as the linear array of the LDA model, the DAG of the PAM model, and the complete graph of the CTM model. A more reasonable topic structure yields a more expressive topic model. In their research, the authors propose a new topic model, the Labeled-LDA model, which encodes the class information of documents into the traditional LDA model.
In this way, they obtain a more capable text representation method that avoids the forced allocation behavior of traditional LDA when it is used in text classification. Based on the Labeled-LDA model, they introduce a new text classification algorithm to determine the number of latent topics for each class jointly. This research is supported by the National Natural Science Foundation Program of China under grants 60773027 and 60736044 and by the National High Technology Research and Development Program of China (863 Program) under grant 2006AA010108: research on the theory, algorithms, and implementation of statistical language models and their applications in natural language processing, information retrieval, and related areas. Statistical language models play a fundamental role in natural language processing, and information retrieval also treats the language model as one of its most important paradigms. This research group has worked on many aspects of statistical language models. Related papers have been published at international conferences (COLING, the International Conference on Computational Linguistics; IJCNLP, the International Joint Conference on Natural Language Processing; AIRS, the Asia Information Retrieval Symposium; etc.) and in journals (JCIP, the Journal of Chinese Information Processing; etc.). In this paper, they study topic language models and propose the Labeled-LDA model, which integrates class information into the traditional LDA model. Furthermore, they apply the Labeled-LDA model to text classification. Experiments show that this method can enhance the performance of text classification.
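For readers unfamiliar with the micro_F1 measure cited in the abstract, the following is a minimal sketch of micro-averaged F1 for single-label classification: true positives, false positives, and false negatives are pooled over all classes before precision and recall are computed. The function name `micro_f1` and the toy labels are illustrative assumptions and are not taken from the paper, whose exact evaluation protocol may differ.

```python
from collections import Counter


def micro_f1(gold, predicted):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1 once."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[g] += 1  # the gold class gains a false negative
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    if total_tp == 0:
        return 0.0
    precision = total_tp / (total_tp + total_fp)
    recall = total_tp / (total_tp + total_fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, for illustration only (not data from the paper).
gold = ["sports", "sports", "economy", "art"]
predicted = ["sports", "economy", "economy", "art"]
score = micro_f1(gold, predicted)
```

When every document has exactly one gold label and one predicted label, as in this sketch, each error counts once as a false positive and once as a false negative, so micro-averaged F1 coincides with plain accuracy.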