计算机学报

	Chinese Journal of Computers Full Text
Title	A Method of Adaptively Selecting Best LDA Model Based on Density
Authors	CAO Juan1),2),3) ZHANG Yong-Dong1),2) LI Jin-Tao1),2) TANG Sheng1),2)
Address	1)(Virtual Reality Laboratory, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) 2)(Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) 3)(Graduate University of Chinese Academy of Sciences, Beijing 100049)
Year	2008
Issue	No. 10 (1780-1787)
Abstract & Background	Abstract: Topic models have been successfully applied to information classification and retrieval. These models capture word correlations in a collection of textual documents with a low-dimensional set of multinomial distributions, called "topics". Selecting an appropriate number of topics for a specific dataset is important but difficult. This paper proposes a theorem stating that the model is optimal when the average similarity among topics is at its minimum and, based on this theorem, proposes a density-based method for adaptively selecting the best LDA model. Experiments show that the proposed method achieves performance matching the best LDA model without manual tuning of the number of topics.
Keywords: topic model; topic; LDA; density
Background: Statistical topic models such as Latent Dirichlet Allocation (LDA) have been successfully used to analyze large amounts of textual information in many tasks, including language modeling, document classification, information retrieval, document summarization and data mining. These models capture word correlations in a collection of textual documents with a low-dimensional set of multinomial distributions, called "topics". To further model inter-topic correlations, recent advances in this area, such as the Correlated Topic Model (CTM), have explored richer structures to discover large numbers of more accurate and fine-grained topics. But all these models share the same practical difficulty: determining the number of topics. Model selection methods such as cross-validation and Bayesian model testing are usually inefficient. Teh et al. propose the Hierarchical Dirichlet Process (HDP) to solve the problem. A Dirichlet process does not require specifying the number of mixture components in advance, and the HDP allows mixture components to be shared among a set of mixture models.
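The selection criterion stated in the abstract (prefer the model whose topics are, on average, least similar to one another) can be illustrated with a minimal sketch. This is not the paper's density-based algorithm, which this record does not describe in detail. It assumes that each candidate model is given as a topic-word matrix with one row per topic, and that similarity is measured by cosine similarity. The function names `average_topic_similarity` and `select_num_topics` and the toy matrices are hypothetical.

```python
import numpy as np


def average_topic_similarity(topic_word: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of topics (rows).

    Assumption: cosine similarity between topic-word vectors is used as the
    similarity measure; this record does not specify the exact measure.
    """
    num_topics = topic_word.shape[0]
    if num_topics < 2:
        raise ValueError("need at least two topics to compare")
    # Normalize each topic vector to unit length.
    norms = np.linalg.norm(topic_word, axis=1, keepdims=True)
    unit = topic_word / norms
    # Pairwise cosine similarities; keep only the strict upper triangle
    # so each unordered pair of topics is counted once.
    similarity = unit @ unit.T
    pairs = similarity[np.triu_indices(num_topics, k=1)]
    return float(pairs.mean())


def select_num_topics(models: dict) -> int:
    """Return the topic count whose model has the lowest average similarity.

    `models` maps a candidate number of topics to that model's
    topic-word matrix (hypothetical input format).
    """
    scores = {k: average_topic_similarity(tw) for k, tw in models.items()}
    return min(scores, key=scores.get)


# Toy candidates (made-up numbers, for illustration only).
toy_models = {
    2: np.array([[0.9, 0.1, 0.0],
                 [0.8, 0.2, 0.0]]),   # two nearly identical topics
    3: np.eye(3),                     # three fully distinct topics
}
best_k = select_num_topics(toy_models)
```

In practice the topic-word matrices would come from LDA models trained with different numbers of topics; an exhaustive comparison like this one is only a simple stand-in for the adaptive search the abstract refers to.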
This paper is supported by the National High Technology Research and Development Program (863 Program) of China under grant No.2007AA01Z416; the National Basic Research Program (973 Program) of China under grant No.2007CB311100; the National Natural Science Foundation of China under grant No.60773056; the Beijing New Star Project on Science & Technology under grant No.2007B071.