Chinese Journal of Computers
Title: A Novel Text Clustering Algorithm Based on Inner Product Space Model of Semantic
Authors: PENG Jing 1),2), YANG Dong-Qing 1), TANG Shi-Wei 1), FU Yan 1), JIANG Han-Kui 2)
Address: 1) School of Electronics Engineering and Computer Science, Peking University, Beijing 100871
2) Information and Communication Department, Chengdu Public Security Bureau, Chengdu 610017
Year: 2007
Issue: No. 8 (1354-1363)
Abstract & Background
Abstract Existing clustering algorithms do not consider the latent similarity among words, so their results on text data, and on short text data in particular, are not ideal. Taking into account the high dimensionality and sparsity of text data, this paper proposes a novel text clustering algorithm based on a semantic inner product space model. The paper first defines similarity measures among Chinese concepts, words, and texts based on the definition of an inner product space, and then analyzes the algorithm theoretically. Clustering proceeds in two phases: a top-down "divide" phase followed by a bottom-up "merge" phase. The method has been applied to the clustering of Chinese short documents. Extensive experiments show that it outperforms traditional algorithms.

keywords inner product space; text clustering; concept similarity; similarity computing; data mining

background This research was supported by the National Natural Science Foundation of China under grant Nos. 60473051 and 60503037, the China Postdoctoral Science Foundation under grant No. 20060400002, the Sichuan Youth Science and Technology Foundation of China under grant No. 2007Q14-055, the National High-tech Research and Development Program of China under grant No. 2006AA01Z230, and the Natural Science Foundation of Beijing under grant No. 4062018.
Web pages contain many very short documents, such as news titles, abstracts, and annotations. With the development of Web technologies and applications, interest in clustering short documents has been growing. Unlike traditional datasets, short documents are very high-dimensional and their data spaces are sparse.
Before a traditional method can cluster the documents, each document must first be converted to a Vector Space Model (VSM) or suffix-tree representation. Because short documents are high-dimensional and sparse, the computed similarity between documents is very low in many cases. Moreover, neither the Vector Space Model nor the suffix-tree model considers the relationships between words, so the similarity computed by traditional methods does not reflect how closely the documents are actually related.
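The effect can be seen in a small sketch. It assumes whitespace-tokenized English stand-ins for short texts; the example titles and the helper names bag_of_words and cosine are illustrative and do not come from the paper. Two titles on the same topic that share no word get a cosine similarity of exactly zero under a plain vector space model:

```python
import math
from collections import Counter


def bag_of_words(text):
    """Term-frequency vector of a whitespace-tokenized text (illustrative tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical short documents: doc1 and doc2 describe the same event
# with entirely different words; doc1 and doc3 share two of three words.
doc1 = bag_of_words("car crash downtown")
doc2 = bag_of_words("automobile accident city")
doc3 = bag_of_words("car crash highway")

sim_related = cosine(doc1, doc2)  # no overlapping terms
sim_overlap = cosine(doc1, doc3)  # two shared terms
print(sim_related, sim_overlap)
```

Because the vector space model treats every term as an independent dimension, only literal term overlap contributes to the score, which is why related short texts often look unrelated.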
This paper proposes a novel algorithm for clustering short documents, based on concept similarity in Chinese text processing. The paper first defines similarity measures among Chinese concepts, words, and texts based on the definition of an inner product space, and then analyzes the algorithm theoretically. Clustering proceeds in two phases: a top-down "divide" phase followed by a bottom-up "merge" phase. The method has been applied to the clustering of Chinese short documents. Extensive experiments show that it outperforms traditional algorithms.
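As a rough illustration of how an inner product can carry word-level similarity up to the text level, the sketch below uses the generalized inner product <x, y>_S = x^T S y, where S is a word-by-word similarity matrix. This is an assumed, generic form and not the paper's own definition: the vocabulary, the values in S, and the function names are all invented for the example. For the expression to be a valid inner product, S has to be symmetric and positive definite, which holds for the matrix chosen here.

```python
import math

# Hypothetical vocabulary and word-similarity matrix (values invented for illustration).
VOCAB = ["car", "automobile", "crash", "accident"]
S = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
]
IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]


def to_vector(text):
    """Term-count vector over VOCAB; out-of-vocabulary words are ignored."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def inner(x, y, sim):
    """Generalized inner product x^T S y for a similarity matrix sim."""
    n = len(x)
    return sum(x[i] * sim[i][j] * y[j] for i in range(n) for j in range(n))


def semantic_cosine(x, y, sim):
    """Cosine similarity induced by the inner product defined by sim."""
    denom = math.sqrt(inner(x, x, sim) * inner(y, y, sim))
    return inner(x, y, sim) / denom if denom > 0 else 0.0


a = to_vector("car crash")
b = to_vector("automobile accident")

plain = semantic_cosine(a, b, IDENTITY)  # identity matrix: ordinary VSM cosine
semantic = semantic_cosine(a, b, S)      # word similarities contribute
print(plain, semantic)
```

With the identity matrix the two texts remain orthogonal, while the assumed similarity matrix gives them a high score; how the paper actually derives word and concept similarities is not reproduced here.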