计算机学报

	Chinese Journal of Computers Full Text
Title	A Text Similarity Measurement Combining Word Semantic Information with TF-IDF Method
Authors	HUANG Cheng-Hui1),2) YIN Jian1) HOU Fang2)
Address	1)(School of Information Science and Technology, Sun Yat-sen University, Guangzhou 510006) 2)(Department of Computer Science and Technology, Guangdong University of Finance, Guangzhou 510520)
Year	2011
Issue	No. 5 (856-864)
Abstract & Background	Abstract Traditional text similarity measurements use the TF-IDF method to model text documents as term frequency vectors, and compute the similarity between documents with cosine similarity. These methods ignore the semantic information of text documents, while methods enhanced with semantic information distinguish between documents poorly, because extending the vectors with semantically similar terms aggravates the curse of dimensionality. This paper proposes a similarity measurement that is based on the TF-IDF method and analyzes the similarity between important terms in text documents. The approach uses NLP technology to pre-process text, and uses the TF-IDF method to select the key terms, that is, those with higher TF-IDF values than other common terms. With the proposed data structure TSWT (Term Similarity Weight Tree) and the definition of semantic similarity, this paper resolves the semantic information of those key terms to compute similarities between text documents. Finally, several K-Means clustering methods are used to evaluate the performance of the new text document similarity. In comparison with TF-IDF and another state-of-the-art similarity method based on semantic information, experimental results on a benchmark corpus demonstrate that the proposed method improves the F-Measure evaluation metric.
Keywords text clustering; term semantic similarity; text similarity; natural language processing
Background This work is supported by the National Natural Science Foundation of China (61033010), Research Foundation of National Science and Technology Plan Project (2008ZX10005-013), and Research Foundation of Science and Technology Plan Project in Guangdong Province (2009A080207005, 2009B090300450, 2010A040303004).
How to build a document similarity model is critical to text mining. For our task, given two input text documents, we want to determine the semantic similarity between them; our method therefore goes beyond simple methods based on word frequency.
Traditional word frequency methods model documents as TF-IDF vectors and use cosine similarity or the Jaccard coefficient to compute the similarity between documents. The TF-IDF vector ignores the meaning of words and the structure of documents. With the TF-IDF model, users must process a large set of vectors, each with a dimension equal to the number of words, which inevitably leads to inefficient computation. This paper improves on the state of the art by combining TF-IDF with term semantic information in an integrated way: it analyzes term significance, selects the terms that have high TF-IDF values, and then computes the semantic similarity of these terms using the external dictionary WordNet and the Term Similarity Weight Tree. Because the method selects only the terms with high TF-IDF values, it reduces the dimension of the document model more effectively than the traditional document model. At the same time, it analyzes the semantic information of these terms and is closer to the way humans understand documents. Our group has been working on text similarity in text mining, using several optimization techniques to compute text similarity, such as the semantic information of words, word sequences, and the syntactic structure of documents. Several papers have been published in reputable national journals.
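The word frequency baseline and the key-term selection step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (tf_idf, key_terms, cosine), the particular TF-IDF weighting formula, and the top-k cutoff are assumptions made for illustration, and the semantic comparison with WordNet and TSWT is deliberately left out because this record does not define it.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: (term count / document length) * log(N / document frequency).
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights


def key_terms(weights, k):
    """Keep the k highest-weighted terms of one document (ties broken alphabetically)."""
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return dict(top)


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy corpus (hypothetical data, already tokenized).
docs = [
    ["cat", "sat", "mat"],
    ["cat", "ate", "fish"],
    ["dog", "ate", "bone"],
]
weights = tf_idf(docs)
reduced = [key_terms(w, 2) for w in weights]  # lower-dimensional document models
baseline = cosine(weights[0], weights[1])     # plain TF-IDF cosine similarity
```

Selecting only the top-weighted terms shrinks each document model, but exact term matching on the reduced vectors can miss related words; that gap is what the term-level semantic similarity in the proposed method is meant to fill.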