Chinese Journal of Computers
  Title Text Segmentation Based on Model LDA
  Authors SHI Jing 1), HU Ming 1), SHI Xin 2), DAI Guo-Zhong 3)
  Address 1)(School of Computer Science and Engineering, Changchun University of Technology, Changchun 130012)
2)(Institute of Chemistry for Functionalized Materials, Liaoning Normal University, Dalian, Liaoning 116029)
3)(Computer Human Interaction and Intelligent Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing 100190)
  Year 2008
  Issue No.10 (1865-1873)
  Abstract & Background
Abstract Text segmentation is important in many fields, including information retrieval, summarization, language modeling, and anaphora resolution. The approach presented here models corpora and texts with Latent Dirichlet Allocation (LDA). Parameters are estimated by Gibbs sampling, a Markov chain Monte Carlo (MCMC) method, and word probabilities are represented under the model. Different latent topics are associated with observable words. In the experiments, whole Chinese sentences are taken as the elementary blocks. A variety of similarity metrics and several boundary-discovery approaches are tried. The best results show that the right combination of them yields an error rate far lower than that of other text segmentation algorithms.
Keywords text segmentation; Latent Dirichlet Allocation (LDA) model; similarity metric; boundary discovery
Background The research is supported by the National Natural Science Foundation of China. Existing work on text segmentation falls into two categories: lexical cohesion methods and multi-source methods. The former assume that text segments with similar vocabulary are likely to be part of a coherent topic segment. Implementations of this idea use word stem repetition, context vectors, entity repetition, semantic similarity, word distance models and word frequency models to detect cohesion. Approaches for finding the topic boundaries include sliding windows, lexical chains, dynamic programming, agglomerative clustering and divisive clustering. Multi-source methods utilize lexical cohesion metrics, cue phrases, prosodic features, ellipsis, anaphora, syntactic features, and language models to detect topic boundaries. Features are combined using decision trees, probabilistic models and maximum entropy models. Text segmentation aims to recover the structure of a text, and is therefore very useful in information retrieval, summarization, text understanding, anaphora resolution, language modeling and text navigation. Most research targets applications in information retrieval. Although much research has been done on text segmentation, few attempts are based on the LDA model. This paper introduces an approach that segments a document using word distributions computed with the LDA model.
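
The general idea of segmenting by LDA word distributions can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the per-sentence topic mixtures (`thetas`) and per-topic word distributions (`phi`) have already been estimated, and it uses cosine similarity with a fixed threshold as one illustrative choice among the similarity metrics and boundary-discovery approaches the abstract mentions. All names, the toy numbers, and the threshold value are hypothetical.

```python
import math

def word_distribution(theta, phi):
    """p(w | block) = sum over topics k of theta[k] * phi[k][w]."""
    vocab_size = len(phi[0])
    return [sum(theta[k] * phi[k][w] for k in range(len(theta)))
            for w in range(vocab_size)]

def cosine(p, q):
    """Cosine similarity between two word-probability vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def find_boundaries(thetas, phi, threshold=0.5):
    """Return (boundaries, sims); a boundary at index i means a new
    segment starts at sentence i. The threshold is an assumed value."""
    dists = [word_distribution(t, phi) for t in thetas]
    sims = [cosine(dists[i], dists[i + 1]) for i in range(len(dists) - 1)]
    boundaries = [i + 1 for i, s in enumerate(sims) if s < threshold]
    return boundaries, sims

# Toy inputs (hand-made, not estimated): 2 topics over a 4-word vocabulary.
phi = [[0.5, 0.5, 0.0, 0.0],
       [0.0, 0.0, 0.5, 0.5]]
# One topic mixture per sentence; sentences 0-1 lean to topic 0, 2-3 to topic 1.
thetas = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]

boundaries, sims = find_boundaries(thetas, phi)
print(boundaries)
```

In this toy case only the similarity between sentences 1 and 2 falls below the threshold, so a single boundary is placed before sentence 2. A fixed threshold is the simplest possible rule; approaches named in the background, such as sliding windows or dynamic programming, would replace the last step of `find_boundaries`.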