|  | Chinese Journal of Computers Full Text |
| Title | Nonparametric Model and Variational Bayesian Learning for Subspace Clustering |
| Authors | QING Xiang-Yun WANG Xing-Yu |
| Address | (College of Information Science and Technology, East China University of Science and Technology, Shanghai 200237) |
| Year | 2007 |
| Issue | No. 8 (1333-1343) |
| Abstract & Background | Abstract: The goal of subspace clustering is to group a given set of data represented by different feature subsets. As an unsupervised learning method, subspace clustering tries to discover patterns of "similarity examined under different presentations" and has attracted a great deal of interest and research in related domains. First, the "mean and variance shift" model proposed by Hoff is extended to a new nonparametric model of subspace clustering based on feature subsets. The advantage of this model is that a variational Bayesian method can be applied to it. The model, which integrates a Dirichlet process mixture model with a nonparametric model for selecting feature subsets, can automatically choose the number of clusters and perform subspace clustering. Posterior inference for the model is then carried out using Markov chain Monte Carlo. For computational reasons, the authors also propose a variational Bayesian method to learn the model parameters. Experimental results on simulated data and on the problem of clustering face images illustrate that the model can simultaneously select the relevant features and the data points that share a similar pattern under these features. Experiments on the "multiple feature database" from the UCI repository show that the variational Bayesian method, which requires no sampling, can infer the model parameters quickly. Keywords: mixture model; Dirichlet process; nonparametric Bayes; Markov chain Monte Carlo; variational learning. Background: Many tasks in specific domains, such as image segmentation, text and image classification, and web semantic information extraction, can be treated as clustering problems. The goal of cluster analysis is to partition a data set into clusters such that the data points in each cluster are more similar to each other than to those in other clusters. 
As one of the most fundamental unsupervised learning problems, it has been studied widely in the literature. However, data represented by a number of features may carry discriminative information only on a subset of those features. In particular, individual clusters may represent groupings on different (possibly overlapping) feature subsets, and it is interesting to discover such patterns, which highlight different facets of the similarity between data points. Subspace clustering was therefore proposed to solve the problem of simultaneously choosing the feature subset and selecting the data points given those features. The first subspace clustering algorithm, CLIQUE, was proposed by Agrawal R et al. in 1998 and was soon followed by many related methods. Friedman and Meulman (2004) developed an algorithm for clustering on subsets of attributes, whose clustering criteria and computational approaches were largely driven by heuristics. Hoff (2006) presented a model-based subspace clustering method that finds groups which differ from each other in their means and/or variances at one or more attributes. However, the "clustering shifts in mean and variance" model is learned by Markov chain Monte Carlo, and the computational cost may be prohibitive. The authors therefore extend Hoff's model to a new unified nonparametric model to which a variational Bayesian method can be applied to accelerate parameter estimation. Variational Bayesian approximations have been widely used in Bayesian learning to offset the high computational cost of exact Bayesian calculations. A nonparametric model needs little prior information, and model selection is determined by the data itself. The authors' model can simultaneously optimize over the number of components, the feature subset for each component, and the model parameters under both the MCMC and variational frameworks. 
The research is partially supported by the National Natural Science Foundation of China under grant No. 60674089 and the Doctoral Program of the Ministry of Education under grant No. 20040251010. One aim of these two projects is to develop an algorithm that can automatically partition signals into different clusters and discover latent patterns in human EEG. The work in this paper, as part of the fundamental theory, will be applied to these projects. The team's research aims to keep pace with new international research trends and to integrate studies on control theory, machine learning, and brain signals. |