| | Chinese Journal of Computers Full Text |
| Title | A Heterogeneous Data Stream Clustering Algorithm |
| Authors | YANG Chun-Yu ZHOU Jie |
| Address | (Department of Automation, Tsinghua University, Beijing 100084) |
| Year | 2007 |
| Issue | No.8 (1364-1371) |
| Abstract & Background | Abstract: Data stream clustering is an important problem in data stream mining. Many real-world data streams have both continuous and categorical attributes, which together are usually called heterogeneous attributes. However, most existing stream mining algorithms can handle only continuous attributes or only categorical attributes. To the best of our knowledge, no algorithm has been designed to handle heterogeneous attributes. Simply omitting the categorical or the continuous attributes may lose important information about the data stream and degrade mining quality. This paper proposes a novel approach for clustering data streams with heterogeneous features. Firstly, it presents a Poisson arrival model of the data stream and gives an algorithm for updating the parameter of the process. Secondly, it defines a histogram description of the discrete attributes in a micro-cluster and a corresponding distance metric. Finally, it proposes a framework describing the generation, evolution, merging, and deletion of micro-clusters, and designs detailed algorithms for each procedure. Experimental results on public data sets show that the proposed algorithm is robust. Keywords: data mining; data stream; clustering; heterogeneous attributes; Poisson process. Background: In recent years, the ability to capture data measurements has increased steadily. As a result, large and continuously growing data sequences have become available. These sequences are often called data streams. Examples include sensor networks, Web click streams, and Internet traffic flows. The most important characteristics of a data stream are that it allows only one pass, arrives continuously, and evolves over time. Managing and mining data streams has gained much attention. As a fundamental machine learning and data mining technique, clustering has been studied extensively in both the machine learning and data mining communities.
In the data stream setting, however, the one-pass constraint and limited storage resources challenge traditional clustering algorithms designed for static database mining. The one-pass constraint means that the elements of a data stream can be accessed only once unless they are explicitly stored. The storage limitation means that not all elements of the data stream can be stored, even if some of them can be cached. Clustering algorithms designed for static database mining have to be modified to accommodate these constraints in a data stream environment. Many approaches have been proposed for data stream clustering. Among them, CluStream, which uses a pyramidal time frame with online and offline components, has proved efficient in many applications. Like other algorithms, including its modified version HPStream, CluStream is designed to handle only data streams with continuous, or so-called numeric, features. In real applications, however, many data streams contain both continuous and categorical attributes. A data stream with both continuous and categorical attributes is called a heterogeneous data stream. Inspired by the CluStream framework and driven by the urgent need to solve heterogeneous stream clustering problems, the authors propose an approach to heterogeneous data stream clustering that adopts the main framework of the CluStream algorithm. The authors refer to their approach as the HCluStream framework, short for Heterogeneous CluStream. |
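
The abstract does not give the paper's definitions of the histogram description or the distance metric, so the following Python sketch only illustrates the general idea and should not be read as the authors' HCluStream algorithm. It assumes a micro-cluster that keeps CluStream-style sums for the continuous attributes and one value-count histogram per categorical attribute. The class name `MicroCluster`, its fields, and the distance formula (Euclidean distance to the centroid plus, for each categorical attribute, one minus the relative frequency of the point's value) are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MicroCluster:
    """Hypothetical summary of heterogeneous stream points (not the paper's definition)."""
    n: int = 0
    linear_sum: list = field(default_factory=list)   # sum of each continuous attribute
    square_sum: list = field(default_factory=list)   # sum of squares of each continuous attribute
    histograms: list = field(default_factory=list)   # one Counter per categorical attribute

    def add(self, continuous, categorical):
        """Absorb one point in a single pass; the point itself is not stored."""
        if self.n == 0:
            self.linear_sum = [0.0] * len(continuous)
            self.square_sum = [0.0] * len(continuous)
            self.histograms = [Counter() for _ in categorical]
        self.n += 1
        for i, x in enumerate(continuous):
            self.linear_sum[i] += x
            self.square_sum[i] += x * x
        for hist, value in zip(self.histograms, categorical):
            hist[value] += 1

    def centroid(self):
        """Mean of the continuous attributes."""
        return [s / self.n for s in self.linear_sum]

    def distance(self, continuous, categorical):
        """Assumed metric: Euclidean part plus a histogram mismatch part."""
        numeric = math.sqrt(
            sum((x - c) ** 2 for x, c in zip(continuous, self.centroid()))
        )
        # A value seen in every absorbed point contributes 0; an unseen value contributes 1.
        mismatch = sum(
            1.0 - hist[value] / self.n
            for hist, value in zip(self.histograms, categorical)
        )
        return numeric + mismatch

    def merge(self, other):
        """Merge another micro-cluster by adding its sums and histograms."""
        if other.n == 0:
            return
        if self.n == 0:
            self.linear_sum = [0.0] * len(other.linear_sum)
            self.square_sum = [0.0] * len(other.square_sum)
            self.histograms = [Counter() for _ in other.histograms]
        self.n += other.n
        for i in range(len(self.linear_sum)):
            self.linear_sum[i] += other.linear_sum[i]
            self.square_sum[i] += other.square_sum[i]
        for hist, other_hist in zip(self.histograms, other.histograms):
            hist.update(other_hist)
```

Because every field is additive, a point can be absorbed in one pass and two micro-clusters can be merged without revisiting past data, which is the property the one-pass and storage constraints described above call for. How the two distance parts should be weighted against each other is left open here; in practice the continuous attributes would likely need normalization first.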