计算机学报

	Chinese Journal of Computers Full Text
Title	A Hierarchy Distance Computing Based Clustering Algorithm
Authors	PENG Jing1),2),3) TANG Chang-Jie2) CHENG Wen-Quan3) SHI Bao-Mei3) QIAO Shao-Jie2)
Address	1)(School of Electronics Engineering and Computer Science, Peking University, Beijing 100871) 2)(School of Computer Science, Sichuan University, Chengdu 610065) 3)(Department of Science and Technology, Chengdu Municipal Public Security Bureau, Chengdu 610017)
Year	2007
Issue	No.5(786—795)
Abstract & Background	Abstract To deal with the hierarchy coding data structure widely existed in application, this paper proposes a new conception of hierarchy distance and proves its mathematical properties. It also proposes and implements a new clustering algorithm-HDCA (Hierarchy Distance Computing based clustering Algorithm) based on hierarchy distance. The new algorithm overcomes the shortage of traditional algorithm and improves the precision. The paper also proposes a fast algorithm to compute the median of a hierarchy coding data set, and gives a clear proof of the algorithm. Extensive experiments demonstrate that HDCA is much faster than the naive algorithm to compute the median of hierarchy coding data. The new algorithm has been applied in the data analysis of transient population for public security successfully. keywords clustering; hierarchy distance; hierarchy coding variants; k-medoids; data mining background This research belong to the National Science Foundation of China project- "The research on key techniques in knowledge discovery based on Gene Expression programming" (60473071), and is also supported by China Postdoctoral Science Foundation under Grant No.20060400002, Sichuan Youth Science and Technology Foundation under Grant No.07ZQ026-055, National Science Foundation of China (Nos.60503037, 60473051), National High-Tech Research and Development Plan of China (No.2006AA01Z230) and the Beijing Natural Science Foundation of China (No.4062018). Current management systems of transient population only support data query and simple statistic, and don’t have any function of deep data analysis such as data mining. So it is difficult to help people making decision. Data clustering is an important branch of data mining. It does clusters process without domain knowledge. The clustering results are naturally computed from the actual distribution of data sets. The main data structures of variants that exist in clustering algorithms processing include interval-scaled variants, binary variants, nominal, ordinal variants, and ratio variants. For the other data structures are often converted into interval-scaled variants or nominal structure variants, i.e. distance between two objects are defined as 1 if two objects are equal or 0 otherwise. In real application such as information system of transient population, a lot of special information data structures are coded in hierarchy, such as regionalism, domain, with following features: (1) Variants can be divided into several parts, and the relationship of these parts is hierarchy; (2) High degree of similarity lies among those variants that have more same prefixes; otherwise there is low degree of similarity. To deal with the hierarchy coding data structure widely existed in application of transient population, this paper proposes a new data clustering algorithm-HDCA(Hierarchy Distance Computing based clustering Algorithm). The main contributions of this paper are: (1) Propose a new distance computing method for variants of HCDS (hierarchy coding data structure). (2) Modify k-medoids clustering algorithm, and propose a fast computing clustering median algorithm for hierarchy coded data sets. (3) Prove the accuracy and efficiency of the HDCA algorithm from theory and experiment. (4) Successfully apply HDCA algorithm into the clustering analysis of transient population for public security.