| ¡¡ | Chinese Journal of Computers Full Text |
| Title | An Incremental Fuzzy Decision Tree Classification Method for Data Streams Mining Based on Threaded Binary Search Trees |
| Authors | WANG Tao1) LI Zhou-Jun2) HU Xiao-Hua3) YAN Yue-Jin1) CHEN Huo-Wang1) |
| Address | 1)(School of Computer Science, National University of Defense Technology, Changsha 410073) 2)(School of Computer Science & Engineering, Beihang University, Beijing 100083) 3)(College of Information Science and Technology, Drexel University, Philadelphia PA£¬ USA) |
| Year | 2007 |
| Issue | No.8(1244¡ª1250) |
| Abstract & Background | Abstract Decision tree classification is a well-studied problem in data mining. Recently, there has been much interest in mining data streams. Domingos and Hulten have presented a one-pass algorithm. Their system, VFDT, uses Hoeffding inequality to achieve a probabilistic bound on the accuracy of the tree constructed. Gama et al. have extended VFDT in two directions. Their system VFDTc can deal with continuous data and use more powerful classification techniques at tree leaves. Peng et al. present soft discretization method to solve continuous attributes in data mining. This paper revisits this problem and implemented a system fVFDT on top of VFDT and VFDTc. It has the following four contributions: (1) It presents a threaded binary search trees (TBST) approach for efficiently handling continuous attributes. It builds a threaded binary search tree, and its processing time for values inserting is O(nlogn), while VFDT¡¯s processing time is O(n2). When a new example arrives, VFDTc need update O(logn) attribute tree nodes, but fVFDT just need update one necessary node. (2) It improves the method of getting the best split-test point of a given continuous attribute. Comparing to the method used in VFDTc, it improves from O(nlogn) to O(n) in processing time. (3) Comparing to VFDTc, fVFDT¡¯s candidate split-test number decrease from O(n) to O(logn). (4) It uses soft discretization method in data streams mining to solve the problem of noise data. keywords data streams; threaded binary search tree; continuous arribute; soft discretization; incremental; VFDT background This work is supported in part by the National Natural Science Foundation of China under grant No.60573057(Research on Some Key Algorithms in Data Mining). VFDT is an anytime system that builds decision trees using constant memory and constant time per example. It uses Hoeffding bounds to guarantee that its output is asymptotically nearly identical to that of conventional learner. VFDTc extends VFDT in two directions: The ability to deal with continuous data and the use of more powerful classification techniques at tree leaves. It¡¯s most relevant property is the ability to obtain a performance similar to a standard decision tree even for medium size datasets. In this paper, the authors present a new system fVFDT on top of VFDT and VFDTc. |