计算机学报

	Chinese Journal of Computers Full Text
Title	A Survey on the Management of Uncertain Data
Authors	ZHOU Ao-Ying1) JIN Che-Qing1) WANG Guo-Ren2) LI Jian-Zhong3)
Address	1)(Shanghai Key Laboratory of Trustworthy Computing, Software Engineering Institute, East China Normal University, Shanghai 200062) 2)(School of Information Science and Engineering, Northeastern University, Shenyang 110004) 3)(School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001)
Year	2009
Issue	No.1(1-16)
Abstract & Background	Abstract With the rapid development of data gathering and processing in fields such as economics, the military, logistics, finance, and telecommunications, the importance of data uncertainty has been studied in depth. Uncertain data takes many forms, such as relational data, semistructured data, streaming data, and moving objects. Depending on the scenario and the data characteristics, dozens of data models have been developed, all stemming from the core possible world model, which comprises a huge number of possible world instances whose probabilities sum to 1. However, the number of possible world instances is far greater than the size of the uncertain database, which makes it infeasible to compute the final query result by combining intermediate results from every possible world instance. Heuristic techniques such as ordering and pruning must therefore be used to reduce the computation cost. This paper introduces the concepts, characteristics, and challenges of uncertain data management and reviews research progress in the area, including data models, preprocessing, integration, storage, indexing, and query processing.
Keywords uncertain data; possible world model; data integration; lineage; uncertain stream
Background This paper surveys recent research on uncertain data management within the database field. Data uncertainty arises in a wide range of applications, including economics, the military, logistics, finance, and telecommunications. The causes of uncertain data include, but are not limited to, the following: imprecise data caused by physical devices, networks, or the environment; use of a coarse-grained dataset; the need to meet special application requirements; incomplete datasets; and data integration. It is therefore critical to develop new techniques for managing uncertain databases.
Research on uncertain database management began in the late 1980s and has become a very active field today. Early work focused on extending the relational model with an additional probability field in order to process SQL-like queries, but the field has since grown much broader. Besides relational data, new data types such as semistructured data, streaming data, and moving objects are also being studied intensively, which raises numerous novel and sophisticated query processing issues. Neither the traditional techniques for deterministic data nor the techniques for probabilistic relational databases can handle such query tasks efficiently. Several survey papers on uncertain database management, each with a different emphasis, have appeared recently. Ré and Suciu summarized some major challenges in the field in 2007. Dalvi and Suciu described its foundations and challenges from a theoretical perspective in 2007. Aggarwal and Yu focused on algorithms and applications. The survey by Pei et al. mainly covered their own work. In contrast, this paper surveys existing work following the general pipeline for processing uncertain databases: modeling, preprocessing and cleaning, storage and indexing, and query processing. First, several uncertain data models for different data types, all stemming from the core possible world semantics, are presented, followed by the concepts of data preprocessing and cleaning. After an outline of storage and indexing techniques, work on concrete query tasks is reviewed, including relational operators, data lineage, skyline queries, ranking queries, stream queries, OLAP, and data mining.
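The possible world semantics referred to in the abstract can be illustrated with a minimal sketch. It assumes the simplest setting, in which every tuple of an uncertain relation exists independently with a given probability; this tuple-independence assumption, the sample sensor readings, and the function names possible_worlds and query_probability are illustrative choices for this sketch and are not taken from the paper. The sketch enumerates every possible world, checks that the world probabilities sum to 1, and answers a query by brute force, which shows why the approach does not scale: n uncertain tuples yield 2^n worlds.

```python
from itertools import product


def possible_worlds(tuples):
    """Enumerate the possible worlds of a tuple-independent uncertain relation.

    `tuples` is a list of (value, probability) pairs. Each tuple is assumed
    to exist independently with its probability (an assumption of this sketch).
    Yields (world, probability), where `world` is a tuple of the values present.
    """
    for mask in product([False, True], repeat=len(tuples)):
        world = []
        prob = 1.0
        for present, (value, p) in zip(mask, tuples):
            if present:
                world.append(value)
                prob *= p          # tuple exists in this world
            else:
                prob *= 1.0 - p    # tuple is absent from this world
        yield tuple(world), prob


def query_probability(tuples, predicate):
    """Probability that `predicate(world)` holds, summed over all worlds."""
    return sum(p for world, p in possible_worlds(tuples) if predicate(world))


# Hypothetical uncertain relation: three sensor readings with confidences.
readings = [("s1", 0.9), ("s2", 0.5), ("s3", 0.2)]

worlds = list(possible_worlds(readings))
total = sum(p for _, p in worlds)

# Probability that at least two readings are present.
at_least_two = query_probability(readings, lambda w: len(w) >= 2)

print(len(worlds))   # number of possible worlds for 3 tuples
print(total)         # sum of world probabilities
print(at_least_two)
```

Because the number of worlds doubles with every additional uncertain tuple, exhaustive enumeration of this kind is only practical for toy inputs, which is the motivation the abstract gives for ordering and pruning heuristics.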