计算机学报

	Chinese Journal of Computers Full Text
Title	Versatile Multidimensional Histograms for Different Data Distributions
Authors	CAO Wei WANG Shan QIN Xiong-Pai WANG Qiu-Yue
Address	(Key Laboratory of Data Engineering and Knowledge Engineering, Renmin University of China, Beijing 100872)
Year	2008
Issue	No.6(1013—1024)
Abstract & Background	Abstract Traditional multidimensional histograms, which are widely used in cardinality estimation for conjunctive range query predicates in RDBMS’s query optimizers, take the assumption of the existence of correlations among attributes instead of the plausible AVI assumption. But they do not further discriminate between different degrees of correlations among attributes. Based on accurate measurements of data distributions, data correlated coefficients and value domain density, the authors propose different optimal multidimensional histograms for different data distributions, COCA-Hist. Also they analyze the worst cases for traditional MHist-2 histograms and find effective ways to alleviate the situation. The authors conduct experiments to compare the accuracy and performance between COCA-Hist, and MHist-2, GENHist and STHoles. The results demonstrate that COCA-Hist histograms are superior in accuracy and performance than MHist-2 either in average case or in worst case. In the soft functional dependence situation, COCA-Hist is much better in either accuracy or building-up time by orders of magnitudes than GENHist. Under limited space budgets, COCA-Hist is one order of magnitude efficient than STHoles in building-up time. While STHoles exhibits good accuracy under sufficient space budget, in average COCA-Hist can achieve relatively better accuracy than STHoles. Keywords multidimensional histograms; data correlated coefficients; value domain density; value domain parameter; average spread Background Multidimensional histograms are most influential and important for relational database query optimizers to accurately estimate the cardinality of the results of query predicates. In self-managing and self-tuning databases, many efforts have been put into improving the quality of query execution plans produced by query optimizers. So the statistics which are heavily relied on by query optimizers are key techniques of self-managing databases. In the database research community, many excellent kinds of histograms have been proposed to replace the fallible uniform distribution and attribute value independent assumptions. They start from what histograms should be like, rectangular shape, frequency approximation, value approximation, partitioning constraints or even more simply, just the queries, etc. to build the histograms. On the other hand, self-managing databases gather and analysis useful statistical information to reveal the hidden characteristics behind the data, so that they could adjust themselves to automatically reacting to the changing context. The authors found the famous histograms seldom take statistical characteristics of the underlying data into account when building up their own histograms. And they tried to combine both the above two sides of efforts to produce more accurate multidimensional histograms with less cost. As the authors showed in the paper, COCA-Hist, as the first to use such statistical information as correlation coefficients and value densities to describe multidimensional data distributions, exhibits superior quality and performance to those traditional multidimensional histograms. It is a data-driven method, and computing the statistical information can be done on the fly when the multidimensional data distribution matrix is being built thus inducing negligible burden. The research is partially supported by the National Natural Science Foundation of China under grant No.60473069 and No.60496325. Also it is partially supported by international cooperation program with HP Labs in the project of Large Scale Data Management. All the granted projects aim to explore and find novel, effective and efficient ways to manage large scale of data under both the rapid improvements of hardware and the more and more challenging demands of database applications. Their team has made much progress and achievements in the research direction. They set up a platform to test the quality of different multidimensional histograms. Also they have published some papers to share the ideas of using correlation coefficients to accurately and easily describe the different degrees of correlations. The research in this paper is the important part to improve the quality of relation query optimizers which, in turn, improve the performance of the whole database systems.