计算机学报

	Chinese Journal of Computers Full Text
Title	Data Provenance in a Scientific Workflow Service Framework Integrated with Object Deputy Database
Authors	WANG Li-Wei1) HUANG Ze-Qian2) LUO Min2) PENG Zhi-Yong2),3)
Address	1)(International School of Software, Wuhan University, Wuhan 430072) 2)(State Key Laboratory of Software Engineering, Wuhan University, Wuhan 430072) 3)(Computer School, Wuhan University, Wuhan 430072)
Year	2008
Issue	No.5(721—732)
Abstract & Background	Abstract This paper proposes a database-integrated scientific workflow service framework that adopts the object deputy model to describe the execution of a series of scientific tasks, thereby allowing workflow management operations to be performed in a way analogous to traditional database management operations. Furthermore, based on the bi-directional pointer mechanism of the object deputy database, this paper introduces a new data provenance method. This approach is more efficient than annotation or inversion: it saves storage space and also reduces the cost of extra computation. A partial materialization scheme is also presented to improve the efficiency of data tracking, and the experimental results show that it provides better system performance.
Keywords scientific workflow; Web service; object deputy model; data provenance
Background This work is supported by the National Natural Science Foundation of China under grant No.60573095, the New Century Excellent Talents of Education Ministry under grant No.NCET-04-0675, the National High Technology Research and Development Program of China under grant No.2006AA12Z210, the Doctoral Foundation of Education Ministry under grant No.20050486024, the Humanities and Social Science research base projects of the Education Ministry in 2005 under grant No.05JJD870158, the Science and Technology research projects of Education Ministry under grant No.107072, and the National Basic Research and Development Program of China under grant No.2007CB310806. The most important function of scientific workflows is tracing the origin of data products, which is called data provenance. Data provenance provides derivation histories and explains the sources of data products.
Existing solutions for determining data provenance usually rely on either annotation, which records the derivation history of a data product alongside it, or inversion, which generates a "reverse" query to find the source data from which a data product was derived. Annotations may not scale well for fine-grained data, because the complete annotations may require more storage than the data itself. Inversion is more economical in storage, since a single inverse function or query identifies the provenance for an entire class of data. However, a reverse query or function must be generated and executed every time the provenance of a data product is requested. To overcome these shortcomings, this paper proposes a new data provenance method based on the bi-directional pointer mechanism of the object deputy model. The derivation history of a data product is constructed directly from the bi-directional pointers between the data product and its sources, which saves a large amount of storage space. The source data of derived data can also be found directly by following these pointers, without computing provenance through inverse queries or inverse functions, which improves query efficiency.
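
The pointer-based idea described above can be illustrated with a minimal in-memory sketch. It is not the paper's implementation and does not use the object deputy database. The names `DataObject`, `derive`, `provenance`, and `dependents` are hypothetical and chosen only for illustration. The sketch assumes that each derived object keeps backward pointers to its sources and that each source keeps forward pointers to the objects derived from it, so provenance is found by following pointers rather than by storing annotations or running an inverse query.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(eq=False)  # identity-based equality; avoids comparing cyclic structures
class DataObject:
    """A data product with pointers in both directions (hypothetical model)."""
    name: str
    value: float
    # Backward pointers: the objects this one was derived from.
    sources: List["DataObject"] = field(default_factory=list, repr=False)
    # Forward pointers: the objects derived from this one.
    deputies: List["DataObject"] = field(default_factory=list, repr=False)


def derive(name: str, sources: List[DataObject],
           fn: Callable[[List[float]], float]) -> DataObject:
    """Create a derived object and link it to its sources in both directions."""
    derived = DataObject(name, fn([s.value for s in sources]), sources=list(sources))
    for s in sources:
        s.deputies.append(derived)
    return derived


def provenance(obj: DataObject) -> List[DataObject]:
    """Follow backward pointers to collect every original (underived) source."""
    if not obj.sources:
        return [obj]
    found: List[DataObject] = []
    for s in obj.sources:
        for origin in provenance(s):
            if origin not in found:
                found.append(origin)
    return found


def dependents(obj: DataObject) -> List[DataObject]:
    """Follow forward pointers to collect everything derived from obj."""
    found: List[DataObject] = []
    for d in obj.deputies:
        for item in [d] + dependents(d):
            if item not in found:
                found.append(item)
    return found


# Usage: two raw inputs, one aggregate, and one further derivation.
raw_a = DataObject("raw_a", 2.0)
raw_b = DataObject("raw_b", 3.0)
total = derive("total", [raw_a, raw_b], sum)
scaled = derive("scaled", [total], lambda vs: vs[0] * 10)

print([o.name for o in provenance(scaled)])  # ['raw_a', 'raw_b']
print([o.name for o in dependents(raw_a)])   # ['total', 'scaled']
```

In this sketch the only per-object overhead is the two pointer lists, and a provenance lookup is a graph traversal with no recomputation of the derivation. How the actual system stores and maintains such pointers is not specified in this abstract.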