计算机学报

	Chinese Journal of Computers Full Text
Title	An Application-Level Checkpointing Based on Extended Data Flow Analysis for OpenMP Programs
Authors	FU Hong-Yi DING Yan SONG Wei YANG Xue-Jun
Address	(Key Laboratory of Parallel and Distributed Processing, National University of Defence Technology, Changsha 410073) (School of Computer, National University of Defence Technology, Changsha 410073)
Year	2010
Issue	No. 10 (1809-1822)
Abstract & Background	Abstract With the wide application of multi-core processor architectures in high performance computing, fault tolerance for shared memory parallel programs has become a research hot spot. For years, checkpointing has been the dominant fault tolerance technique in this field. Recently, several checkpointing approaches for OpenMP programs have been proposed; however, most of them depend on special libraries or hardware platforms. This paper proposes a compiler-assisted application-level checkpointing approach for OpenMP programs. The scheme is platform-independent: through extended static data flow analysis, it automatically selects only the ‘must-be-saved’ variables to store in the checkpoint image, which reduces the overhead. It also maintains the global coherence of checkpoints by running a non-blocking protocol. The key issues of the approach are discussed in detail, and the experimental results and a comparison with similar work show that the proposed approach achieves promising performance.
Keywords fault tolerance; shared memory; OpenMP; application-level checkpointing; data flow analysis
Background This work addresses the reliability of large-scale parallel computing systems and focuses on checkpointing, which is widely used in this domain. For long-running scientific programs, periodically saving checkpoints and restarting from a checkpoint after a failure incur considerable time overhead. So far, research on lowering the overhead of saving and restoring checkpoints has mostly targeted message-passing parallel programs. This work considers checkpointing for shared memory parallel programs. The authors use an extended parallel data flow analysis to reduce the amount of checkpoint data as much as possible, which shortens the time spent on saving and restoring.
This research is supported by the National Natural Science Foundation of China under project #60621003, ‘The Key Techniques for TFLOPS High Performance Computing’. The project focuses on processor architecture for high performance computing, structure-aware internetwork, and scalable parallel algorithms and system software. The group has made multiple achievements, including programming models and compiler techniques for streaming processor architectures and for GPU architectures, and a fast failure recovery scheme based on parallel re-computing. These results have been published in several top-ranking conferences such as ISCA, PACT, PPoPP, and ICDCS. This work is dedicated to providing an application-level fault tolerance solution for shared memory systems and OpenMP programs, in order to improve the reliability of parallel computing systems built on emerging multi-core processors. It plays an important role in the project ‘The Key Techniques for TFLOPS High Performance Computing’.