计算机学报

	Chinese Journal of Computers Full Text
Title	Real-Time Processing for High Speed Data Stream over Large Scale Data
Authors	QI Kai-Yuan 1),2); ZHAO Zhuo-Feng 1),3); FANG Jun 1),3); MA Qiang 1),2)
Address	1)(Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) 2)(Graduate University of Chinese Academy of Sciences, Beijing 100190) 3)(College of Information Engineering, North China University of Technology, Beijing 100144)
Year	2012
Issue	No. 3 (477-490)
Abstract & Background	Abstract With the development of the Internet of Things (IoT), computing over both real-time and historical sensor data has become central to IoT applications, and supporting real-time processing of high-speed data streams over large-scale data poses a new challenge. However, existing large-scale data processing technology based on the MapReduce model is designed for batch processing and cannot satisfy the real-time requirement. Based on theoretical and practical analysis, this paper proposes a method for large-scale data processing under high-speed data streams and removes technical bottlenecks such as the local staged pipeline and intermediate result storage. We tune the configuration of the staged pipeline dynamically using system information to utilize the CPU efficiently, and we design the data structure, read/write strategy, and replacement algorithm to optimize high-concurrency access to local intermediate results. Experiments show that this method improves the real-time performance and scalability of data stream processing over large-scale historical data.
Keywords data stream processing; large scale data processing; MapReduce; Internet of Things; big data; cloud computing
Background Large-scale data processing and data stream processing are classical research topics in data management. With the development of cloud computing and the Internet of Things, computing over both real-time and historical sensor data has become central to Internet of Things applications, and real-time computing for high-speed data streams over large-scale persistent data poses a new challenge. The existing MapReduce architecture for large-scale data processing is based on batch processing and cannot satisfy the real-time requirement, while previous data stream processing architectures cannot deal with large-scale historical data.
This paper first proves that the MapReduce model can be used for large-scale data processing under high-speed data streams, then proposes a new MapReduce architecture for such applications and removes technical bottlenecks such as the local staged pipeline and intermediate result storage. On a benchmark drawn from a real IoT application, this MapReduce architecture improves performance and scalability for data stream processing over large-scale data compared with previous work. This work was supported by the National Natural Science Foundation of China under Grant Nos. 60903137 and 61003294.