计算机学报

	Chinese Journal of Computers Full Text
Title	Optimizing the SIMD Parallelism Through Bitwidth Analysis
Authors	ZHANG Wei-Hua ZHU Jia-Hua ZHANG Hong-Jiang ZANG Bin-Yu
Address	(Parallel Processing Institute, Fudan University, Shanghai 200433)
Year	2009
Issue	No.11(2168—2177)
Abstract	Although SIMD units are widely used across architecture designs, automatic optimizations for them are not yet well developed. Because most optimizations for SIMD architectures are transplanted from traditional vectorization techniques, many special features of SIMD architectures, such as packed operations, have not been thoroughly considered. When operands are tightly packed within a register, there is no spare space to indicate overflow. To maintain the accuracy of automatically SIMDized programs, the operands must be unpacked to leave enough room for interim overflow. However, this strategy incurs considerable overhead. Moreover, the additional instructions for handling overflows can sometimes prevent other optimizations. In this paper, a new technique, BCSA (Bitwidth Controlled SIMD Arithmetic), is proposed to reduce the negative effects of interim overflow handling and to eliminate the interference of interim overflows. The algorithm is applied to the Berkeley multimedia benchmarks. The experimental results show that the algorithm can significantly improve the performance of multimedia applications.
Keywords	bitwidth analysis; overflow analysis; saturation operation; compiler optimization; parallelism
Background	Since multimedia has become a dominant computing field, almost all general-purpose processor (GPP) vendors have integrated multimedia extensions (MME) into their processors. Because multimedia applications offer abundant parallelism and require only low computational precision, most MME are implemented as Single Instruction Multiple Data (SIMD) instruction sets. Currently, programmers can mainly use these SIMD instructions only through inline assembly code or intrinsic functions. With these methods, development becomes extremely inefficient and the code is hard to port between platforms.
An alternative is to have the compiler automatically generate SIMD instructions from code written in standard high-level programming languages. Although SIMD optimization is a form of vectorization, traditional vectorization techniques cannot simply be transplanted to SIMD optimization because of the differences between vector processors and SIMD architectures. Currently, only a few compilers can speed up even individual multimedia applications. With support from the Specialized Research Fund for the Doctoral Program of Chinese Higher Education under Grant No. 20050246020, the National Natural Science Foundation of China under Grant No. 60273046, and the Shanghai Science and Technology Committee of China Key Project Funding (02JC14013), the authors carried out a series of studies to develop efficient SIMD optimization techniques. Based on an in-depth study of SIMD architectures and a broad analysis of multimedia workloads, they identified several useful techniques in this area: how to perform highly accurate data bitwidth analysis, how to exploit potential parallelism in saturation arithmetic mode, and how to automatically transform C programs into SIMD instructions based on Iburg. The authors implemented these techniques in the open-source compiler Gcc3.5 and in the parallelization research platform Aggassiz. Experimental results show that these methods are effective.
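The interim overflow problem described in the abstract can be illustrated with a small sketch in portable C. This is not the authors' BCSA algorithm: the function names `avg_packed8`, `avg_unpacked16`, and `sum_fits_in_8_bits` are hypothetical, and plain loops over 8-bit lanes stand in for real packed SIMD instructions. The sketch only shows why a packed 8-bit computation can lose the carry of an intermediate sum, why widening fixes it at the cost of halving the lanes per register, and how a known value range could justify staying packed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packed-style average: every lane is only 8 bits wide, so the carry
 * out of bit 7 of the interim sum a[i] + b[i] is silently lost. */
void avg_packed8(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n) {
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        uint8_t sum = (uint8_t)(a[i] + b[i]); /* wraps modulo 256 */
        out[i] = (uint8_t)(sum / 2);
    }
}

/* Unpacked average: each operand is widened to 16 bits first, so the
 * interim sum (at most 510) fits. A real SIMD register would now hold
 * only half as many lanes, which is the overhead the abstract refers to. */
void avg_unpacked16(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n) {
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        uint16_t sum = (uint16_t)((uint16_t)a[i] + (uint16_t)b[i]);
        out[i] = (uint8_t)(sum / 2);
    }
}

/* Hypothetical range check (an assumption of this sketch): if analysis
 * proves upper bounds on both operands, the packed form is safe whenever
 * the largest possible interim sum still fits in 8 bits. */
int sum_fits_in_8_bits(unsigned max_a, unsigned max_b) {
    return max_a + max_b <= UINT8_MAX;
}
```

For inputs `a = {200, 100, 10, 255}` and `b = {100, 100, 20, 255}`, the packed version yields `{22, 100, 15, 127}` while the widened version yields the correct `{150, 100, 15, 255}`. Only the lanes whose interim sum exceeds 255 differ, which is why knowing the operand bitwidths in advance can let a compiler keep the cheaper packed form where it is provably safe.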