计算机学报

	Chinese Journal of Computers Full Text
Title	Parallelization of LU Decomposition on the Godson-Tv1 Many-Core Architecture
Authors	LONG Guo-Ping FAN Dong-Rui
Address	(Key Laboratory of Computer Systems and Architecture,Institute of Computing Technology,Chinese Academy of Sciences, Beijing 100190)
Year	2009
Issue	No. 11 (pp. 2157-2167)
Abstract & Background	Abstract: The many-core architecture is becoming a promising computing platform thanks to advances in semiconductor technology. LU decomposition is a widely used kernel in both scientific and engineering computations. Although there is much related work on traditional parallel architectures, little work has focused on parallelizing it on many-core architectures. This paper investigates the problem from three aspects: load balancing, latency hiding, and performance modeling. The work makes three contributions. First, a novel load balancing technique is introduced to overcome the limitations of 2D scatter decomposition. Experimental results show that the proposed scheme achieves a 20% performance improvement without optimization and a 40% improvement after optimization. Second, an analytical performance model is presented. A quantitative experimental study shows that, by carefully hiding memory latency through the on-chip memory hierarchy and with a selected block size, measured performance can approach the theoretical upper bound. The experimental results also reveal two primary causes that make the theoretical speedup hard to achieve: limited DRAM bandwidth and resource contention on the on-chip network.
Keywords: many-core architecture; LU decomposition; parallelization; latency tolerance; performance model
Background: The many-core architecture is becoming a promising computing platform thanks to advances in semiconductor technology. Back in 2002, IBM researchers reported early evaluation results for the Cyclops many-core architecture. At the 2007 ISSCC conference, Intel announced an 80-core TeraFlops prototype chip. The IBM Cyclops architecture eventually evolved into the cache-less C64 architecture, which has been evaluated extensively by Prof. Guang R. Gao's group at the University of Delaware.
This paper investigates the problem of parallelizing LU decomposition on Godson-T, a many-core architecture. The authors study the problem from three aspects: load balancing, latency hiding, and performance modeling. Specifically, they propose a novel load balancing technique to overcome the limitations of 2D scatter decomposition, and an analytical model to understand the performance potential. The experimental results on the Godson-T platform provide interesting observations on how to parallelize applications on many-core architectures.