|  | Chinese Journal of Computers Full Text |
| Title | Parallelization of LU Decomposition on the Godson-Tv1 Many-Core Architecture |
| Authors | LONG Guo-Ping, FAN Dong-Rui |
| Address | (Key Laboratory of Computer Systems and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) |
| Year | 2009 |
| Issue | No. 11 (2157–2167) |
| Abstract & Background | Abstract The many-core architecture is increasingly becoming a promising computing platform due to advances in semiconductor technology. LU decomposition is a widely used kernel in both scientific and engineering computations. Although there is much related work on traditional parallel architectures, little work has focused on parallelizing it on many-core architectures. This paper investigates the problem from three aspects: load balancing, latency hiding and performance modeling. This work makes three contributions. Firstly, a novel load balancing technique is introduced to overcome the limitations of 2D scatter decomposition. Experimental results show that the proposed scheme achieves a 20% performance improvement without optimization and a 40% improvement after optimization. Secondly, an analytical performance model is presented. A quantitative experimental study shows that, by carefully hiding memory latency through the on-chip memory hierarchy and for a selected block size, measured performance can approach the theoretical upper bound. Thirdly, experimental results reveal two primary causes that make the theoretical speedup hard to achieve: limited DRAM bandwidth and resource contention in the on-chip network. Keywords many-core architecture; LU decomposition; parallelization; latency tolerance; performance model Background The many-core architecture is increasingly becoming a promising computing platform due to advances in semiconductor technology. Back in 2002, IBM researchers reported their early evaluation results for the Cyclops many-core architecture. At the 2007 ISSCC conference, Intel announced an 80-core TeraFlops prototype chip. The IBM Cyclops architecture eventually evolved into the cache-less C64 architecture, which has been evaluated extensively by Prof. Guang R. Gao's group at the University of Delaware. 
This paper investigates the problem of parallelizing LU decomposition on Godson-T, a many-core architecture. The authors study the problem from three aspects: load balancing, latency hiding and performance modeling. Specifically, they propose a novel load balancing technique to overcome the limitations of 2D scatter decomposition, and an analytical model to understand the performance potential. The experimental results on the Godson-T platform provide interesting observations on how to parallelize applications on many-core architectures. |
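
The abstract refers to the limitations of 2D scatter decomposition without spelling them out. The sketch below is an illustration only, not the authors' scheme: it assumes the usual cyclic mapping in which block (i, j) belongs to core (i mod P, j mod Q), and a right-looking blocked LU in which step k updates only the trailing blocks with i, j > k. The function names, the 8 x 8 block matrix and the 4 x 4 core grid are hypothetical choices made for the example.

```python
def scatter_owner(i, j, p_rows, p_cols):
    """Core (row, col) that owns block (i, j) under a 2D scatter (cyclic) mapping."""
    return (i % p_rows, j % p_cols)


def trailing_update_load(n_blocks, k, p_rows, p_cols):
    """Count the trailing-submatrix blocks each core updates at elimination step k."""
    load = {(r, c): 0 for r in range(p_rows) for c in range(p_cols)}
    for i in range(k + 1, n_blocks):
        for j in range(k + 1, n_blocks):
            load[scatter_owner(i, j, p_rows, p_cols)] += 1
    return load


def imbalance(load):
    """Ratio of the busiest core's load to the mean load (1.0 is perfectly balanced)."""
    values = list(load.values())
    mean = sum(values) / len(values)
    return max(values) / mean if mean else float("inf")


if __name__ == "__main__":
    # Hypothetical sizes: 8 x 8 blocks on a 4 x 4 core grid.
    for step in range(7):
        load = trailing_update_load(8, step, 4, 4)
        idle = sum(1 for v in load.values() if v == 0)
        print(f"step {step}: idle cores = {idle}, imbalance = {imbalance(load):.2f}")
```

Under these assumptions the per-core block counts diverge as the trailing submatrix shrinks, and in the late steps some cores own no active blocks at all. This is one commonly cited weakness of a static scatter mapping; the paper's own analysis may emphasize other factors.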