计算机学报

	Chinese Journal of Computers Full Text
Title	High-Bandwidth Memory Accessing Pipeline of General Purpose Processor
Authors	ZHANG Hao LIN Wei ZHOU Yong-Bin YE Xiao-Chun FAN Dong-Rui
Address	(Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190)
Year	2009
Issue	No.1 (142-151)
Abstract & Background	Abstract: Processor speed and memory capacity have increased near-exponentially. Memory latencies, however, have not improved as dramatically, and access times increasingly limit system performance. Low load-to-use latency is key to high memory performance, and increasing the bandwidth of the memory pipeline always helps, but high bandwidth brings more complexity and consumes more power. Based on an analysis of applications, the authors seek the performance headroom of the memory pipeline. They identify several useful characteristics of memory operations and present an optimized design for a high-bandwidth memory pipeline with low complexity, low latency, and low power. These decisions guide the design of the Godsonx processor: although the memory access bandwidth is doubled and performance increases by 8.6%, the extra area is only 1.7% of the original design. Keywords: high bandwidth; memory pipeline; cache; TLB. Background: Processor speed and memory capacity have increased near-exponentially. Memory latencies, however, have not improved as dramatically, and access times increasingly limit system performance. Low load-to-use latency is key to high memory performance, and many researchers have devoted considerable effort to raising the cache hit ratio, but the cost has always far exceeded the gain. Increasing the bandwidth of the memory pipeline always helps, but high bandwidth requires more cache ports. Although a multi-ported cache provides excellent bandwidth in each cycle, its latency, area, and power dissipation increase sharply with the number of ports. Memory access design therefore calls for a tradeoff between performance and complexity/latency, one that leans toward the performance side.
This work is supported by the National Natural Science Foundation of China (grant No.60736012) and the National Basic Research Program (973 Program) of China (grant No.2005CB321600).