计算机学报

	Chinese Journal of Computers
Title	Study on an Average Reward Reinforcement Learning Algorithm
Authors	GAO Yang1) ZHOU Ru-Yi1) WANG Hao1) CAO Zhi-Xin2)
Address	1)(State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093) 2)(Jiangsu Smart Card Engineering Technology Research Center, Zhenjiang, Jiangsu 212300)
Year	2007
Issue	No. 8 (1372-1378)
Abstract	A large class of sequential decision-making problems is modeled as a Markov decision process (MDP). Problems whose systems have sojourn times can be modeled as a semi-Markov decision process (SMDP). When the system’s parameters are not known in advance, reinforcement learning is used to obtain optimal policies. In this paper, an approximation theorem for average reward reinforcement learning is proven by means of the theory of performance potentials. A novel average reward reinforcement learning algorithm, G-learning, is designed by approximating the value function of performance potentials. G-learning applies to both MDPs and SMDPs. Unlike the classical R-learning algorithm, G-learning uses the potential value of a reference state, rather than the average performance of the system, as its reference. G-learning is tested on an access-control queuing task and a production inventory task, and the experimental results show that it has better learning performance than R-learning and SMART.
Keywords	average reward reinforcement learning; performance potential; G-learning; Markov decision process; semi-Markov decision process
Background	This research is supported by the National Natural Science Foundation of China (60475026), "Research on Reinforcement Learning Technology and Its Application in Non-Markov Decision Process", and the National Science Fund for Distinguished Young Scholars (60325207), "Pattern Recognition and Artificial Intelligence". The projects focus mainly on novel reinforcement learning technologies and algorithms. Recently, it has been shown that the average reward model may outperform the infinite-horizon discounted model on some reinforcement learning problems. Traditional average reward learning algorithms suffer from using the system’s average reward as the reference. This work combines reinforcement learning with performance potentials and proposes a learning algorithm that can simplify the learning process so as to achieve high learning efficiency.
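For orientation, the classical R-learning baseline named in the abstract can be sketched as below. This is one common textbook formulation of tabular R-learning, not the G-learning algorithm of this paper, whose update rule is not given on this page. The function name `r_learning_step`, the step sizes `alpha` and `beta`, and the dictionary-based table are illustrative assumptions. The sketch shows where the running average-reward estimate `rho` enters the update, which is the quantity G-learning is said to replace with the potential of a reference state.

```python
from collections import defaultdict


def r_learning_step(q, rho, s, a, r, s_next, actions, alpha=0.1, beta=0.01):
    """One tabular R-learning update (textbook form, assumed here).

    q       : dict mapping (state, action) -> relative action value
    rho     : current estimate of the average reward per step
    returns : the updated estimate of rho (q is modified in place)
    """
    best_next = max(q[(s_next, b)] for b in actions)
    best_here = max(q[(s, b)] for b in actions)
    # Record whether the chosen action was greedy before its value changes.
    was_greedy = q[(s, a)] == best_here

    # Relative value update: the reward is measured against rho.
    q[(s, a)] += alpha * (r - rho + best_next - q[(s, a)])

    # The average-reward estimate is only adjusted after greedy actions,
    # so that exploratory moves do not bias it.
    if was_greedy:
        best_here_after = max(q[(s, b)] for b in actions)
        rho += beta * (r - rho + best_next - best_here_after)
    return rho


# Hypothetical single transition: state 0, action 0, reward 1.0, next state 1.
q_table = defaultdict(float)
rho_estimate = r_learning_step(q_table, 0.0, 0, 0, 1.0, 1, actions=[0, 1])
```

In this form the learner has to track `rho` alongside the value table, which is the dependence on the system's average reward that the background note describes as a drawback of traditional average reward algorithms.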