Reinforcement-learning-with-tensorflow: What is the justification for centering the discounted rewards in REINFORCE?


Open ZefanW opened this issue 6 years ago • 0 comments

    def _discount_and_norm_rewards(self):
        # discount episode rewards
        discounted_ep_rs = np.zeros_like(self.ep_rs)
        running_add = 0
        for t in reversed(range(0, len(self.ep_rs))):
            running_add = running_add * self.gamma + self.ep_rs[t]
            discounted_ep_rs[t] = running_add

        # normalize episode rewards
        discounted_ep_rs -= np.mean(discounted_ep_rs)
        discounted_ep_rs /= np.std(discounted_ep_rs)
        return discounted_ep_rs

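I have seen this `normalize episode rewards` block in the vast majority of REINFORCE implementations, but so far I have not found a theoretical justification for it. Sites such as Stack Overflow offer plenty of opinions, but the reasons given vary widely and none of them is convincing.

Here is a simple example. Suppose the rewards of an MDP roughly follow a Gaussian distribution with mean 1 and a standard deviation much smaller than the mean, so that the per-step rewards of one episode are [1., 1., 1., 1., 1., 1.]. Take gamma to be a fairly common value, say 0.99. The discounted rewards are then approximately [5.85, 4.90, 3.94, 2.97, 1.99, 1.00].

After centering, regardless of whether the return at each step is larger or smaller than the corresponding state value, the gradient weight for the decisions in the first half of the episode is always positive and the weight for those in the second half is always negative. What is the justification for this widely used training trick?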
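The example can be checked numerically. The following is a minimal sketch, assuming a standalone function `discount_and_norm_rewards` (a hypothetical name, not part of the repository) that mirrors the quoted method but takes the reward list and `gamma` as arguments and also returns the unnormalized returns.

```python
import numpy as np

def discount_and_norm_rewards(ep_rs, gamma=0.99):
    """Hypothetical standalone version of the quoted method.

    Returns both the raw discounted returns and the normalized ones.
    """
    ep_rs = np.asarray(ep_rs, dtype=np.float64)
    discounted = np.zeros_like(ep_rs)
    running_add = 0.0
    # Accumulate the discounted return from the last step backwards.
    for t in reversed(range(len(ep_rs))):
        running_add = running_add * gamma + ep_rs[t]
        discounted[t] = running_add
    # Subtract the episode mean and divide by the episode standard deviation.
    normalized = (discounted - discounted.mean()) / discounted.std()
    return discounted, normalized

# Six steps with reward 1 each, as in the example above.
raw, normalized = discount_and_norm_rewards([1.0] * 6, gamma=0.99)
print(np.round(raw, 2))     # unnormalized discounted returns
print(np.sign(normalized))  # sign of the weight applied to each step
```

For this input the first three normalized values are positive and the last three are negative, which is the pattern the question describes: the sign depends only on a step's position in the episode, not on how the action compares with the state value.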

Apr 04 '19 07:04 ZefanW
