
Question about the implementation of advantage function

Open · aoshuqiu opened this issue 4 years ago • 0 comments

In eRL_demo_PPOinSingleFile.py there are two get_reward_sum functions:

    def get_reward_sum_raw(self, buf_len, buf_reward, buf_mask, buf_value) -> (torch.Tensor, torch.Tensor):
        buf_r_sum = torch.empty(buf_len, dtype=torch.float32, device=self.device)  # reward sum

        pre_r_sum = 0
        for i in range(buf_len - 1, -1, -1):
            buf_r_sum[i] = buf_reward[i] + buf_mask[i] * pre_r_sum
            pre_r_sum = buf_r_sum[i]
        buf_advantage = buf_r_sum - (buf_mask * buf_value[:, 0])
        return buf_r_sum, buf_advantage

    def get_reward_sum_gae(self, buf_len, ten_reward, ten_mask, ten_value) -> (torch.Tensor, torch.Tensor):
        buf_r_sum = torch.empty(buf_len, dtype=torch.float32, device=self.device)  # old policy value
        buf_advantage = torch.empty(buf_len, dtype=torch.float32, device=self.device)  # advantage value

        pre_r_sum = 0
        pre_advantage = 0  # advantage value of previous step
        for i in range(buf_len - 1, -1, -1):
            buf_r_sum[i] = ten_reward[i] + ten_mask[i] * pre_r_sum
            pre_r_sum = buf_r_sum[i]
            buf_advantage[i] = ten_reward[i] + ten_mask[i] * (pre_advantage - ten_value[i])  # fix a bug here
            pre_advantage = ten_value[i] + buf_advantage[i] * self.lambda_gae_adv
        return buf_r_sum, buf_advantage
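
For reference, here is my understanding of the textbook GAE recursion, as a minimal sketch I wrote myself for comparison (it is not code from the repo; the function name gae_reference and all argument names are my own):

    import torch

    def gae_reference(rewards, values, dones, next_value=0.0, gamma=0.99, lam=0.95):
        # Textbook GAE:
        #   delta_t = r_t + gamma * (1 - done_t) * V(s_{t+1}) - V(s_t)
        #   A_t     = delta_t + gamma * lam * (1 - done_t) * A_{t+1}
        buf_len = rewards.shape[0]
        advantages = torch.empty(buf_len, dtype=torch.float32)
        next_advantage = 0.0
        for t in range(buf_len - 1, -1, -1):
            not_done = 1.0 - dones[t]
            delta = rewards[t] + gamma * not_done * next_value - values[t]  # V(s_t) is NOT scaled by gamma
            advantages[t] = delta + gamma * lam * not_done * next_advantage
            next_value = values[t]
            next_advantage = advantages[t]
        return advantages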

They both multiply the current state's value by a mask, and in the update_buffer function we can see that the mask already contains a gamma:

    ten_mask = (1.0 - torch.as_tensor(_trajectory[2], dtype=torch.float32)) * gamma
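
If I substitute that mask into the GAE branch above, I get the following expansion (my own derivation, written out only to show where my confusion comes from):

    # with ten_mask[i] = (1 - done_i) * gamma and pre_advantage = V(s_{i+1}) + lam * A_{i+1}:
    #
    #   buf_advantage[i] = r_i + gamma * (1 - done_i) * (V(s_{i+1}) + lam * A_{i+1})
    #                          - gamma * (1 - done_i) * V(s_i)
    #
    # while the textbook formula subtracts the current value without any gamma factor:
    #
    #   A_i = r_i + gamma * (1 - done_i) * V(s_{i+1}) - V(s_i) + gamma * lam * (1 - done_i) * A_{i+1}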

That means V(s_t) is multiplied by gamma, but as far as I understand we don't need a gamma on the current state's value. Is that right? I am having trouble understanding this, please help me out. Thanks.

aoshuqiu · Oct 15 '21 06:10