baselines

GAE and Critic Loss (PPO2)

Open rohey opened this issue 4 years ago • 0 comments

Hello. Can you please explain why you are using mb_returns = mb_advs + mb_values, where mb_advs are the GAE advantages, as the returns to compute the critic loss? Shouldn't the value function approximately represent the discounted sum of rewards? E.g., R = gamma * R + rewards[i], followed by value_loss = value_loss + 0.5 * (R - values[i]).pow(2).
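To make the comparison concrete, here is a minimal NumPy sketch of the two critic targets being contrasted. It is not the baselines code: the function names, the `bootstrap_value` and `last_value` arguments, and the assumption of a single rollout segment with no episode boundaries (so no done flags) are all illustrative.

```python
import numpy as np

def discounted_returns(rewards, gamma, bootstrap_value=0.0):
    """Discounted sum of rewards, R = gamma * R + rewards[i], computed backwards.

    bootstrap_value (an assumption of this sketch) stands in for the value
    of the state that follows the last reward in the segment.
    """
    returns = np.zeros(len(rewards))
    R = bootstrap_value
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        returns[t] = R
    return returns

def gae_returns(rewards, values, last_value, gamma, lam):
    """Critic targets of the form advs + values, with advs from GAE(gamma, lam).

    Assumes one segment without terminations, so no done mask is applied.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    advs = np.zeros(T)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # GAE recursion: A_t = delta_t + gamma * lam * A_{t+1}
        lastgaelam = delta + gamma * lam * lastgaelam
        advs[t] = lastgaelam
    return advs + values
```

Under these assumptions, setting lam = 1 makes `gae_returns` coincide with `discounted_returns` bootstrapped from `last_value`, while lam = 0 gives the one-step target r_t + gamma * V(s_{t+1}). Intermediate values of lam sit between the two.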

If I understand correctly, the value function depends on the parameter γ and not on the parameter λ, based on the paper https://arxiv.org/pdf/1506.02438.pdf. However, if I use the GAE(γ, λ) advantages to compute the returns and use those returns to train the critic, wouldn't the value function become V(γ, λ) instead of V(γ)? And if that is true, would the TD residual still be computed correctly with V(γ, λ)?
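One way to frame the question is to write out what the target advs + values is, using the definitions from the linked paper in the infinite-horizon case. The symbol G_t^{(k)} for the k-step return is notation introduced here for illustration; this is a sketch of the algebra, not a statement about what the maintainers intended.

```latex
\begin{align*}
\delta_t &= r_t + \gamma V(s_{t+1}) - V(s_t), \\
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} &= \sum_{l=0}^{\infty} (\gamma\lambda)^l \,\delta_{t+l}, \\
G_t^{(k)} &= \sum_{l=0}^{k-1} \gamma^l r_{t+l} + \gamma^k V(s_{t+k}), \\
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} + V(s_t) &= (1-\lambda) \sum_{k=1}^{\infty} \lambda^{k-1} G_t^{(k)}.
\end{align*}
```

Read this way, every k-step return G_t^{(k)} is discounted by γ only, and λ only sets the weights used to average them. If V were exactly the γ-discounted value function of the policy, each G_t^{(k)} would have that same value as its expectation, and so would the weighted average. With an approximate V, the choice of λ changes the bias and variance of the target rather than the quantity being estimated, which suggests the critic is still fitting V(γ).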

rohey, Oct 11 '21 17:10