Youtube-Code-Repository icon indicating copy to clipboard operation
Youtube-Code-Repository copied to clipboard

Issue with critic target in PPO

Open davidireland3 opened this issue 2 years ago • 1 comments

In the line used to define the returns, we use the GAE + values as the target for the critic to learn. Is this correct?

My intuition says no -- the target we are training towards does not represent the true value function; should the target for value of the current state not be the observed reward + value at the next state?

Thanks!

davidireland3 avatar Feb 12 '22 22:02 davidireland3

In the line used to define the returns, we use the GAE + values as the target for the critic to learn. Is this correct?

My intuition says no -- the target we are training towards does not represent the true value function; should the target for value of the current state not be the observed reward + value at the next state?

Thanks!

Hi, I just saw your comment. I think it is correct to use the GAE + Values as the target for the critic. Roughly speaking, the GAE is shown below. GAE_t + Value_t can be used as the estimation of Value in time t.

Kin9L avatar Apr 17 '22 08:04 Kin9L