deepirl_chainer
Incorrect AIRL reward
The paper suggests that the reward is given by f(s, a, s') - \log \pi(a | s) (which is the same as \log D - \log(1 - D)), but the reward in the repo is g(s, a). Why is there this discrepancy?
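For reference, this equivalence follows from the discriminator form given in the AIRL paper:

```
D_\theta(s, a, s') = \frac{\exp f_\theta(s, a, s')}{\exp f_\theta(s, a, s') + \pi(a \mid s)}

\log D_\theta - \log(1 - D_\theta) = f_\theta(s, a, s') - \log \pi(a \mid s)
```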
Hi rohitrango, thank you for asking the question! Here are the answers.
- The discriminator does not have to add -\log \pi to the reward value, because -\log \pi is already added to the reward in the PPO algorithm as an entropy term (see the sketch after this list).
- Of course, you could use f(s, a, s') as the reward, but I use g(s) as r(s) because the AIRL paper shows that r*(s) = g*(s) + const holds.
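To make both points concrete, here is a minimal sketch (NumPy only; the names `g`, `airl_reward`, and `ppo_policy_loss` are hypothetical and not this repo's API): the reward handed to the RL algorithm is g(s) alone, and the -\log \pi part of f - \log \pi comes from PPO's entropy bonus instead of the discriminator.

```python
import numpy as np

def airl_reward(g, state):
    """Reward passed to the RL algorithm: g(s) only.

    The -log pi(a|s) part of the paper's reward is deliberately
    omitted here; it is supplied by the entropy bonus below.
    """
    return g(state)

def ppo_policy_loss(ratio, advantage, log_pi, clip_eps=0.2, ent_coef=0.01):
    """Clipped PPO surrogate with an entropy bonus.

    The entropy bonus maximizes E[-log pi(a|s)], which is exactly
    the term the discriminator reward leaves out.
    """
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * advantage, clipped * advantage)
    entropy = -log_pi  # per-sample entropy estimate
    return -(surrogate + ent_coef * entropy).mean()
```

With this arrangement the per-step objective the policy maximizes is g(s) - c \log \pi(a | s), which matches the paper's f(s, a, s') - \log \pi(a | s) up to the shaping terms inside f.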
I hope this answers your questions.