deepirl_chainer

Incorrect AIRL reward

Open rohitrango opened this issue 5 years ago • 1 comment

The paper suggests that the reward is given by f(s, a, s') - \log \pi(a | s) (which is the same as \log D - \log(1 - D)), but the reward in the repo is g(s, a). Why this discrepancy?
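For reference, the identity mentioned above can be checked numerically. This snippet is illustrative only (it is not from the repo) and uses the AIRL discriminator parameterization D = exp(f) / (exp(f) + \pi):

```python
# Numerical check of:  log D - log(1 - D) = f(s, a, s') - log pi(a | s)
# under the AIRL discriminator  D = exp(f) / (exp(f) + pi).
import numpy as np

f = np.array([0.5, -1.2, 2.0])         # arbitrary f(s, a, s') values
log_pi = np.array([-0.3, -2.1, -0.7])  # arbitrary log pi(a | s) values
pi = np.exp(log_pi)

D = np.exp(f) / (np.exp(f) + pi)       # AIRL discriminator output

lhs = np.log(D) - np.log(1.0 - D)      # log D - log(1 - D)
rhs = f - log_pi                       # f(s, a, s') - log pi(a | s)

assert np.allclose(lhs, rhs)           # the identity holds
```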

rohitrango · Dec 12 '19

Hi rohitrango, thank you for the question! The answers are as follows.

  1. The discriminator does not have to add -\log \pi(a | s) to the reward value, because -\log \pi(a | s) is already added to the reward inside the PPO algorithm as an entropy term (see the sketch after this list).
  2. Of course, you can use f(s, a, s') as the reward, but I use g(s) as r(s) because the AIRL paper shows that r*(s) = g*(s) + const holds.
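Here is a minimal sketch of point 1, with assumed, illustrative names (this is not the repo's API). PPO's per-step entropy bonus is a sample estimate of -\log \pi(a | s), so with an entropy coefficient of 1 the agent effectively maximizes g(s) - \log \pi(a | s):

```python
# Sketch: the -log pi term can live in the policy optimizer instead of
# the reward. Maximizing  E[g(s)] + beta * E[-log pi(a | s)]  with
# beta = 1 is the same as maximizing  E[g(s) - log pi(a | s)].
import numpy as np

def ppo_step_objective(g, log_pi, entropy_coef=1.0):
    """Per-step quantity the agent maximizes (illustrative helper)."""
    reward = g                   # the repo uses g(s) as the reward
    entropy_bonus = -log_pi      # PPO entropy term (sample estimate)
    return reward + entropy_coef * entropy_bonus

g = np.array([1.0, 0.2])
log_pi = np.array([-0.5, -1.3])
print(ppo_step_objective(g, log_pi))   # equals g - log_pi
```

This matches the AIRL reward f - \log \pi up to the shaping terms \gamma h(s') - h(s), which, as argued in the AIRL paper, do not change the optimal policy.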

I hope this answers your question.

uidilr · Dec 16 '19