deepirl_chainer
Incorrect AIRL reward
The paper suggests that the reward is given by f(s, a, s') - \log \pi(a | s) (which is the same as \log D - \log(1 - D)), but the reward in the repo is g(s, a). Why is there this discrepancy?
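For reference, this equivalence follows from the discriminator form given in the AIRL paper:

```
D_\theta(s, a, s') = \frac{\exp f_\theta(s, a, s')}{\exp f_\theta(s, a, s') + \pi(a \mid s)}

\log D_\theta - \log(1 - D_\theta) = f_\theta(s, a, s') - \log \pi(a \mid s)
```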
Hi rohitrango, thank you for asking the question! Here are the answers.
- The discriminator does not have to add -\log \pi to the reward value, because -\log \pi is already added to the reward in the PPO algorithm as an entropy term (see the sketch after this list).
- Of course, you could use f(s, a, s') as the reward, but I use g(s) as r(s) because the AIRL paper shows that r*(s) = g*(s) + const holds.
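To make both points concrete, here is a minimal sketch (NumPy only; the names `g`, `airl_reward`, and `ppo_policy_loss` are hypothetical and not this repo's API): the reward handed to the RL algorithm is g(s) alone, and the -\log \pi part of f - \log \pi comes from PPO's entropy bonus instead of the discriminator.

```python
import numpy as np

def airl_reward(g, state):
    """Reward passed to the RL algorithm: g(s) only.

    The -log pi(a|s) part of the paper's reward is deliberately
    omitted here; it is supplied by the entropy bonus below.
    """
    return g(state)

def ppo_policy_loss(ratio, advantage, log_pi, clip_eps=0.2, ent_coef=0.01):
    """Clipped PPO surrogate with an entropy bonus.

    The entropy bonus maximizes E[-log pi(a|s)], which is exactly
    the term the discriminator reward leaves out.
    """
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * advantage, clipped * advantage)
    entropy = -log_pi  # per-sample entropy estimate
    return -(surrogate + ent_coef * entropy).mean()
```

With this arrangement the per-step objective the policy maximizes is g(s) - c \log \pi(a | s), which matches the paper's f(s, a, s') - \log \pi(a | s) up to the shaping terms inside f.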
I hope this answers your questions.