curiosity-driven-exploration-pytorch

Are you actually using the learned intrinsic reward for the agent?

Open ferreirafabio opened this issue 4 years ago • 6 comments

Hi,

I can only see that you optimize the intrinsic loss in your code. Can you point me to the line where you add the intrinsic rewards to the actual environment/extrinsic rewards?

In some places in your code I see comments like `# total reward = int reward`, which, according to the original paper, would be wrong, no?

Thank you.

ferreirafabio avatar Feb 20 '21 16:02 ferreirafabio

Also new to the repo, but here the loss is composed of both intrinsic and extrinsic reward: https://github.com/jcwleo/curiosity-driven-exploration-pytorch/blob/bacbefdfbdbc4c4382ab67147c9c8410305a4978/agents.py#L144

ruoshiliu avatar Mar 03 '21 18:03 ruoshiliu

Thanks @ruoshiliu. Yes, I saw the loss. But in addition to optimizing that loss, you also need to feed the intrinsic reward itself to the agent as part of its reward signal, as stated in the paper. Only minimizing the curiosity loss is not the same as actually using the intrinsic reward to train the policy.
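For concreteness, a minimal sketch of that distinction (variable names are illustrative, not this repo's code): the ICM loss updates the curiosity module's parameters, while the intrinsic reward value itself, detached from the graph, is added to the environment reward the policy is trained on.

```python
import torch

def icm_step(ext_reward, feat_next, pred_feat_next, inverse_loss, eta=0.01):
    # (1) Intrinsic reward = forward-model prediction error, detached so the
    #     policy update does not backprop into the curiosity module.
    int_reward = 0.5 * eta * (pred_feat_next - feat_next).pow(2).sum(-1).detach()
    # (2) The reward the agent should actually be trained on, per the paper:
    #     r_t = r^i_t + r^e_t
    total_reward = ext_reward + int_reward
    # (3) The ICM loss that updates the inverse/forward models (the part already
    #     discussed above for agents.py).
    forward_loss = 0.5 * (pred_feat_next - feat_next).pow(2).sum(-1).mean()
    icm_loss = inverse_loss + forward_loss
    return total_reward, icm_loss
```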

ferreirafabio avatar Mar 03 '21 18:03 ferreirafabio

@ferreirafabio What do you mean by "use the intrinsic rewards"? Can you point out which section of the paper states that?

ruoshiliu avatar Mar 04 '21 01:03 ruoshiliu

By that I mean reward = extrinsic reward + intrinsic reward. From the paper:

[screenshot of the paper's total reward equation: r_t = r_t^i + r_t^e, with the extrinsic reward r_t^e mostly (if not always) zero]

I now realize the paper says the extrinsic reward can be optional. I'm wondering what is "usually" used (with or without the extrinsic reward) when peers use ICM as a baseline.
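For what it's worth, the two setups usually compared reduce to a one-line difference; a minimal sketch (names are illustrative, not from this repo):

```python
def total_reward(int_reward, ext_reward, use_extrinsic=True):
    # r_t = r^i_t + r^e_t when the environment reward is used;
    # r_t = r^i_t in the curiosity-only ("no extrinsic reward") setting.
    return int_reward + (ext_reward if use_extrinsic else 0.0)
```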

ferreirafabio avatar Mar 04 '21 06:03 ferreirafabio

Thank you for the clarification. Let me make sure I understand your question: you are saying that the code (referenced above) tries to minimize the loss function by maximizing the extrinsic reward and minimizing the intrinsic reward, whereas the correct implementation should reflect equation (7) below. In other words, it should find the policy π that maximizes both the intrinsic and the extrinsic reward, and parameters for the inverse and forward models that minimize L_I and L_F.

Did I interpret your question correctly?

[screenshot of equation (7) from the paper: min over θ_P, θ_I, θ_F of −λ E_π(s_t; θ_P)[Σ_t r_t] + (1 − β) L_I + β L_F]
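A minimal sketch of how equation (7) can be combined into a single scalar loss, assuming policy_loss is already a negated return estimate (e.g. a PPO/A2C surrogate); names are illustrative, not this repo's code:

```python
def joint_loss(policy_loss, inverse_loss, forward_loss, lam=0.1, beta=0.2):
    # Eq. (7): minimize over policy/model parameters
    #   -lambda * E_pi[sum_t r_t] + (1 - beta) * L_I + beta * L_F
    # Since policy_loss ~= -E_pi[sum_t r_t], minimizing lam * policy_loss
    # maximizes the expected (intrinsic + extrinsic) return.
    return lam * policy_loss + (1.0 - beta) * inverse_loss + beta * forward_loss
```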

ruoshiliu avatar Mar 04 '21 22:03 ruoshiliu

Yes

ferreirafabio avatar Mar 04 '21 22:03 ferreirafabio