
Why are there two calls to the policy, and where is the non-episodic characteristic of the intrinsic reward?

Open · mehdimashayekhi opened this issue 6 years ago · 1 comment

Hi, thanks for sharing. I was wondering if you could explain why we need two calls to apply_policy in cnn_gru_policy_dynamics.py, here https://github.com/openai/random-network-distillation/blob/f75c0f1efa473d5109d487062fd8ed49ddce6634/policies/cnn_gru_policy_dynamics.py#L69 and here https://github.com/openai/random-network-distillation/blob/f75c0f1efa473d5109d487062fd8ed49ddce6634/policies/cnn_gru_policy_dynamics.py#L83

I also have another question. Based on the paper, the intrinsic reward should be non-episodic while the extrinsic reward is treated as episodic, but I couldn't find where this "non-episodic" characteristic is addressed for the intrinsic reward in the implementation. Shouldn't we also add the episodic reward (i.e., eprews) to the external reward (i.e., rews_ext)? https://github.com/openai/random-network-distillation/blob/f75c0f1efa473d5109d487062fd8ed49ddce6634/ppo_agent.py#L241

I really appreciate your responses.

mehdimashayekhi commented on Jan 10, 2019

There are two graphs created for the policy/predictor: one for rollout and one for optimization. This is because at rollout time the time dimension has size 1, so it is better treated separately.
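As a rough illustration (a minimal sketch, not the repo's actual code), the pattern looks like this: the same policy weights are applied once to single-step observations for rollout and once to full-length sequences for optimization, with variables shared between the two graphs. The `apply_policy_sketch` helper, the observation shapes, and the layer sizes below are hypothetical stand-ins, written in the TF 1.x style the repo uses.

```python
import tensorflow as tf  # TF 1.x style, matching the repo

def apply_policy_sketch(obs_ph, scope="policy"):
    """Hypothetical stand-in for apply_policy: builds the policy head for
    whatever (batch, time) shape obs_ph has. AUTO_REUSE shares the variables
    across both calls, so rollout and optimization use the same weights."""
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        h = tf.layers.dense(obs_ph, 64, tf.nn.relu, name="h")
        logits = tf.layers.dense(h, 4, name="pi")  # 4 = dummy action count
    return logits

# Graph 1: rollout time, one step per environment (time dimension = 1).
obs_step = tf.placeholder(tf.float32, [None, 1, 84])    # [nenvs, 1, obs_dim]
pi_rollout = apply_policy_sketch(obs_step)

# Graph 2: optimization time, the full rollout (time dimension = nsteps).
obs_batch = tf.placeholder(tf.float32, [None, 128, 84])  # [nenvs, nsteps, obs_dim]
pi_train = apply_policy_sketch(obs_batch)
```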

If you look at https://github.com/openai/random-network-distillation/blob/f75c0f1efa473d5109d487062fd8ed49ddce6634/ppo_agent.py#L294 you'll see the intrinsic and extrinsic advantages are combined there.
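As a sketch of how the episodic/non-episodic split and the combination can look (an illustration under assumptions, not the repo's exact code): the extrinsic advantages are computed with episode boundaries, the intrinsic advantages are computed as if the stream never ends (dones ignored), and the two are then mixed with coefficients before being fed to PPO. The `gae` helper, the coefficient values, and the random rollout data below are assumptions for the example.

```python
import numpy as np

def gae(rewards, values, dones, gamma, lam, use_dones):
    """Generalized advantage estimation over a rollout of length T.
    If use_dones is False the bootstrap is never cut at episode ends,
    which is how the intrinsic stream stays non-episodic."""
    T = len(rewards)
    advs = np.zeros(T)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t] if use_dones else 1.0
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        advs[t] = lastgaelam = delta + gamma * lam * nonterminal * lastgaelam
    return advs

# Hypothetical rollout data; the value arrays carry one extra bootstrap entry.
T = 5
rews_ext, rews_int = np.random.rand(T), np.random.rand(T)
vals_ext, vals_int = np.random.rand(T + 1), np.random.rand(T + 1)
dones = np.array([0, 0, 1, 0, 0], dtype=np.float32)

advs_ext = gae(rews_ext, vals_ext, dones, gamma=0.999, lam=0.95, use_dones=True)
advs_int = gae(rews_int, vals_int, dones, gamma=0.99,  lam=0.95, use_dones=False)

# Combined advantage fed to PPO, analogous to what ppo_agent.py does around L294.
ext_coeff, int_coeff = 2.0, 1.0  # illustrative weighting of the two streams
advs = ext_coeff * advs_ext + int_coeff * advs_int
```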

harri-edwards commented on Feb 1, 2019