random-network-distillation
Why are there two calls to the policy, and where is the non-episodic characteristic of the intrinsic reward?
Hi, thanks for sharing. I was wondering if you could explain why we need two calls to apply_policy in cnn_gru_policy_dynamics.py, here https://github.com/openai/random-network-distillation/blob/f75c0f1efa473d5109d487062fd8ed49ddce6634/policies/cnn_gru_policy_dynamics.py#L69 and here https://github.com/openai/random-network-distillation/blob/f75c0f1efa473d5109d487062fd8ed49ddce6634/policies/cnn_gru_policy_dynamics.py#L83.
Also, I have another question. Based on the paper, the intrinsic reward should be non-episodic while the extrinsic reward is treated as episodic, but I couldn't find where this non-episodic characteristic is addressed for the intrinsic reward in the implementation. Shouldn't we also add the episodic reward (i.e., eprews) to the external reward (i.e., rews_ext)? https://github.com/openai/random-network-distillation/blob/f75c0f1efa473d5109d487062fd8ed49ddce6634/ppo_agent.py#L241
Really appreciate your responses.
There are two graphs created for the policy / predictor: one for rollout and one for optimization. At rollout time the time dimension has size 1, so it is better treated separately.
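For intuition, here is a minimal TF1-style sketch of that pattern (not the repo's actual code; layer sizes, shapes, and the action count are placeholders). The same network-building function is called twice, once with a time dimension of 1 for acting and once with the full rollout length for optimization, and the weights are shared between the two graphs through variable reuse:

```python
import tensorflow as tf

def apply_policy(obs_ph, nsteps, scope="policy"):
    # Build the policy head over a [batch, nsteps, features] observation tensor.
    # tf.AUTO_REUSE shares the weights between the two graphs.
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        nfeat = int(obs_ph.shape[-1])
        flat = tf.reshape(obs_ph, [-1, nfeat])            # merge batch and time dims
        h = tf.layers.dense(flat, 64, activation=tf.nn.relu, name="fc1")
        logits = tf.layers.dense(h, 4, name="pi")         # 4 actions, purely illustrative
        return tf.reshape(logits, [-1, nsteps, 4])

# Rollout graph: acting one step at a time, so the time dimension is 1.
obs_step = tf.placeholder(tf.float32, [None, 1, 84], name="obs_step")
logits_step = apply_policy(obs_step, nsteps=1)

# Optimization graph: the whole rollout (e.g. 128 steps) is fed at once.
obs_rollout = tf.placeholder(tf.float32, [None, 128, 84], name="obs_rollout")
logits_rollout = apply_policy(obs_rollout, nsteps=128)
```

Building the two graphs separately keeps the shapes static in each case while the optimizer still updates the same shared parameters.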
If you look at https://github.com/openai/random-network-distillation/blob/f75c0f1efa473d5109d487062fd8ed49ddce6634/ppo_agent.py#L294 you'll see the intrinsic and extrinsic advantages are combined there.
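As a rough illustration of where the non-episodic treatment lives (a hypothetical helper, not the repo's code): episode boundaries are simply ignored when computing returns/advantages for the intrinsic stream, and the two advantage streams are then mixed with coefficients rather than the intrinsic reward being added to rews_ext. The discounts and coefficients below are close to the values reported in the paper but are used here only for illustration:

```python
import numpy as np

def gae(rews, vals, last_val, dones, gamma, lam, use_dones):
    # Generalized advantage estimation; use_dones=False makes the stream non-episodic.
    T = len(rews)
    advs = np.zeros(T, dtype=np.float32)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        nextval = last_val if t == T - 1 else vals[t + 1]
        nonterminal = (1.0 - dones[t]) if use_dones else 1.0  # ignore episode ends for intrinsic stream
        delta = rews[t] + gamma * nonterminal * nextval - vals[t]
        lastgaelam = delta + gamma * lam * nonterminal * lastgaelam
        advs[t] = lastgaelam
    return advs

# Hypothetical per-step data for one environment.
rews_ext, rews_int = np.random.rand(128), np.random.rand(128)
vals_ext, vals_int = np.random.rand(128), np.random.rand(128)
dones = np.zeros(128)

adv_ext = gae(rews_ext, vals_ext, 0.0, dones, gamma=0.999, lam=0.95, use_dones=True)
adv_int = gae(rews_int, vals_int, 0.0, dones, gamma=0.99, lam=0.95, use_dones=False)

# The two streams are mixed into a single advantage for PPO, not summed as rewards.
adv = 2.0 * adv_ext + 1.0 * adv_int  # illustrative ext/int coefficients
```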