random-network-distillation
Why are there two calls to the policy, and where is the non-episodic characteristic of the intrinsic reward?
Hi, thanks for sharing. I was wondering if you could explain why we need two calls to apply_policy in cnn_gru_policy_dynamics.py, here https://github.com/openai/random-network-distillation/blob/f75c0f1efa473d5109d487062fd8ed49ddce6634/policies/cnn_gru_policy_dynamics.py#L69 and here https://github.com/openai/random-network-distillation/blob/f75c0f1efa473d5109d487062fd8ed49ddce6634/policies/cnn_gru_policy_dynamics.py#L83.
Also, I have another question. Based on the paper, the intrinsic reward should be non-episodic while the extrinsic reward is treated as episodic, but I couldn't find where this non-episodic characteristic is addressed for the intrinsic reward in the implementation. Shouldn't we also add the episodic reward (i.e., eprews) to the external reward (i.e., rews_ext)? https://github.com/openai/random-network-distillation/blob/f75c0f1efa473d5109d487062fd8ed49ddce6634/ppo_agent.py#L241
Really appreciate your responses.
There are two graphs created for the policy / predictor: one for rollout and one for optimization. At rollout time the time dimension has size 1, so it is better treated separately.
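For intuition, here is a minimal TF1-style sketch of that pattern (not the repo's actual code; layer sizes, shapes, and the action count are placeholders). The same network-building function is called twice, once with a time dimension of 1 for acting and once with the full rollout length for optimization, and the weights are shared between the two graphs through variable reuse:

```python
import tensorflow as tf

def apply_policy(obs_ph, nsteps, scope="policy"):
    # Build the policy head over a [batch, nsteps, features] observation tensor.
    # tf.AUTO_REUSE shares the weights between the two graphs.
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        nfeat = int(obs_ph.shape[-1])
        flat = tf.reshape(obs_ph, [-1, nfeat])            # merge batch and time dims
        h = tf.layers.dense(flat, 64, activation=tf.nn.relu, name="fc1")
        logits = tf.layers.dense(h, 4, name="pi")         # 4 actions, purely illustrative
        return tf.reshape(logits, [-1, nsteps, 4])

# Rollout graph: acting one step at a time, so the time dimension is 1.
obs_step = tf.placeholder(tf.float32, [None, 1, 84], name="obs_step")
logits_step = apply_policy(obs_step, nsteps=1)

# Optimization graph: the whole rollout (e.g. 128 steps) is fed at once.
obs_rollout = tf.placeholder(tf.float32, [None, 128, 84], name="obs_rollout")
logits_rollout = apply_policy(obs_rollout, nsteps=128)
```

Building the two graphs separately keeps the shapes static in each case while the optimizer still updates the same shared parameters.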
If you look at https://github.com/openai/random-network-distillation/blob/f75c0f1efa473d5109d487062fd8ed49ddce6634/ppo_agent.py#L294 you'll see the intrinsic and extrinsic advantages are combined there.
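As a rough illustration of where the non-episodic treatment lives (a hypothetical helper, not the repo's code): episode boundaries are simply ignored when computing returns/advantages for the intrinsic stream, and the two advantage streams are then mixed with coefficients rather than the intrinsic reward being added to rews_ext. The discounts and coefficients below are close to the values reported in the paper but are used here only for illustration:

```python
import numpy as np

def gae(rews, vals, last_val, dones, gamma, lam, use_dones):
    # Generalized advantage estimation; use_dones=False makes the stream non-episodic.
    T = len(rews)
    advs = np.zeros(T, dtype=np.float32)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        nextval = last_val if t == T - 1 else vals[t + 1]
        nonterminal = (1.0 - dones[t]) if use_dones else 1.0  # ignore episode ends for intrinsic stream
        delta = rews[t] + gamma * nonterminal * nextval - vals[t]
        lastgaelam = delta + gamma * lam * nonterminal * lastgaelam
        advs[t] = lastgaelam
    return advs

# Hypothetical per-step data for one environment.
rews_ext, rews_int = np.random.rand(128), np.random.rand(128)
vals_ext, vals_int = np.random.rand(128), np.random.rand(128)
dones = np.zeros(128)

adv_ext = gae(rews_ext, vals_ext, 0.0, dones, gamma=0.999, lam=0.95, use_dones=True)
adv_int = gae(rews_int, vals_int, 0.0, dones, gamma=0.99, lam=0.95, use_dones=False)

# The two streams are mixed into a single advantage for PPO, not summed as rewards.
adv = 2.0 * adv_ext + 1.0 * adv_int  # illustrative ext/int coefficients
```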