Reinforcement-learning-with-tensorflow

simply_PPO: why does interaction with the environment use pi instead of old_pi?

Open · Qiyangcao opened this issue · 3 comments

    with tf.variable_scope('sample_action'):
        self.sample_op = tf.squeeze(pi.sample(1), axis=0)  # choosing action

The code above uses the new pi when interacting with the environment, but according to the surrogate loss formula:

[screenshot of the surrogate loss formula]

shouldn't old_pi be the policy used for the interaction?

Qiyangcao · Jun 05 '19
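For readers who cannot see the screenshot: the surrogate objective being referred to is presumably the clipped objective from the PPO paper, reconstructed here only for reference (the hatted A_t is the advantage estimate and epsilon is the clipping range):

```latex
% Clipped surrogate objective from the PPO paper (Schulman et al., 2017),
% reconstructed for reference; the missing screenshot is assumed to show this
% or an equivalent form.
L^{CLIP}(\theta)
  = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}\!\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\big)\Big],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}
```

The expectation is taken over samples drawn from the old policy, which is exactly what the question is about: whichever policy generated the trajectories plays the role of old_pi in the ratio.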

Check algorithm 3 on page 12 in this paper. Gradients should be computed with respect to pi & these gradients are used for global & local updates.

EDIT: See algorithm 1 (line 2) on page 3 in the same paper for details on trajectory sampling.

ChuaCheowHuan · Jul 03 '19

> Check algorithm 3 on page 12 in this paper. Gradients should be computed with respect to pi & these gradients are used for global & local updates.

Can you read Chinese?

hyc6668378 · Sep 22 '19

Why is pi used to choose actions instead of old_pi? It comes down to the order of the code that follows. The rough flow of the later code is: first use pi to collect one episode of transitions (s1, a1, s2, a2, ...); then copy pi's parameters into old_pi; then keep old_pi fixed and update pi several times. Since pi and old_pi are identical at the moment the actions are sampled, this is equivalent to choosing the actions with old_pi.

planegoose · May 08 '20
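To make that ordering concrete, here is a minimal, framework-free sketch. It is NumPy only, with a hypothetical one-parameter Gaussian policy, dummy advantages, and a crude finite-difference gradient; nothing in it is taken from the repository, it only mirrors the sample-with-pi, copy-to-old_pi, update-pi-many-times loop described above:

```python
import numpy as np

np.random.seed(0)

STD, EPSILON, LR, UPDATE_STEPS = 1.0, 0.2, 0.05, 10
pi_mean = 0.0    # parameter of the current policy "pi"
old_mean = 0.0   # parameter of "old_pi"

def log_prob(mean, a):
    """Log-density of a Gaussian policy with the given mean and fixed std."""
    return -0.5 * ((a - mean) / STD) ** 2 - np.log(STD * np.sqrt(2 * np.pi))

def surrogate(mean, old, actions, adv):
    """Clipped PPO surrogate evaluated for a candidate value of pi's mean."""
    ratio = np.exp(log_prob(mean, actions) - log_prob(old, actions))
    clipped = np.clip(ratio, 1 - EPSILON, 1 + EPSILON)
    return np.minimum(ratio * adv, clipped * adv).mean()

for episode in range(3):
    # 1) Interact with a dummy "environment" using the CURRENT policy pi.
    #    At this point pi and old_pi still hold identical parameters.
    actions = np.random.normal(pi_mean, STD, size=64)
    advantages = -(actions - 1.0) ** 2   # dummy advantage: actions near 1 are good

    # 2) Copy pi into old_pi (the "update old pi" assignment in the TF code).
    old_mean = pi_mean

    # 3) Keep old_pi fixed and update pi several times on the clipped surrogate.
    for _ in range(UPDATE_STEPS):
        eps = 1e-4  # finite-difference step for a crude gradient estimate
        grad = (surrogate(pi_mean + eps, old_mean, actions, advantages)
                - surrogate(pi_mean, old_mean, actions, advantages)) / eps
        pi_mean += LR * grad  # gradient ascent on the surrogate

    print(f"episode {episode}: pi mean -> {pi_mean:.3f}")
```

The key point is step 2: because pi is copied into old_pi immediately after the rollout, the data used in step 3 was effectively collected by old_pi, so the probability ratio in the surrogate starts at 1 and only drifts as pi is updated.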