baselines
PPO2 clip value loss
If the value is clipped to reduce variability during critic training, why does the code use tf.maximum(tf.square(vpred - R), tf.square(vpredclipped - R)) instead of tf.minimum? Please explain the reasoning.
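To make the question concrete, here is a minimal NumPy sketch of the expression being asked about. It is not the baselines code itself: the function name `clipped_value_loss` and the sample numbers are made up for illustration, and the arguments simply mirror `vpred`, `OLDVPRED`, `R`, and `CLIPRANGE` from the thread.

```python
import numpy as np

def clipped_value_loss(vpred, old_vpred, returns, clip_range):
    """Sketch of a clipped value loss (hypothetical helper, not the library API)."""
    # Keep the new prediction within clip_range of the old prediction.
    vpred_clipped = old_vpred + np.clip(vpred - old_vpred, -clip_range, clip_range)
    # Squared error for the unclipped and the clipped prediction.
    loss_unclipped = np.square(vpred - returns)
    loss_clipped = np.square(vpred_clipped - returns)
    # Elementwise maximum keeps the larger of the two errors per sample.
    return np.mean(np.maximum(loss_unclipped, loss_clipped))

# The new prediction (1.5) moved toward the return (2.0) by more than clip_range.
loss = clipped_value_loss(np.array([1.5]), np.array([1.0]), np.array([2.0]), 0.2)
print(loss)
```

In this example the unclipped error is 0.25 and the clipped error is about 0.64, so the maximum selects the clipped term, while a minimum would have selected the unclipped one.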
I have the same question. Also, the clip range in tf.clip_by_value(vpred - OLDVPRED, -CLIPRANGE, CLIPRANGE) looks odd to me, because it does not depend on the magnitude of the value. Should it perhaps be tf.clip_by_value(vpred - OLDVPRED, -tf.abs(OLDVPRED) * CLIPRANGE, tf.abs(OLDVPRED) * CLIPRANGE)?
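The difference between the two clip ranges can be sketched in NumPy as below. Both function names (`clip_absolute`, `clip_relative`) are hypothetical and only illustrate the two variants discussed in this comment; neither is claimed to be what the library should do.

```python
import numpy as np

def clip_absolute(vpred, old_vpred, clip_range):
    # Fixed bound: the prediction may move at most clip_range from the old value.
    return old_vpred + np.clip(vpred - old_vpred, -clip_range, clip_range)

def clip_relative(vpred, old_vpred, clip_range):
    # Proposed variant: the bound scales with the magnitude of the old value.
    bound = np.abs(old_vpred) * clip_range
    return old_vpred + np.clip(vpred - old_vpred, -bound, bound)

old_vpred = np.array([100.0])
vpred = np.array([150.0])
print(clip_absolute(vpred, old_vpred, 0.2))  # bounded by 0.2 in absolute terms
print(clip_relative(vpred, old_vpred, 0.2))  # bounded by 20% of |old_vpred|
```

With an old value of 100 and a clip range of 0.2, the absolute variant allows a move of only 0.2, while the relative variant allows a move of 20.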
Does anyone have further insight on either/both of the questions posed?
Issue #445 has some discussion on this
I have the same question. I searched for an answer but had no luck. Why does it use maximum instead of minimum?
OK, I found the answer. baselines computes pgloss1 and pgloss2 with a negative sign, so it takes the maximum afterward: the maximum of the negated terms equals the negative of the minimum of the original terms.
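A small NumPy check of that sign argument, as a sketch only. The names `pgloss1` and `pgloss2` follow the comment above; `loss_with_maximum`, `objective_with_minimum`, and the sample ratios and advantages are assumptions made up for this example.

```python
import numpy as np

def loss_with_maximum(ratio, adv, clip_range):
    # Negated surrogate terms, to be minimized by gradient descent.
    pgloss1 = -adv * ratio
    pgloss2 = -adv * np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range)
    return np.mean(np.maximum(pgloss1, pgloss2))

def objective_with_minimum(ratio, adv, clip_range):
    # Un-negated surrogate terms, to be maximized.
    surr1 = adv * ratio
    surr2 = adv * np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range)
    return np.mean(np.minimum(surr1, surr2))

ratio = np.array([0.5, 1.0, 1.5])
adv = np.array([1.0, -1.0, 2.0])
# max(-a, -b) == -min(a, b), so the loss is the negative of the objective.
print(loss_with_maximum(ratio, adv, 0.2), -objective_with_minimum(ratio, adv, 0.2))
```

The two printed numbers agree, which is the identity max(-a, -b) = -min(a, b) applied elementwise.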