Clipped Loss in Critic of ppo2 unnecessarily long?
Hi, I've been wondering whether the code for the approximate trust-region update of the critic (via loss clipping) is a little more convoluted than it needs to be.
Specifically, the code section I am referring to is:
```python
# Clip the value to reduce variability during Critic training
# Get the predicted value
vpred = train_model.vf
vpredclipped = OLDVPRED + tf.clip_by_value(train_model.vf - OLDVPRED, - CLIPRANGE, CLIPRANGE)
# Unclipped value
vf_losses1 = tf.square(vpred - R)
# Clipped value
vf_losses2 = tf.square(vpredclipped - R)
vf_loss = .5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))
```
I take it that this produces a zero gradient as soon as the prediction oversteps the specified clip range, but I am wondering whether the following code would do the same thing (it does in my own trials):
```python
vpred = train_model.vf
vpredclipped = OLDVPRED + tf.clip_by_value(train_model.vf - OLDVPRED, - CLIPRANGE, CLIPRANGE)
vf_loss = .5 * tf.reduce_mean(tf.square(vpredclipped - R))
```
The way I see it, vpredclipped is the same as vpred anyway, as long as the current prediction lies within the clip range. Not that this brings any great computational benefit; I am just asking for clarification in case there is something I am overlooking, or another reason the code is written the way it is.
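To make the comparison concrete, here is a minimal NumPy sketch that evaluates both formulations on toy numbers. The values and variable names (`old_vpred`, `returns`, `clip_range`) are made up purely for illustration; only the two loss formulas come from the snippets above.

```python
import numpy as np

# Toy numbers, purely for illustration; names mirror the ppo2 snippet.
old_vpred = 1.0      # value prediction at rollout time (OLDVPRED)
returns = 2.0        # return target (R)
clip_range = 0.2     # same epsilon used for the policy ratio clip (CLIPRANGE)

for vpred in np.linspace(0.0, 2.0, 9):
    vpred_clipped = old_vpred + np.clip(vpred - old_vpred, -clip_range, clip_range)
    # My shorter version: clipped loss only
    loss_clipped_only = 0.5 * np.square(vpred_clipped - returns)
    # The ppo2 version: pessimistic max of unclipped and clipped loss
    loss_max = 0.5 * np.maximum(np.square(vpred - returns),
                                np.square(vpred_clipped - returns))
    print(f"vpred={vpred:4.2f}  clipped-only={loss_clipped_only:6.3f}  max={loss_max:6.3f}")
```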
On a side note: were there any attempts at making the clip range scale-invariant, e.g. by approximating the variances of vpred and vpredclipped with the respective total losses, as suggested in the GAE paper?
Best regards,
anyboby
Also, why do we need to clip it using the same value (CLIPRANGE) as the one used for the policy loss ratio?