
Clipped Loss in Critic of ppo2 unnecessarily long?

Open · anyboby opened this issue 5 years ago · 1 comment

Hi, I've been wondering whether the code for the approximate trust region update of the critic (via clipping losses) is a little more convoluted than it has to be.

Specifically, the code section I am referring to is:

        # Clip the value to reduce variability during Critic training
        # Get the predicted value
        vpred = train_model.vf
        vpredclipped = OLDVPRED + tf.clip_by_value(train_model.vf - OLDVPRED, - CLIPRANGE, CLIPRANGE)
        # Unclipped value
        vf_losses1 = tf.square(vpred - R)
        # Clipped value
        vf_losses2 = tf.square(vpredclipped - R)

        vf_loss = .5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))

I take it that this produces a zero gradient as soon as we overstep the specified cliprange, but I am wondering whether the following code would just do the same thing (it does in my own trials):

        vpred = train_model.vf
        vpredclipped = OLDVPRED + tf.clip_by_value(train_model.vf - OLDVPRED, - CLIPRANGE, CLIPRANGE)

        vf_loss = .5 * tf.reduce_mean(tf.square(vpredclipped - R))

The way I see it, vpredclipped is the same as vpred anyway, so long as the current prediction lies within the cliprange. Not that this bears any great computational benefit; I am just asking for clarification in case there is something I am overlooking, or another reason the code is written the way it is.
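
For concreteness, here is a small NumPy sketch of the two formulations side by side on a few made-up numbers (not from a real rollout), using the same names as in the snippets above:

    import numpy as np

    # Toy numbers only, just to compare the two formulations side by side;
    # OLDVPRED, R, CLIPRANGE and vpred mirror the names in the snippets above.
    OLDVPRED = np.array([1.0, 1.0, 1.0])
    R = np.array([2.0, 2.0, 2.0])
    CLIPRANGE = 0.2

    # Three example predictions: inside the clip range, outside it towards R,
    # and outside it away from R.
    vpred = np.array([1.1, 1.6, 0.5])

    vpredclipped = OLDVPRED + np.clip(vpred - OLDVPRED, -CLIPRANGE, CLIPRANGE)

    vf_losses1 = np.square(vpred - R)           # unclipped loss
    vf_losses2 = np.square(vpredclipped - R)    # clipped loss

    vf_loss_baselines = .5 * np.mean(np.maximum(vf_losses1, vf_losses2))
    vf_loss_simplified = .5 * np.mean(vf_losses2)

    print(vf_losses1, vf_losses2)
    print(vf_loss_baselines, vf_loss_simplified)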

On a sidenote: were there attempts at making the cliprange scale-invariant, e.g. by approximating the variance of vpred and vpredclipped via the respective total losses, as suggested in the GAE paper?
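
What I have in mind is something like the following (purely hypothetical, not from baselines; sigma and scaled_cliprange are my own names, and sigma is just a rough scale estimate in the spirit of the value trust region in the GAE paper):

    import numpy as np

    # Hypothetical sketch: scale the value clip range by an estimate of the
    # value-error scale, so CLIPRANGE is not tied to the magnitude of the returns.
    # sigma is computed from the old predictions and the empirical returns.
    OLDVPRED = np.array([10.0, 12.0, 8.0, 11.0])   # old value predictions (toy numbers)
    R = np.array([11.0, 13.5, 7.0, 12.0])          # empirical returns (toy numbers)
    vpred = np.array([10.5, 13.0, 7.5, 11.8])      # new value predictions (toy numbers)
    CLIPRANGE = 0.2

    sigma = np.sqrt(np.mean(np.square(OLDVPRED - R)))    # rough value-error scale
    scaled_cliprange = CLIPRANGE * sigma                  # clip range in value units

    vpredclipped = OLDVPRED + np.clip(vpred - OLDVPRED, -scaled_cliprange, scaled_cliprange)
    vf_loss = .5 * np.mean(np.maximum(np.square(vpred - R),
                                      np.square(vpredclipped - R)))
    print(sigma, scaled_cliprange, vf_loss)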

Best regards anyboby

anyboby · Aug 14 '20 15:08

Also, why do we need to clip it using the same value (CLIPRANGE) as the one used for the policy loss ratio?

shtse8 · Aug 31 '20 03:08