AdamW / DecoupledWeightDecayExtension have a race condition

Open albertz opened this issue 3 years ago • 9 comments

The current implementation does:

def _resource_apply_dense(self, grad, var, apply_state=None):
    with tf.control_dependencies(
        [self._decay_weights_op(var, apply_state=apply_state)]
    ):
        return super()._resource_apply_dense(grad, var, apply_state=apply_state)

This means that the optimizer (Adam) update using the gradient will be done after the weight decay.

When calculating the forward + backprop (grad), there is no control dependency between the gradient computation and the decay op, so it is arbitrary and non-deterministic whether the gradient uses the original var or the already weight-decayed var.

Is this intended?

A solution would be to add a control dependency for the _decay_weights_op itself, to depend on grad. This would be consistent with the PyTorch implementation (here).
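
A rough sketch of that change inside _resource_apply_dense (untested, only illustrating the suggested ordering; the names mirror the current Addons code):

def _resource_apply_dense(self, grad, var, apply_state=None):
    # Make the decay op itself depend on grad, so the gradient is always
    # computed from the original (un-decayed) var ...
    with tf.control_dependencies([grad]):
        decay_op = self._decay_weights_op(var, apply_state=apply_state)
    # ... and only then run the decay followed by the Adam update.
    with tf.control_dependencies([decay_op]):
        return super()._resource_apply_dense(grad, var, apply_state=apply_state)

Since every op that reads var for the forward/backward pass is an ancestor of grad, this rules out the race between the gradient computation and the in-place decay.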

albertz avatar Mar 21 '22 14:03 albertz

AdamW is in Keras now. Can you check their implementation: https://github.com/keras-team/keras/blob/master/keras/optimizers/optimizer_experimental/adamw.py

As it is in Keras, we will probably route PRs there.

bhack avatar Mar 21 '22 18:03 bhack

I already checked but I'm unsure. This Keras AdamW uses eager mode execution, and has this code:

# Apply step weight decay
if self.weight_decay != 0:
  wd = tf.cast(self.weight_decay, variable.dtype)
  variable.assign_sub(variable * (1 - lr * wd))

I'm not really sure how control dependencies are handled in eager mode. Specifically, will this assign_sub always be executed after gradient in that function has already been computed? Or will it have no control dependency on gradient? I don't know. Depending on that, it either has the same problem or it does not.

albertz avatar Mar 21 '22 18:03 albertz

These are running in a tf.function in a classical model build/compile:

https://github.com/tensorflow/tensorflow/issues/24874#issuecomment-460882184

Please open a ticket in Keras if you want some change. We are going to deprecate this quite soon.

bhack avatar Mar 21 '22 19:03 bhack

Thanks for the link. But this still does not answer my question of whether those tf.function heuristics would add a control dependency on the gradients to variable.assign_sub(variable * (1 - lr * wd)) or not. Do they? Why?

albertz avatar Mar 22 '22 07:03 albertz

If you read the Note box at this link: with tf.function/AutoGraph, the ops are executed in program order:

https://www.tensorflow.org/api_docs/python/tf/control_dependencies

/cc @mdanatg, who could probably detail the internals behind that claim in the docs, if you have more questions.
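
For illustration, a tiny sketch of what that Note describes (a hypothetical toy example, not the optimizer code): inside a tf.function, stateful ops are serialized in program order via automatic control dependencies.

import tensorflow as tf

v = tf.Variable(10.0)

@tf.function
def step(x):
    # tf.function adds automatic control dependencies, so this assign_sub is
    # guaranteed to run before the read of v on the next line.
    v.assign_sub(1.0)
    return v * x

print(step(tf.constant(2.0)))  # 18.0 -- the read sees the updated v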

bhack avatar Mar 22 '22 10:03 bhack

TensorFlow Addons is transitioning to a minimal maintenance and release mode. New features will not be added to this repository. For more information, please see our public messaging on this decision: TensorFlow Addons Wind Down

Please consider sending feature requests / contributions to other repositories in the TF community with similar charters to TFA: Keras, Keras-CV, Keras-NLP

seanpmorgan avatar Mar 01 '23 05:03 seanpmorgan

This was not a feature request but actually a question, and maybe a bug report. So no bugs are going to be fixed here anymore?

But maybe it was answered already. I'm not exactly sure.

albertz avatar Mar 01 '23 09:03 albertz

Hi @albertz, apologies, this got closed along with the rest (somewhat) incorrectly. It was grouped in because there is another implementation of AdamW that you should use instead of the Addons AdamW. I'll leave this open, but I wouldn't expect a fix for this.

seanpmorgan avatar Mar 01 '23 17:03 seanpmorgan

We'll be adding warnings to overlapping features in the coming days/weeks

seanpmorgan avatar Mar 01 '23 17:03 seanpmorgan