AdamW / DecoupledWeightDecayExtension have a race condition

Open albertz opened this issue 3 years ago • 9 comments

The current implementation does:

def _resource_apply_dense(self, grad, var, apply_state=None):
    with tf.control_dependencies(
        [self._decay_weights_op(var, apply_state=apply_state)]
    ):
        return super()._resource_apply_dense(grad, var, apply_state=apply_state)

This means that the optimizer (Adam) update using the gradient will be done after the weight decay.

When calculating the forward + backprop (grad), there is no control dependency between the gradient computation and the decay op, so it is arbitrary and non-deterministic whether the gradient uses the original var or the already weight-decayed var.

Is this intended?

A solution would be to add a control dependency for the _decay_weights_op itself, to depend on grad. This would be consistent with the PyTorch implementation (here).
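
A rough sketch of that change inside _resource_apply_dense (untested, only illustrating the suggested ordering; the names mirror the current Addons code):

def _resource_apply_dense(self, grad, var, apply_state=None):
    # Make the decay op itself depend on grad, so the gradient is always
    # computed from the original (un-decayed) var ...
    with tf.control_dependencies([grad]):
        decay_op = self._decay_weights_op(var, apply_state=apply_state)
    # ... and only then run the decay followed by the Adam update.
    with tf.control_dependencies([decay_op]):
        return super()._resource_apply_dense(grad, var, apply_state=apply_state)

Since every op that reads var for the forward/backward pass is an ancestor of grad, this rules out the race between the gradient computation and the in-place decay.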

albertz avatar Mar 21 '22 14:03 albertz

AdamW is in Keras now. Can you check their implementation: https://github.com/keras-team/keras/blob/master/keras/optimizers/optimizer_experimental/adamw.py

As it is in Keras, we will probably route PRs there.

bhack avatar Mar 21 '22 18:03 bhack

I already checked but I'm unsure. This Keras AdamW uses eager mode execution, and has this code:

# Apply step weight decay
if self.weight_decay != 0:
  wd = tf.cast(self.weight_decay, variable.dtype)
  variable.assign_sub(variable * (1 - lr * wd))

I'm not really sure how control dependencies are handled in eager mode. Specifically, will this assign_sub always be executed after gradient in that function has already been computed? Or will it have no control dependency on gradient? I don't know. Depending on that, it either has the same problem or it does not.

albertz avatar Mar 21 '22 18:03 albertz

These are running in a tf.function in a classical model build/compile:

https://github.com/tensorflow/tensorflow/issues/24874#issuecomment-460882184

Please open a ticket in Keras if you want some change. We are going to deprecate this quite soon.

bhack avatar Mar 21 '22 19:03 bhack

Thanks for the link. But this still does not answer my question of whether those tf.function heuristics would add a control dependency on the gradients to variable.assign_sub(variable * (1 - lr * wd)) or not. Do they? Why?

albertz avatar Mar 22 '22 07:03 albertz

If you read the Note box at this link: with tf.function/AutoGraph, the ops are executed in program order:

https://www.tensorflow.org/api_docs/python/tf/control_dependencies

/cc @mdanatg, who could probably detail the internals behind that claim in the docs, if you have more questions.
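
For illustration, a tiny sketch of what that Note describes (a hypothetical toy example, not the optimizer code): inside a tf.function, stateful ops are serialized in program order via automatic control dependencies.

import tensorflow as tf

v = tf.Variable(10.0)

@tf.function
def step(x):
    # tf.function adds automatic control dependencies, so this assign_sub is
    # guaranteed to run before the read of v on the next line.
    v.assign_sub(1.0)
    return v * x

print(step(tf.constant(2.0)))  # 18.0 -- the read sees the updated v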

bhack avatar Mar 22 '22 10:03 bhack

TensorFlow Addons is transitioning to a minimal maintenance and release mode. New features will not be added to this repository. For more information, please see our public messaging on this decision: TensorFlow Addons Wind Down

Please consider sending feature requests / contributions to other repositories in the TF community with similar charters to TFA: Keras, Keras-CV, Keras-NLP

seanpmorgan avatar Mar 01 '23 05:03 seanpmorgan

This was not a feature request but actually a question, and maybe a bug report. So no bugs are going to be fixed here anymore?

But maybe it was answered already. I'm not exactly sure.

albertz avatar Mar 01 '23 09:03 albertz

Hi @albertz, apologies, this got closed along with the rest (somewhat) incorrectly. It was grouped in because there is another implementation of AdamW that you should use instead of the Addons AdamW. I'll leave this open, but I wouldn't expect a fix for this.

seanpmorgan avatar Mar 01 '23 17:03 seanpmorgan

We'll be adding warnings to overlapping features in the coming days/weeks

seanpmorgan avatar Mar 01 '23 17:03 seanpmorgan