AdamW / DecoupledWeightDecayExtension have a race condition
The current implementation does:
def _resource_apply_dense(self, grad, var, apply_state=None):
    with tf.control_dependencies(
        [self._decay_weights_op(var, apply_state=apply_state)]
    ):
        return super()._resource_apply_dense(grad, var, apply_state=apply_state)
This means that the optimizer (Adam) update using the gradient is forced to run after the weight decay.
However, the gradient itself (from the forward + backward pass) has no control dependency on the weight-decay op, so it is arbitrary and non-deterministic whether the gradient is computed from the original var or from the already weight-decayed var.
Is this intended?
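To see why the ordering matters, here is a toy numeric illustration with made-up values (pure Python, using a plain SGD-style update instead of Adam for simplicity; not the Addons code):

# With loss = var**2, the gradient is 2 * var.
var, lr, wd = 2.0, 0.1, 0.01

# Ordering 1: gradient computed from the original var, decay applied
# independently (the intended, PyTorch-style behavior).
grad = 2 * var
var_1 = var * (1 - lr * wd) - lr * grad

# Ordering 2: decay runs first, gradient computed from the decayed var
# (also possible with the current code, since nothing orders the two).
var_decayed = var * (1 - lr * wd)
grad = 2 * var_decayed
var_2 = var_decayed - lr * grad

print(var_1, var_2)  # 1.598 vs. 1.5984 -> the two orderings give different updates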
A solution would be to add a control dependency for the _decay_weights_op itself, so that it depends on grad. This would be consistent with the PyTorch implementation (here).
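A minimal sketch of that idea, covering only the dense path (not the actual Addons code; the sparse variants would need the same treatment):

def _resource_apply_dense(self, grad, var, apply_state=None):
    # Make the decay op wait for the gradient, so the gradient is
    # guaranteed to have been computed from the original var.
    with tf.control_dependencies([grad]):
        decay_op = self._decay_weights_op(var, apply_state=apply_state)
    # Then run the Adam update after the decay, as before.
    with tf.control_dependencies([decay_op]):
        return super()._resource_apply_dense(grad, var, apply_state=apply_state)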
AdamW is in Keras now. Can you check their implementation: https://github.com/keras-team/keras/blob/master/keras/optimizers/optimizer_experimental/adamw.py
As it is in Keras, we will probably route PRs there.
I already checked but I'm unsure. This Keras AdamW uses eager mode execution, and has this code:
# Apply step weight decay
if self.weight_decay != 0:
    wd = tf.cast(self.weight_decay, variable.dtype)
    variable.assign_sub(variable * (1 - lr * wd))
I'm not really sure how control dependencies are handled in eager mode. Specifically, will this always be executed after the gradient in that function has already been computed, or will it have no control dependency on the gradient at all? I don't know. Depending on the answer, it either has the same problem or it doesn't.
These run inside a tf.function in a classic model build/compile workflow:
https://github.com/tensorflow/tensorflow/issues/24874#issuecomment-460882184
Please open a ticket in Keras if you want some change. We are going to deprecate this quite soon.
Thanks for the link. But this still does not answer the question for me of whether those tf.function heuristics would add a control dependency on the gradients to variable.assign_sub(variable * (1 - lr * wd)) or not. Does it? Why?
If you read the Note box at this link: with tf.function/AutoGraph, the ops are executed in the respective program order:
https://www.tensorflow.org/api_docs/python/tf/control_dependencies
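A toy example of what that note means in practice (my own illustration, not the Keras code): inside a tf.function, automatic control dependencies keep ops that touch the same variable in program order, so the read used for the gradient happens before the assign.

import tensorflow as tf

v = tf.Variable(3.0)

@tf.function
def step(lr=0.1, wd=0.01):
    # Read v for the gradient first (program order)...
    with tf.GradientTape() as tape:
        loss = v * v
    grad = tape.gradient(loss, v)
    # ...then decay; tf.function's automatic control dependencies keep this
    # write ordered after the read above, because both touch the variable v.
    v.assign_sub(v * lr * wd)
    return grad

print(step().numpy())  # 6.0 -> the gradient was taken at the original v == 3.0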
/cc @mdanatg, who could probably detail the internals behind that claim in the docs if you have more questions.
TensorFlow Addons is transitioning to a minimal maintenance and release mode. New features will not be added to this repository. For more information, please see our public messaging on this decision: TensorFlow Addons Wind Down
Please consider sending feature requests / contributions to other repositories in the TF community with similar charters to TFA: Keras, Keras-CV, Keras-NLP
This was not a feature request, but actually a question and maybe a bug report. So no bugs are going to be fixed here anymore?
But maybe it was answered already. I'm not exactly sure.
Hi @albertz, apologies, this got closed (somewhat) incorrectly along with the others. It was grouped in because there is another implementation of AdamW that you should use instead of the Addons AdamW. I'll leave this open, but I wouldn't expect a fix for this.
We'll be adding warnings to overlapping features in the coming days/weeks