Clipvalue not working as expected with the AdamW and SGDW optimizers - destructive updates destroy model weights
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
- TensorFlow version and how it was installed (source or binary): 2.9.1
- TensorFlow-Addons version and how it was installed (source or binary): 0.17.1
- Python version: 3.10
- Is GPU used? (yes/no): Yes
Describe the bug
DecoupledWeightDecayExtension (https://github.com/tensorflow/addons/blob/b2dafcfa74c5de268b8a5c53813bc0b89cadf386/tensorflow_addons/optimizers/weight_decay_optimizers.py#L25) computes and applies its weight-decay update without any reference to the loss. That is fine and intended - it is why the decay is called "decoupled". The problem is that, because of this, it also does not respect clipvalue (or clipnorm).
Code to reproduce the issue
https://colab.research.google.com/drive/1G3rJMs_V6cI6TkHDgXCyQOapQEcuCWeW?usp=sharing
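For reference, here is a minimal inline sketch of the behavior described above (not the contents of the Colab; the variable value and hyperparameters are illustrative, and it assumes tensorflow and tensorflow_addons are installed):

```python
import tensorflow as tf
import tensorflow_addons as tfa

w = tf.Variable([10.0])
opt = tfa.optimizers.AdamW(weight_decay=0.5, learning_rate=0.01, clipvalue=0.1)

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(w * 0.0)  # zero gradient, so Adam's own step is ~0

grads = tape.gradient(loss, [w])
before = w.numpy().copy()
opt.apply_gradients(zip(grads, [w]))
after = w.numpy()

# The change comes almost entirely from the decoupled decay term (weight_decay * w),
# which clipvalue does not limit: |delta| is about 5.0 here, not <= 0.1.
print("update magnitude:", abs(after - before))
```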
EDIT: Playing around a bit more, there is definitely a bug, but I'm not sure my explanation of it is the correct one. The problem only seems to arise for large (> 1) values of weight_decay. However, I can say that in real practice I am getting very poor results, as well as NaNs, while using AdamW even with a clipvalue in place that should prevent any such NaNs from occurring. I'll investigate further as I have time.
EDIT 2: I removed the part where I speculated on the cause; I was wrong and had misunderstood how the weight decay is implemented. Reading the paper linked on the doc page more closely revealed the real culprit, which is not what I thought: the weight decay here literally multiplies every single weight by (1 - wd) on every step. At the very least, this is poorly documented - you can only figure it out by reading the linked paper, when it should be right on the docs page, since it is pretty important to know. Additionally, there is no check for invalid weight_decay values (e.g. > 1).
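To make the arithmetic concrete, a small sketch (values are illustrative) showing that the decay step is w <- w - wd * w = (1 - wd) * w on every apply_gradients call, and that nothing rejects wd > 1:

```python
import tensorflow as tf
import tensorflow_addons as tfa

w = tf.Variable([2.0])
# No validation error is raised for weight_decay > 1:
opt = tfa.optimizers.AdamW(weight_decay=1.5, learning_rate=0.001)

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(w * 0.0)  # zero gradient isolates the decay step
grads = tape.gradient(loss, [w])

opt.apply_gradients(zip(grads, [w]))
# Decay step: w <- (1 - 1.5) * w, so [2.0] becomes roughly [-1.0].
# With any weight_decay > 1 the sign of every weight flips on each step
# (and for weight_decay > 2 the magnitudes grow without bound).
print(w.numpy())
```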
One of the first results when someone googles AdamW to learn about it is https://www.fast.ai/2018/07/02/adam-weight-decay/, which recommends a value of 0.3 for the weight decay parameter. A closer look, though, reveals that they multiply this number by the learning rate - which is also apparently the case in the torch implementation: https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html
This is, I suppose, convenient if you want your decay to scale with a learning rate schedule without doing any extra work - and, by the same token, inconvenient if you don't.
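If you do want the torch/fast.ai behavior, where the effective decay tracks the learning rate, one workaround is sketched below, roughly following the callable-hyperparameter pattern shown in the TFA docstring (the schedule and the scale factors are illustrative):

```python
import tensorflow as tf
import tensorflow_addons as tfa

# Drive the learning rate and the weight decay from the same schedule so the
# effective decay scales with the learning rate.
step = tf.Variable(0, trainable=False)
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[10000, 15000], values=[1.0, 0.1, 0.01]
)

lr = lambda: 1e-3 * schedule(step)
wd = lambda: 1e-4 * schedule(step)  # decays in lockstep with the learning rate

opt = tfa.optimizers.AdamW(learning_rate=lr, weight_decay=wd)
# Note: `step` must be advanced manually, e.g. step.assign_add(1) after each
# opt.apply_gradients(...) call.
```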
In any case, AdamW presents itself as an optimizer, and clipvalue on an optimizer is supposed to mean that gradient updates larger than the clipvalue don't happen. Nonetheless, the weight decay on AdamW can certainly produce weight updates well in excess of the clipvalue. I understand why now, but I do think there should be some additional warning/documentation on this, to make it clear up front that the weight decay step is completely independent and not affected by anything you might usually expect to affect an optimizer (e.g. learning rate, clipvalue, etc.), even though it is all being done by an optimizer class object.
AdamW is now in Keras; if you want to add something to the docs, you need to do it there: https://github.com/keras-team/keras/blob/master/keras/optimizers/optimizer_experimental/adamw.py
TensorFlow Addons is transitioning to a minimal maintenance and release mode. New features will not be added to this repository. For more information, please see our public messaging on this decision: TensorFlow Addons Wind Down
Please consider sending feature requests / contributions to other repositories in the TF community with a similar charter to TFA: Keras, Keras-CV, Keras-NLP.