Make reduce LR on plateau weight-decay compatible for decoupled optimizers
Describe the feature and the current behavior/state.
It has been shown that decoupling weight decay from the learning rate can simplify hyperparameter search and lead to better performance.
https://arxiv.org/abs/1711.05101
Tensorflow has implementations for Adam and SGD.
https://www.tensorflow.org/addons/api_docs/python/tfa/optimizers/SGDW
However, the docs mention that when we decay the learning rate, we also need to decay the weight decay itself by the same factor. This is easy to do with a static scheduler like the one in their example, but I think reduce LR on plateau is still a valid technique, and adding the weight-decay decay to that callback is fairly straightforward.
All we would really need to do is check whether the optimizer has a weight decay and, if it does, decay it by the same factor as the learning rate. Existing functionality would remain unchanged, and if there is weight decay, it should arguably be decayed along with the learning rate anyway. Even if that is not the desired default, a simple additional argument such as `decay_wd=False` should suffice.
Relevant information
- Are you willing to contribute it (yes/no): Yes
- Are you willing to maintain it going forward? (yes/no): Yes
- Is there a relevant academic paper? (if so, where): Referenced above
- Is there already an implementation in another framework? (if so, where): Unsure
- Was it part of tf.contrib? (if so, where): Unsure
Which API type would this fall under (layer, metric, optimizer, etc.)? Callback
Who will benefit from this feature? Users who use loss-based learning-rate decay together with optimizers that have decoupled weight decay
Ex.
```python
if self.monitor_op(current, self.best):
    self.best = current
    self.wait = 0
elif not self.in_cooldown():
    self.wait += 1
    if self.wait >= self.patience:
        old_lr = float(K.get_value(self.model.optimizer.lr))
        old_wd = float(K.get_value(self.model.optimizer.weight_decay))
        if old_lr > self.min_lr:
            new_lr = old_lr * self.factor
            new_wd = old_wd * self.factor
            new_lr = max(new_lr, self.min_lr)
            K.set_value(self.model.optimizer.weight_decay, new_wd)
            K.set_value(self.model.optimizer.lr, new_lr)
            if self.verbose > 0:
                print('\nEpoch %05d: ReduceLROnPlateau reducing learning '
                      'rate to %s.' % (epoch + 1, new_lr))
            self.cooldown_counter = self.cooldown
            self.wait = 0
```
Hi @ben-arnao! Does the WeightDecayOptimizer base class sufficiently implement this for you? https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/weight_decay_optimizers.py#L24
As an example implementation SGDW: https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/weight_decay_optimizers.py#L261
Hi, thanks for the reply, but I think you may have misunderstood my question. Sorry if I was not clear. In the documentation for SGDW it is recommended that you reduce the weight decay itself with any LR schedulers you may have. Because of this, if I use the reduce-LR-on-plateau callback, I need to add custom code to use it together with SGDW. I am wondering if it might be a good idea to add this functionality as a baseline to RLRoP.
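For context, the pattern the SGDW docs recommend only works when the decay points are known up front. A minimal pure-Python sketch (the constants and function names here are illustrative, not from the docs or the TFA API):

```python
# With a *static* schedule, decaying weight decay alongside the
# learning rate is easy: both are plain functions of the step.
def lr_schedule(step):
    return 1e-2 * 0.5 ** (step // 1000)   # halve the lr every 1000 steps

def wd_schedule(step):
    return 1e-4 * 0.5 ** (step // 1000)   # same factor, so wd/lr stays constant

# ReduceLROnPlateau, by contrast, decides *when* to decay from the
# monitored loss, so the weight decay cannot be written as a function
# of the step -- it has to be updated inside the callback itself.
```

This is why combining SGDW with plateau-based decay currently requires custom callback code rather than a schedule.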
Ah apologies, I understand now. Do you imagine this new callback being a subclass of ReduceLROnPlateau with the modified on_epoch_end? Any thoughts about the name of the new callback?
As just a proof of concept would it be possible to run this in a Colab notebook to show the weight-decay-decay as a useful addition? Ideally showing improved loss convergence on some dataset.
++ @Squadrick, who authored the weight-decay optimizer extension, for input.
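A framework-agnostic sketch of what such a subclass's on_epoch_end logic could look like (the class name and plain `lr`/`weight_decay` attributes are hypothetical; a real version would subclass tf.keras.callbacks.ReduceLROnPlateau and go through K.get_value/K.set_value):

```python
class ReduceLRWDOnPlateau:
    """Sketch of the proposed callback logic (hypothetical name).

    A plain object with `lr` and `weight_decay` attributes stands in
    for the optimizer so the plateau logic can be shown on its own.
    """

    def __init__(self, optimizer, factor=0.1, patience=3, min_lr=1e-6):
        self.optimizer = optimizer
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.wait = 0

    def on_epoch_end(self, epoch, monitored_loss):
        if monitored_loss < self.best:      # improvement: reset patience
            self.best = monitored_loss
            self.wait = 0
            return
        self.wait += 1
        if self.wait >= self.patience:
            old_lr = self.optimizer.lr
            if old_lr > self.min_lr:
                new_lr = max(old_lr * self.factor, self.min_lr)
                # Scale weight decay by the factor actually applied to
                # the lr, so the wd/lr ratio is preserved even when the
                # lr is clamped at min_lr (the SGDW docs' advice).
                self.optimizer.weight_decay *= new_lr / old_lr
                self.optimizer.lr = new_lr
            self.wait = 0
```

A real implementation would also carry over cooldown handling and verbose logging from the parent class; only the two highlighted lines differ from stock ReduceLROnPlateau.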
@seanpmorgan Maybe something like ReduceLRonPlateauWithWeightDecay would be most descriptive, though that is getting rather wordy.
However, in all scenarios where we are reducing the learning rate we should really be reducing the weight decay too. Because of this it would make the most sense to me for this functionality to be built into the main callback. That said, I understand this library doesn't deal with modifications to the main TF library, so maybe it's better to do it here and see if users find it useful.
The reasoning for why this needs to be done seems intuitive to me. When you change the learning rate, which affects the magnitude by which parameters are adjusted per step, you need to proportionally reduce the magnitude by which those parameters decay per step. If you don't do this, you'll reach a point where the learning rate is not enough to overcome the weight decay and the weights just go to zero.
Unfortunately I may not be the best person to ask for a more mathematical proof than that. The authors of the documentation might be able to provide something better.
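Though not a proof, the effect is easy to demonstrate numerically. In this toy sketch (all constants illustrative), SGD with decoupled weight decay minimizes (w - 2)^2 while the learning rate is cut by 10x every 1000 steps; when the weight decay is not cut along with it, the decay term eventually dominates the shrinking gradient step and the weight is dragged toward zero instead of staying near the optimum:

```python
def sgdw_step(w, grad, lr, wd):
    # Decoupled weight decay (Loshchilov & Hutter): the decay term is
    # applied to the weight directly, not folded into the gradient.
    return w - lr * grad - wd * w

def train(decay_wd):
    w, lr, wd = 0.0, 0.1, 0.01
    for step in range(10_000):
        grad = 2.0 * (w - 2.0)          # gradient of (w - 2)**2
        w = sgdw_step(w, grad, lr, wd)
        if step % 1000 == 999:          # stand-in for a plateau trigger
            lr *= 0.1
            if decay_wd:
                wd *= 0.1               # keep wd/lr constant
    return w
```

With `decay_wd=True` the weight settles near the (decay-shifted) optimum of about 1.9; with `decay_wd=False` it collapses to roughly zero.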
Anyhow if this is something others would be interested in I can make a PR.
What's the status on this? It seems straightforward enough to modify on_epoch_end and reduce the weight-decay parameter in the same fashion as the learning rate (as indicated by the docs), but I'd rather have a public implementation of it.
TensorFlow Addons is transitioning to a minimal maintenance and release mode. New features will not be added to this repository. For more information, please see our public messaging on this decision: TensorFlow Addons Wind Down
Please consider sending feature requests / contributions to other repositories in the TF community with a similar charter to TFA: Keras, Keras-CV, and Keras-NLP.