Arraymancer
Optimiser: Implement AdamW and SGDW
Two variants of optimizers already implemented in Arraymancer have been gaining traction recently: AdamW and SGDW. Both were proposed in the 2017 paper Decoupled Weight Decay Regularization (https://arxiv.org/abs/1711.05101) but have only recently seen widespread use. Mathematical formulas for both new update procedures are given on page 3 of that paper.
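For reference, ignoring momentum and the schedule multiplier used in the paper, the difference between L2 regularization and decoupled weight decay for plain SGD can be sketched as:

```latex
% SGD with L2 regularization: the decay term rides along with the gradient
\theta_{t+1} = \theta_t - \alpha \left( \nabla f_t(\theta_t) + \lambda \theta_t \right)
% SGDW: the weights are decayed directly, decoupled from the gradient step
\theta_{t+1} = (1 - \lambda)\,\theta_t - \alpha \nabla f_t(\theta_t)
```

For plain SGD the two coincide after rescaling λ by the learning rate, but for adaptive methods like Adam they genuinely differ, because in the L2 form the decay term is also divided by the adaptive denominator.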
TensorFlow has had them implemented for a while as "weight decay optimizers"; for example, AdamW:
https://github.com/tensorflow/tensorflow/blob/5912f51d580551e5cee2cfde4cb882594b4d3e60/tensorflow/contrib/opt/python/training/weight_decay_optimizers.py#L356-L362
PyTorch merged a pull request implementing AdamW a month ago: https://github.com/pytorch/pytorch/blob/master/torch/optim/adamw.py, and a pull request for SGDW is currently open.
Since these two optimizers are simply modified and extended versions of optimizers already present in Arraymancer, I feel implementing them should take relatively little effort for the value they add.
AdamW and SGDW both operate on the principle of decoupling the weight decay factor from the gradient update and applying it directly to the weight update instead. At each optimizer step, the current weights are "decayed" by some weight decay factor. This prevents the weights from growing too large and helps prevent overfitting. Additionally, it makes weight decay and learning rate more separately tunable (see page 6).
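To make the decoupling concrete, here is a scalar, single-step sketch in plain Python (the function names, signatures, and hyperparameter defaults are mine for illustration, not Arraymancer API):

```python
import math

def adam_l2_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, wd=1e-2):
    """Classic Adam with L2 regularization: the decay term is folded
    into the gradient, so it also passes through the adaptive scaling."""
    grad = grad + wd * w
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    """AdamW: the moments see only the raw gradient; the decay term is
    applied to the weight directly, outside the adaptive scaling."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    return w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w), m, v
```

Starting from w = 1.0 with gradient 0.5, the two produce different weights after one step, which is exactly the point: the decayed weight no longer gets rescaled by the adaptive denominator.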
There is also AMSGrad, which I looked into; see https://github.com/pytorch/pytorch/blob/master/torch/optim/adam.py#L20 and the paper On the Convergence of Adam and Beyond.
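For context, AMSGrad's change to Adam is tiny: it keeps a running maximum of the second-moment estimate and uses that maximum in the denominator, so the effective step size can never grow back. A scalar sketch in plain Python (function name and defaults are mine; bias correction follows the PyTorch-style variant):

```python
import math

def amsgrad_step(w, grad, m, v, v_max, t, lr=1e-3, beta1=0.9,
                 beta2=0.999, eps=1e-8):
    # Standard Adam first- and second-moment estimates.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # AMSGrad's one change: keep the running maximum of the second
    # moment, so the denominator is non-decreasing over time.
    v_max = max(v_max, v)
    m_hat = m / (1 - beta1 ** t)      # bias correction
    v_hat = v_max / (1 - beta2 ** t)  # corrected max, PyTorch-style
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v, v_max
```

Note that after a step with a small (or zero) gradient, `v_max` stays at its previous peak instead of shrinking with `v`, which is what fixes the convergence counterexamples from the paper.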
And this led to AdamX: https://arxiv.org/abs/1904.03590 :P
Well, if we want to cover all our bases, we have to collect all the Adam variants. In the appendix of the aforementioned AdamW paper, they propose a further variant, AdamR (Adam with warm restarts), which can also be combined with AdamW to form AdamWR. It's a veritable alphabet soup of Adam variants!
There is now RectifiedAdam (RAdam) (https://arxiv.org/abs/1908.03265), and RAdam can be combined with LookAhead (https://arxiv.org/abs/1907.08610v1) from Hinton's group.