AdamW-pytorch
weight decay multiplied by learning rate
If you look carefully at the formula in the article, the weight decay (w) is not multiplied by the learning rate (alpha), but rather by the schedule multiplier (eta). In the PyTorch implementation, eta always seems to be assumed to be one. In your code, however, you implicitly multiply w by alpha (via step_size). I suggest changing the corresponding line to
```python
p.data.add_(torch.mul(p.data, -group['weight_decay']).addcdiv_(-step_size, exp_avg, denom))
```
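For reference, the decoupled update in the AdamW paper (Loshchilov & Hutter, Algorithm 2), writing w for the decay factor the paper calls lambda, is

$\theta_t \leftarrow \theta_{t-1} - \eta_t \left( \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + w\, \theta_{t-1} \right)$

so only the gradient term carries alpha; with eta_t fixed to one, the decay term is plain w * theta.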
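To show what I mean in context, here is a minimal sketch of the surrounding step() logic. It assumes the usual Adam-derived names from this repo (exp_avg, exp_avg_sq, denom, step_size), takes eta to be one, and uses the keyword addcdiv_ signature of newer PyTorch releases; it is an illustration, not your actual step():

```python
import math
import torch

def adamw_step(p, grad, exp_avg, exp_avg_sq, step, group):
    # Sketch only: names mirror this repo's Adam-derived loop; the real
    # step() iterates over param_groups and keeps these tensors in state.
    beta1, beta2 = group['betas']

    # Standard Adam first/second moment estimates.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    denom = exp_avg_sq.sqrt().add_(group['eps'])
    step_size = group['lr'] * math.sqrt(1 - beta2 ** step) / (1 - beta1 ** step)

    # Decoupled update: only the gradient term is scaled by step_size
    # (i.e. by alpha); the decay term w * theta is not (eta = 1).
    p.data.add_(torch.mul(p.data, -group['weight_decay'])
                     .addcdiv_(exp_avg, denom, value=-step_size))

# Toy check that the decay applied is w * theta, independent of lr:
p = torch.ones(3)
group = dict(lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2)
adamw_step(p, torch.zeros(3), torch.zeros(3), torch.zeros(3), 1, group)
print(p)  # ~0.99 everywhere: weights shrank by w, untouched by lr
```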