AdamW-pytorch

weight decay multiplied by learning rate

Open · kgoba opened this issue 5 years ago · 8 comments

If you look carefully at the formula in the article, the weight decay (w) is not multiplied by the learning rate (alpha), but rather by the schedule multiplier (eta): theta <- theta - eta * (alpha * m_hat / (sqrt(v_hat) + eps) + w * theta). In the PyTorch implementation, eta seems to always be assumed to be one. In your code, however, you implicitly multiply w by alpha (step_size). I suggest changing the corresponding line to

p.data.add_(torch.mul(p.data, -group['weight_decay']).addcdiv(-step_size, exp_avg, denom))
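
For reference, here is a minimal sketch of one decoupled step written out the way the paper's formula reads, keeping eta separate from alpha. The variable names (alpha, eta, w, m_hat, v_hat) follow the paper's notation and are illustrative, not taken from this repo:

import torch

torch.manual_seed(0)
p = torch.randn(4)                           # parameter vector
m_hat = torch.randn(4)                       # bias-corrected first moment
v_hat = torch.rand(4)                        # bias-corrected second moment
alpha, eta, w, eps = 1e-3, 1.0, 1e-2, 1e-8   # lr, schedule multiplier, weight decay, epsilon

denom = v_hat.sqrt() + eps
# The weight decay term is scaled by eta only; it never picks up a factor of alpha.
p = p - eta * (alpha * m_hat / denom + w * p)

With eta fixed at one, the decay term reduces to w * p, which is exactly what the one-line change above produces.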

kgoba · Dec 17 '18 14:12