AdamW-pytorch
weight decay multiplied by learning rate
If you look carefully at the formula in the article, the weight decay (w) is not multiplied by the learning rate (alpha), but rather by the schedule multiplier (eta). In the PyTorch implementation, eta always seems to be assumed to be one. In your code, however, you implicitly multiply w by alpha (via step_size). I suggest changing the corresponding line to
```python
p.data.add_(torch.mul(p.data, -group['weight_decay']).addcdiv_(-step_size, exp_avg, denom))
```
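For reference, the decoupled update in the AdamW paper (Loshchilov & Hutter, Algorithm 2), writing w for the decay factor the paper calls lambda, is

$\theta_t \leftarrow \theta_{t-1} - \eta_t \left( \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + w\, \theta_{t-1} \right)$

so only the gradient term carries alpha; with eta_t fixed to one, the decay term is plain w * theta.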
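To show what I mean in context, here is a minimal sketch of the surrounding step() logic. It assumes the usual Adam-derived names from this repo (exp_avg, exp_avg_sq, denom, step_size), takes eta to be one, and uses the keyword addcdiv_ signature of newer PyTorch releases; it is an illustration, not your actual step():

```python
import math
import torch

def adamw_step(p, grad, exp_avg, exp_avg_sq, step, group):
    # Sketch only: names mirror this repo's Adam-derived loop; the real
    # step() iterates over param_groups and keeps these tensors in state.
    beta1, beta2 = group['betas']

    # Standard Adam first/second moment estimates.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    denom = exp_avg_sq.sqrt().add_(group['eps'])
    step_size = group['lr'] * math.sqrt(1 - beta2 ** step) / (1 - beta1 ** step)

    # Decoupled update: only the gradient term is scaled by step_size
    # (i.e. by alpha); the decay term w * theta is not (eta = 1).
    p.data.add_(torch.mul(p.data, -group['weight_decay'])
                     .addcdiv_(exp_avg, denom, value=-step_size))

# Toy check that the decay applied is w * theta, independent of lr:
p = torch.ones(3)
group = dict(lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2)
adamw_step(p, torch.zeros(3), torch.zeros(3), torch.zeros(3), 1, group)
print(p)  # ~0.99 everywhere: weights shrank by w, untouched by lr
```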