pytorch-optimizer
Implement gradient pre-normalization in LAMB optimizer
This PR implements gradient pre-normalization: before the LAMB update, each gradient is divided by the global L2 norm of all gradients in the model, as discussed in https://developer.nvidia.com/blog/pretraining-bert-with-layer-wise-adaptive-learning-rates/
It adds a `prenorm` boolean option to `torch_optimizer.lamb`.
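For reference, here is a minimal sketch of the pre-normalization step as a standalone helper. The function name and the `eps` guard are illustrative assumptions; the PR itself folds this logic into LAMB's `step()` rather than exposing a separate function:

```python
import torch


def prenormalize_gradients(params, eps=1e-8):
    """Scale every gradient by the global L2 norm of all gradients.

    Illustrative helper only (not the PR's actual API): treats all
    gradients in the model as one flat vector and divides each gradient
    tensor by that vector's L2 norm.
    """
    grads = [p.grad for p in params if p.grad is not None]
    if not grads:
        return
    # Global norm over all gradients, computed from per-tensor norms:
    # ||g||_2 = sqrt(sum_i ||g_i||_2^2)
    global_norm = torch.norm(torch.stack([torch.norm(g) for g in grads]))
    for g in grads:
        # eps avoids division by zero when all gradients vanish
        g.div_(global_norm + eps)
```

With the option enabled, usage would look something like `optimizer = torch_optimizer.Lamb(model.parameters(), lr=1e-3, prenorm=True)`, with the rest of the training loop unchanged.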