RL
Muon optimizer support
The Muon optimizer has attracted a lot of interest in the community, and support for it is currently WIP in mcore. It has also been reported that model performance is better when the same optimizer is used in both the pretraining and post-training stages.
There is an existing effort on a distributed implementation in mcore, which will be merged as part of megatron-core.
@ZhiyuLi-Nvidia do we know what would be the expected gain?
Muon has achieved the following empirical results:
- Improved the speed record for training to 94% accuracy on CIFAR-10 from 3.3 to 2.6 A100-seconds.
- Improved the speed record for training to 3.28 val loss on FineWeb (a competitive task known as NanoGPT speedrunning) by a factor of 1.35x.
- Continued showing training speed improvements while scaling to 774M and 1.5B parameters.
- Trained a 1.5B parameter transformer to GPT-2 XL level performance on HellaSwag in 10 8xH100-hours. Using AdamW to achieve the same result takes 13.3 hours.
see https://kellerjordan.github.io/posts/muon/
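For context, the core of Muon is a momentum step followed by an approximate orthogonalization of the update via a Newton-Schulz iteration, applied to 2D weight matrices. Below is a minimal sketch of that update based on the post linked above; the function names are illustrative and the actual mcore integration may differ.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix using a quintic Newton-Schulz iteration."""
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315        # iteration coefficients from the Muon post
    X = G / (G.norm() + eps)                 # normalize so the iteration is stable
    transposed = G.size(0) > G.size(1)
    if transposed:                           # operate on the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(params, momentum_buffers, lr: float = 0.02, momentum: float = 0.95):
    """One Muon update for a list of 2D weight matrices (illustrative, not the mcore API)."""
    for p, buf in zip(params, momentum_buffers):
        if p.grad is None:
            continue
        buf.mul_(momentum).add_(p.grad)                 # accumulate momentum
        update = newton_schulz_orthogonalize(buf)       # orthogonalize the momentum
        scale = max(1.0, p.size(0) / p.size(1)) ** 0.5  # shape-dependent scaling from the post
        p.add_(update, alpha=-lr * scale)
```

Note that Muon is typically applied only to the 2D hidden-layer weights, with embeddings, output heads, and other parameters still handled by AdamW, as described in the post.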