RL
Muon optimizer support
The Muon optimizer has attracted a lot of interest in the community, and support for it is currently WIP in mcore. It has also been reported that model performance is better when the same optimizer is used in both the pretraining and post-training stages.
There is an existing effort on a distributed implementation in mcore, which will be merged as part of megatron-core.
@ZhiyuLi-Nvidia do we know what would be the expected gain?
Muon has achieved the following empirical results:
- Improved the speed record for training to 94% accuracy on CIFAR-10 from 3.3 to 2.6 A100-seconds.
- Improved the speed record for training to 3.28 val loss on FineWeb (a competitive task known as NanoGPT speedrunning) by a factor of 1.35x.
- Continued showing training speed improvements while scaling to 774M and 1.5B parameters.
- Trained a 1.5B parameter transformer to GPT-2 XL level performance on HellaSwag in 10 8xH100-hours. Using AdamW to achieve the same result takes 13.3 hours.
see https://kellerjordan.github.io/posts/muon/
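For context, the core of Muon is a momentum step followed by an approximate orthogonalization of the update via a Newton-Schulz iteration, applied to 2D weight matrices. Below is a minimal sketch of that update based on the post linked above; the function names are illustrative and the actual mcore integration may differ.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix using a quintic Newton-Schulz iteration."""
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315        # iteration coefficients from the Muon post
    X = G / (G.norm() + eps)                 # normalize so the iteration is stable
    transposed = G.size(0) > G.size(1)
    if transposed:                           # operate on the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(params, momentum_buffers, lr: float = 0.02, momentum: float = 0.95):
    """One Muon update for a list of 2D weight matrices (illustrative, not the mcore API)."""
    for p, buf in zip(params, momentum_buffers):
        if p.grad is None:
            continue
        buf.mul_(momentum).add_(p.grad)                 # accumulate momentum
        update = newton_schulz_orthogonalize(buf)       # orthogonalize the momentum
        scale = max(1.0, p.size(0) / p.size(1)) ** 0.5  # shape-dependent scaling from the post
        p.add_(update, alpha=-lr * scale)
```

Note that Muon is typically applied only to the 2D hidden-layer weights, with embeddings, output heads, and other parameters still handled by AdamW, as described in the post.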