
Add MuonW optimizer: Muon with AdamW fallback for non-matrix parameters

JenWei0312 opened this issue 4 months ago • 2 comments

This PR adds the MuonW optimizer to OLMo, implementing the Muon optimization algorithm with AdamW fallback for non-matrix parameters.

Key features:

  • Implements Muon's Newton-Schulz orthogonalization for matrix parameters (2D+)
  • Falls back to AdamW for scalar/vector parameters and for embeddings/heads (routing rule sketched after this list)
  • Fully compatible with distributed training (FSDP)
  • Includes comprehensive metric tracking for monitoring
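
A minimal sketch of the routing rule described above (the helper name and the exact string matching are hypothetical; the PR's actual code may differ): parameters with two or more dimensions are handled by Muon, while scalars/vectors, embeddings, and the output head fall back to AdamW.

```python
import torch

def uses_muon(name: str, param: torch.nn.Parameter) -> bool:
    """Hypothetical routing rule: Muon for weight matrices, AdamW for the rest."""
    if param.ndim < 2:                          # biases, norm scales, other scalars/vectors
        return False
    if "embed" in name or "lm_head" in name:    # embeddings and the output head stay on AdamW
        return False
    return True                                 # 2D+ weights (conv filters are reshaped to 2D first)
```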

Implementation details:

  • Based on the original Muon paper and its reference implementation (orthogonalization step sketched after this list)
  • Adds distributed metric collection and reduction
  • Handles conv filters through reshaping
  • Supports selective weight updates and gradient clipping
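
For context, the orthogonalization step in the public Muon reference implementation looks roughly like the sketch below; the function name is illustrative, and the coefficients, iteration count, and bfloat16 cast follow that reference code rather than necessarily matching this PR line for line. It also shows the conv-filter reshaping mentioned above.

```python
import torch

def newton_schulz_orthogonalize(grad: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a gradient matrix with a quintic Newton-Schulz
    iteration (coefficients from the public Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    orig_shape = grad.shape
    X = grad.bfloat16()
    if X.ndim > 2:                      # conv filters: (out, in, kh, kw) -> (out, in*kh*kw)
        X = X.reshape(X.shape[0], -1)
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the smaller Gram matrix
        X = X.T
    X = X / (X.norm() + eps)            # Frobenius normalization bounds the top singular value by 1
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.reshape(orig_shape).to(grad.dtype)
```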

Testing:

  • Tested on single GPU/CPU with a comprehensive test suite
  • Mock tests verify distributed code paths
  • Convergence verified on regression tasks
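
As an illustration of the kind of convergence check meant here, a minimal regression smoke test could look like the following. The import path and constructor signature for `MuonW` are assumptions, not taken from the PR; the actual test suite is more comprehensive.

```python
import torch
import torch.nn.functional as F

from olmo.optim import MuonW  # hypothetical import path; depends on where the PR places the optimizer


def test_muonw_converges_on_linear_regression():
    """Smoke test: MuonW should drive the loss well below its starting value on a
    toy least-squares problem. Assumes a standard torch.optim-style constructor."""
    torch.manual_seed(0)
    X = torch.randn(256, 16)
    y = X @ torch.randn(16, 4)                # random linear target
    model = torch.nn.Linear(16, 4, bias=False)
    opt = MuonW(model.parameters(), lr=0.02)  # hypothetical signature
    initial_loss = F.mse_loss(model(X), y).item()
    for _ in range(200):
        opt.zero_grad()
        loss = F.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    assert loss.item() < 0.1 * initial_loss   # loss should drop by more than 10x
```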

Happy to add config integration if there's interest. Tested locally; all core functionality is working.

JenWei0312 · Aug 31 '25 02:08

Hi team, just wanted to gently follow up on this PR for the MuonW optimizer.

I know you're all very busy, so no rush at all. Please let me know if there are any questions, changes, or additional tests I can provide from my end to help move the review process along.

Thanks for your time and for maintaining this great project!

JenWei0312 · Sep 17 '25 02:09

Hi there, thanks for your contribution and interest! We apologize for the delay in responding to your PR; we are indeed at a busy time of year. We will take a look at this as soon as we can!

baileykuehl · Sep 17 '25 19:09