
[ENHANCEMENT] Global Batch Load Balancing for MoE Models

[Open] Taishi-N324 opened this issue 9 months ago · 3 comments


Describe the solution you'd like
Implement global-batch-level load balancing.

[Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models](https://arxiv.org/abs/2501.11873)

Benefits, based on the paper:

  • Improves pre-training perplexity by ~0.1
  • Increases benchmark scores by ~2 points
  • Enables interpretable domain specialization of experts
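To make the request concrete, here is a minimal NumPy sketch of the difference between micro-batch and global-batch computation of the standard auxiliary load-balancing loss (N · Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of tokens routed to expert i and Pᵢ the mean router probability). The function names and the in-process loop are illustrative only; in Megatron-LM the per-micro-batch statistics would instead be accumulated and all-reduced across data-parallel ranks before forming the loss.

```python
import numpy as np

def micro_batch_lbl(router_probs, expert_ids, num_experts):
    """Auxiliary load-balancing loss over a single micro-batch.

    router_probs: (tokens, num_experts) softmax outputs of the router.
    expert_ids:   (tokens,) index of the expert each token was routed to.
    """
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_ids, minlength=num_experts) / len(expert_ids)
    # P_i: mean router probability mass on expert i
    p = router_probs.mean(axis=0)
    return num_experts * np.dot(f, p)

def global_batch_lbl(micro_batches, num_experts):
    """Same loss, but with routing statistics aggregated over all
    micro-batches first (simulating an all-reduce across ranks),
    so imbalance is only penalized at the global-batch level."""
    counts = np.zeros(num_experts)
    prob_sum = np.zeros(num_experts)
    total_tokens = 0
    for probs, ids in micro_batches:
        counts += np.bincount(ids, minlength=num_experts)
        prob_sum += probs.sum(axis=0)
        total_tokens += len(ids)
    f = counts / total_tokens
    p = prob_sum / total_tokens
    return num_experts * np.dot(f, p)
```

Aggregating first is what allows individual micro-batches (e.g. single-domain ones) to be imbalanced, which is the mechanism the paper credits for the interpretable expert specialization listed above.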

Taishi-N324 · Mar 23 '25 18:03

Thanks! Let us take a deeper look @Victarry

yanring · May 04 '25 03:05

We will add this feature to MCore v0.13. The ETA is end of this month.

Victarry · May 07 '25 03:05

Thank you for reviewing and accepting this feature request! I greatly appreciate the support and the ongoing development and maintenance of Megatron-LM.

Taishi-N324 · May 07 '25 06:05

Marking as stale. No activity in 60 days.

github-actions[bot] · Jul 06 '25 18:07

Thanks very much, I’ll try it out right away!

Taishi-N324 · Sep 14 '25 08:09