verl
[megatron] fix: expose moe_aux_loss_coeff and moe_z_loss_coeff to improve MoE load balancing
What does this PR do?
When training MoE models with Megatron as the backend, enabling Expert Parallelism (EP) can cause load imbalance across experts, which makes the update_actor step take progressively longer as training proceeds. Enabling the auxiliary load-balancing loss (LBL) and the z-loss alleviates this behavior, so this PR exposes `moe_aux_loss_coeff` and `moe_z_loss_coeff` so they can be configured from the training script.
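For illustration, a minimal sketch of how a run script might pass these coefficients through the Megatron transformer-config overrides is shown below. The override path (`actor_rollout_ref.actor.megatron.override_transformer_config`), the router setting, and the coefficient values are assumptions made for this example; the script added in this PR is the authoritative reference.

```bash
# Sketch (not the PR's actual script): enable the MoE load-balancing (aux) loss
# and z-loss when training an MoE model with the Megatron backend.
# The override path, router setting, and coefficient values below are
# illustrative assumptions; adjust them to match the real config keys.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.strategy=megatron \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_router_load_balancing_type=aux_loss \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_aux_loss_coeff=0.001 \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_z_loss_coeff=0.001 \
    "$@"
```

Small coefficients (on the order of 1e-3) are a common starting point: large enough to push the router toward a balanced expert assignment, small enough not to dominate the task loss.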
Good job! It might be better if we add some description in the script file?
Thanks for the reminder! I've added the corresponding comments to the script to clarify the purpose of these parameters.