
[megatron] fix: expose moe_aux_loss_coeff and moe_z_loss_coeff to improve MoE load balancing

Open Kairosxy opened this issue 1 month ago • 2 comments

What does this PR do?

When training MoE models with Megatron as the backend, enabling Expert Parallelism (EP) can lead to load imbalance across experts, which makes the update_actor step take progressively longer as training proceeds. This PR exposes moe_aux_loss_coeff and moe_z_loss_coeff so that the load-balancing auxiliary loss (aux loss, LBL) and the router z-loss can be enabled, which alleviates this behavior.
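For readers unfamiliar with the two terms, here is a minimal pure-Python sketch of a Switch-Transformer-style load-balancing loss and a router z-loss, and of how two coefficients would weight them. It is an illustration under stated assumptions, not Megatron's implementation: the function names, the toy logits, and the coefficient values are hypothetical, and Megatron's exact formulation and normalization may differ.

```python
import math

def softmax(row):
    # Numerically stable softmax over one token's router logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def logsumexp(row):
    # Numerically stable log(sum(exp(x))) over one token's router logits.
    m = max(row)
    return m + math.log(sum(math.exp(x - m) for x in row))

def aux_load_balancing_loss(logits, top_k=1):
    # Switch-style loss: num_experts * sum_e f_e * P_e, where f_e is the
    # fraction of routing slots assigned to expert e and P_e is the mean
    # router probability for expert e. Perfectly even routing gives 1.0.
    num_tokens, num_experts = len(logits), len(logits[0])
    probs = [softmax(row) for row in logits]
    counts = [0] * num_experts
    for p in probs:
        chosen = sorted(range(num_experts), key=lambda e: p[e], reverse=True)[:top_k]
        for e in chosen:
            counts[e] += 1
    f = [c / (num_tokens * top_k) for c in counts]
    mean_p = [sum(p[e] for p in probs) / num_tokens for e in range(num_experts)]
    return num_experts * sum(fe * pe for fe, pe in zip(f, mean_p))

def router_z_loss(logits):
    # Mean squared log-sum-exp of the router logits; penalizes large logits.
    return sum(logsumexp(row) ** 2 for row in logits) / len(logits)

# Toy router logits for 4 tokens and 4 experts (hypothetical data).
balanced = [[5.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
imbalanced = [[5.0, 0.0, 0.0, 0.0] for _ in range(4)]

# Illustrative coefficient values only; these are assumptions, not defaults.
moe_aux_loss_coeff = 1e-2
moe_z_loss_coeff = 1e-3

def extra_loss(logits):
    # The weighted sum that would be added to the main training loss.
    return (moe_aux_loss_coeff * aux_load_balancing_loss(logits)
            + moe_z_loss_coeff * router_z_loss(logits))

print(aux_load_balancing_loss(balanced), aux_load_balancing_loss(imbalanced))
```

With the coefficients left at zero, neither term contributes to the loss, so the router receives no penalty for sending most tokens to a few experts; a nonzero aux coefficient penalizes the imbalanced case more than the balanced one.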

Kairosxy, Nov 12 '25 11:11

Good job! It might be better to add some description of these parameters to the script file.

ISEEKYAN, Nov 12 '25 15:11

> good job! it might be better if we add some description in the script file?

Thanks for the reminder! I've added the corresponding comments to the script to clarify the purpose of these parameters.

Kairosxy, Nov 13 '25 01:11