verl
[megatron] fix: expose moe_aux_loss_coeff and moe_z_loss_coeff to improve MoE load balancing
What does this PR do?
When training MoE models with Megatron as the backend, enabling Expert Parallelism (EP) can cause load imbalance across experts, which makes the update_actor step take progressively longer as training proceeds. Enabling the auxiliary load-balancing loss (LBL) and the z-loss alleviates this behavior, so this PR exposes `moe_aux_loss_coeff` and `moe_z_loss_coeff` so they can be configured from the training script.
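For illustration, a minimal sketch of how a run script might pass these coefficients through the Megatron transformer-config overrides is shown below. The override path (`actor_rollout_ref.actor.megatron.override_transformer_config`), the router setting, and the coefficient values are assumptions made for this example; the script added in this PR is the authoritative reference.

```bash
# Sketch (not the PR's actual script): enable the MoE load-balancing (aux) loss
# and z-loss when training an MoE model with the Megatron backend.
# The override path, router setting, and coefficient values below are
# illustrative assumptions; adjust them to match the real config keys.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.strategy=megatron \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_router_load_balancing_type=aux_loss \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_aux_loss_coeff=0.001 \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_z_loss_coeff=0.001 \
    "$@"
```

Small coefficients (on the order of 1e-3) are a common starting point: large enough to push the router toward a balanced expert assignment, small enough not to dominate the task loss.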
Good job! It might be better if we add some description in the script file?
Thanks for the reminder! I've added the corresponding comments to the script to clarify the purpose of these parameters.