Megatron-LM [BUG] recompute leads to incorrect "load_balancing_loss"

Describe the bug

When recompute is enabled for the MoE layer, the code at https://github.com/NVIDIA/Megatron-LM/blob/f715dd857be63ca6811577baf2192f13211e5216/megatron/core/transformer/moe/router.py#L251 runs a second time during the recompute forward pass, so "save_to_aux_losses_tracker" is called twice. As a result, the load_balancing_loss value recorded in the logs is doubled.

The tracker update should be skipped during the recompute forward pass.
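To illustrate the double counting, here is a minimal sketch in plain Python. It does not import Megatron-LM or PyTorch; `tracker`, `save_to_tracker`, `router_forward`, and the `is_recompute` / `skip_on_recompute` flags are hypothetical stand-ins chosen for this example, and recompute is modeled simply as calling the forward function a second time.

```python
# Toy model of the issue: with activation recomputation, a layer's forward pass
# runs twice per step (once normally, once more during backward).
# All names here are hypothetical stand-ins, not Megatron-LM APIs.

tracker = {"load_balancing_loss": 0.0}


def save_to_tracker(name, value):
    """Accumulate a loss value for logging (stand-in for the real tracker)."""
    tracker[name] += value


def router_forward(aux_loss, is_recompute=False, skip_on_recompute=False):
    """Pretend router forward that records its aux loss as a side effect."""
    if skip_on_recompute and is_recompute:
        return aux_loss  # recomputed pass: leave the tracker untouched
    save_to_tracker("load_balancing_loss", aux_loss)
    return aux_loss


def run_step(aux_loss, skip_on_recompute):
    """One step = original forward + recomputed forward during backward."""
    tracker["load_balancing_loss"] = 0.0
    router_forward(aux_loss, is_recompute=False, skip_on_recompute=skip_on_recompute)
    router_forward(aux_loss, is_recompute=True, skip_on_recompute=skip_on_recompute)
    return tracker["load_balancing_loss"]


buggy = run_step(0.5, skip_on_recompute=False)  # tracker updated on both passes
fixed = run_step(0.5, skip_on_recompute=True)   # tracker updated on first pass only
print(buggy, fixed)
```

With the guard in place the logged value matches the single-pass loss; how the real code detects a recompute pass is not shown here and would depend on the framework.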

Separately, "avg_group" is not set when calling save_to_aux_losses_tracker, which means the load_balancing_loss value in the logs comes only from the last rank and is not averaged over the data-parallel group.
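A small sketch of why this matters, again in plain Python with made-up numbers. The per-rank values and both helper functions are hypothetical; in a real run the average would come from a collective reduction over the data-parallel group rather than a local list.

```python
# Toy model: each data-parallel rank computes its own load_balancing_loss.
# The values below are invented for illustration only.
per_rank_loss = [0.25, 0.5, 0.75, 1.0]  # hypothetical losses on ranks 0..3


def logged_without_avg(values):
    """No reduction: the logging rank (assumed to be the last) reports its own value."""
    return values[-1]


def logged_with_avg(values):
    """Mean over the group, as a sum-reduce divided by the group size would give."""
    return sum(values) / len(values)


last_rank_only = logged_without_avg(per_rank_loss)
dp_average = logged_with_avg(per_rank_loss)
print(last_rank_only, dp_average)
```

The two numbers differ whenever the ranks see different token distributions, so the logged value without averaging can be unrepresentative of the whole batch.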

Mar 13 '25 11:03 bugm

I was trying to fix the first part (the double-counted tracker update) in #1433.

Mar 14 '25 07:03 lyuwen

Thanks for reporting the issue. This should be fixed in commit https://github.com/NVIDIA/Megatron-LM/commit/e6d56d6828c0773f55772b92b2ec0eed5639665e.

Mar 15 '25 02:03 yanring

Marking as stale. No activity in 60 days.

May 14 '25 18:05 github-actions[bot]

@bugm @lyuwen thank you for the contributions! Please feel free to reopen if the bug is not fixed.

Jul 25 '25 17:07 sbhavani
