[BUG] recompute leads to incorrect "load_balancing_loss"

Open bugm opened this issue 9 months ago • 3 comments

Describe the bug: When recompute is enabled in the MoE layer, the code at https://github.com/NVIDIA/Megatron-LM/blob/f715dd857be63ca6811577baf2192f13211e5216/megatron/core/transformer/moe/router.py#L251 causes "save_to_aux_losses_tracker" to be called twice, so the "load_balancing_loss" value recorded in the logs is doubled.

The call should be skipped during the recompute forward pass.
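
To illustrate the double counting, here is a minimal pure-Python sketch. It is not Megatron-LM code: `tracker`, `save_to_tracker`, `router_forward`, `run_step`, and the `is_recompute` / `skip_on_recompute` flags are hypothetical stand-ins, and the recompute is simulated by simply calling the forward function a second time.

```python
# Hypothetical stand-in for an aux-loss tracker; NOT Megatron-LM's implementation.
tracker = {"load_balancing_loss": 0.0}


def save_to_tracker(name, value):
    """Accumulate a loss value under `name` (assumed accumulate-style tracker)."""
    tracker[name] += value


def router_forward(aux_loss, is_recompute=False, skip_on_recompute=True):
    """Simulated router forward that records its aux loss.

    `is_recompute` and `skip_on_recompute` are illustrative flags only.
    """
    if not (is_recompute and skip_on_recompute):
        save_to_tracker("load_balancing_loss", aux_loss)
    return aux_loss


def run_step(aux_loss, skip_on_recompute):
    """Simulate one training step: a forward pass plus a recompute forward pass."""
    tracker["load_balancing_loss"] = 0.0
    # Regular forward pass.
    router_forward(aux_loss, is_recompute=False, skip_on_recompute=skip_on_recompute)
    # Recompute reruns the forward pass during backward.
    router_forward(aux_loss, is_recompute=True, skip_on_recompute=skip_on_recompute)
    return tracker["load_balancing_loss"]


buggy = run_step(0.5, skip_on_recompute=False)  # recorded twice
fixed = run_step(0.5, skip_on_recompute=True)   # recorded once
```

With the guard disabled the tracked value is twice the per-step loss; with it enabled the value is recorded once. How the real code detects that it is inside a recompute pass is not shown here.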

By the way, the call does not set "avg_group" for "save_to_aux_losses_tracker", which means the "load_balancing_loss" value in the logs comes only from the last rank and is not averaged over the data parallel group.
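
The difference between the two logging behaviours can be sketched without any distributed setup. The snippet below is an assumption-laden simulation: the per-rank losses are made-up illustrative numbers, and `logged_value_last_rank` and `logged_value_dp_average` are hypothetical helpers, not Megatron-LM functions. In a real run the averaging would presumably be an all-reduce over the data parallel group.

```python
def logged_value_last_rank(per_rank_losses):
    """Value that reaches the logs when only the last rank's loss is reported."""
    return per_rank_losses[-1]


def logged_value_dp_average(per_rank_losses):
    """Value that reaches the logs when the loss is averaged over all ranks."""
    return sum(per_rank_losses) / len(per_rank_losses)


# Illustrative per-rank load balancing losses for a 4-rank data parallel group.
per_rank = [0.25, 0.5, 0.75, 1.5]

last_only = logged_value_last_rank(per_rank)
averaged = logged_value_dp_average(per_rank)
```

When ranks see different data, the last rank's value can differ noticeably from the group average, so the unaveraged number is a noisy view of the actual loss.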

bugm • Mar 13 '25 11:03

I was trying to fix the first part (the duplicate call during recompute) in #1433.

lyuwen • Mar 14 '25 07:03

Thanks for reporting the issue. This should be fixed in commit https://github.com/NVIDIA/Megatron-LM/commit/e6d56d6828c0773f55772b92b2ec0eed5639665e.

yanring • Mar 15 '25 02:03

Marking as stale. No activity in 60 days.

github-actions[bot] • May 14 '25 18:05

@bugm @lyuwen Thank you for the contributions! Please feel free to reopen the issue if the bug is not fixed.

sbhavani • Jul 25 '25 17:07