[BUG] recompute leads to incorrect "load_balancing_loss"
Describe the bug
When recompute is enabled in the MoE layer, the code at https://github.com/NVIDIA/Megatron-LM/blob/f715dd857be63ca6811577baf2192f13211e5216/megatron/core/transformer/moe/router.py#L251
causes "save_to_aux_losses_tracker" to be called twice, which results in a doubled load_balancing_loss value being recorded in the logs.
The call should be skipped during the recompute forward pass.
By the way, "avg_group" is not set for save_to_aux_losses_tracker, which means the load_balancing_loss value in the logs comes only from the last rank and is not averaged over the data parallel group.
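To illustrate both effects, here is a toy simulation in plain Python. It is not Megatron-LM code: `AuxLossTracker`, `router_forward`, `run_layer_with_recompute`, `average_over_ranks`, and the `skip_on_recompute` flag are all hypothetical names made up for this sketch, and the loss values are arbitrary. It only assumes that the tracker accumulates whatever it is given, and that the recompute pass reruns the same forward function.

```python
# Toy simulation only; every name here is hypothetical, not a Megatron-LM API.


class AuxLossTracker:
    """Accumulates one logged value per layer, like a per-iteration loss log."""

    def __init__(self, num_layers):
        self.values = [0.0] * num_layers

    def save(self, layer_idx, loss):
        self.values[layer_idx] += loss


def router_forward(tracker, layer_idx, loss, is_recompute, skip_on_recompute):
    """Stand-in for a router forward pass that logs its auxiliary loss."""
    if is_recompute and skip_on_recompute:
        # The loss is still returned; only the logging side effect is skipped.
        return loss
    tracker.save(layer_idx, loss)
    return loss


def run_layer_with_recompute(tracker, layer_idx, loss, skip_on_recompute):
    """Run the forward once normally and once more as the recompute pass."""
    router_forward(tracker, layer_idx, loss, False, skip_on_recompute)
    router_forward(tracker, layer_idx, loss, True, skip_on_recompute)


def average_over_ranks(per_rank_values):
    """Mean over ranks, standing in for a reduction over the data parallel group."""
    return sum(per_rank_values) / len(per_rank_values)


# Effect 1: without the guard, the logged value is doubled.
buggy = AuxLossTracker(num_layers=1)
run_layer_with_recompute(buggy, 0, 0.5, skip_on_recompute=False)
fixed = AuxLossTracker(num_layers=1)
run_layer_with_recompute(fixed, 0, 0.5, skip_on_recompute=True)
print(buggy.values[0], fixed.values[0])  # 1.0 0.5

# Effect 2: reporting only the last rank differs from the group average.
per_rank = [0.25, 0.5, 0.75, 1.0]  # arbitrary per-rank loss values
print(per_rank[-1], average_over_ranks(per_rank))  # 1.0 0.625
```

In the real code the guard would need some way to know that the current forward is a recompute pass, and the averaging would be a collective over the data parallel group; how either is wired up in Megatron-LM is not shown here.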
I was trying to fix the first part (the double recording) in #1433.
Thanks for reporting the issue. This should be fixed in commit https://github.com/NVIDIA/Megatron-LM/commit/e6d56d6828c0773f55772b92b2ec0eed5639665e.
Marking as stale. No activity in 60 days.
@bugm @lyuwen thank you for the contributions! Please feel free to reopen if the bug is not fixed.