
[BUG] reduce_aux_losses_tracker_across_ranks hangs if the first pipeline stage has no MoE layers

Open i4never opened this issue 9 months ago • 5 comments

Describe the bug https://github.com/NVIDIA/Megatron-LM/blob/8a5521ac4226fbefeeb2a102ebecac32a01d4852/megatron/core/transformer/moe/moe_utils.py#L586-L588 reduce_aux_losses_tracker_across_ranks performs an all_reduce across _PIPELINE_MODEL_PARALLEL_GROUP. If a pipeline stage has no MoE layers, its aux-loss tracker is empty and that rank never enters the collective, so the all_reduce on the other ranks hangs.
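For context, here is a standalone sketch of the mismatched-collective pattern behind the hang (this is not Megatron code; the gloo backend, two-process setup, and tensor length are assumptions for illustration only). Rank 0 plays the first pipeline stage whose tracker is empty, so it issues no all_reduce and the other rank blocks forever:

```python
# Illustration only -- not Megatron code. Assumptions: single node, gloo
# backend, free port 29500. The script deadlocks and must be killed.
import os
import time
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Rank 0: no MoE layers on its stage, so its aux-loss tracker is empty
    # and it issues no collectives. The other rank has per-layer losses.
    tracker = {} if rank == 0 else {"load_balancing_loss": torch.ones(13)}

    for name, values in tracker.items():
        dist.all_reduce(values)  # waits for rank 0, which never joins -> hang
        print(f"rank {rank}: reduced {name}")

    # Keep rank 0 alive so the deadlock on the other rank is observable.
    time.sleep(600)
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```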

To Reproduce Run training with the following flags (a sketch of the resulting layer-to-stage placement follows the list):

--tensor-model-parallel-size 1
--pipeline-model-parallel-size 8
--expert-model-parallel-size 1
--expert-tensor-parallel-size 1
--num-layers 16
--moe-layer-freq "([0]*3+[1]*13)"
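With these flags the 16 layers split into 8 pipeline stages of 2 layers each, and "([0]*3+[1]*13)" makes layers 0-2 dense, so stage 0 holds only dense layers. A small sketch of that arithmetic (the even, non-interleaved layer split is an assumption about the default pipeline placement):

```python
# Sketch of how the flags above map layers onto pipeline stages.
# Assumption: a uniform split of layers across stages, no virtual pipeline.
num_layers = 16
pp_size = 8
moe_pattern = [0] * 3 + [1] * 13  # --moe-layer-freq "([0]*3+[1]*13)"

layers_per_stage = num_layers // pp_size  # 2
for stage in range(pp_size):
    layers = range(stage * layers_per_stage, (stage + 1) * layers_per_stage)
    n_moe = sum(moe_pattern[l] for l in layers)
    print(f"stage {stage}: layers {list(layers)} -> {n_moe} MoE layer(s)")
# stage 0 gets layers [0, 1], both dense -> 0 MoE layers -> empty tracker
```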

Training hangs because the first PP stage has no tracker info, while the other stages have tracker info like {'load_balancing_loss': {'values': tensor([...])}}.

Expected behavior The first PP stage should contribute zero-padded values so that every rank in the pipeline group joins the all_reduce.
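A hedged sketch of what such zero padding could look like (this is a hypothetical workaround, not the upstream fix; the function name, the num_moe_layers_total argument, and the device handling are placeholders, and only the {name: {"values": tensor}} tracker shape from the report is assumed):

```python
# Hypothetical workaround sketch, not Megatron's actual fix: agree on the
# union of loss names across the pipeline group, insert zero placeholders
# on stages with no MoE layers, then reduce so every rank issues the same
# sequence of collectives.
import torch
import torch.distributed as dist


def reduce_tracker_across_pp_ranks(tracker, num_moe_layers_total, pp_group):
    """tracker: {name: {"values": 1-D tensor}}, as observed in the report."""
    # Gather each rank's loss names so all ranks see the same key set.
    local_names = sorted(tracker.keys())
    all_names = [None] * dist.get_world_size(group=pp_group)
    dist.all_gather_object(all_names, local_names, group=pp_group)
    union = sorted({n for names in all_names for n in names})

    for name in union:
        if name not in tracker:
            # Stage has no MoE layers: contribute zeros of the agreed length.
            tracker[name] = {
                "values": torch.zeros(num_moe_layers_total, device="cuda")
            }
        dist.all_reduce(tracker[name]["values"], group=pp_group)
    return tracker
```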

Stack trace/logs N/A

Environment (please complete the following information):

  • Megatron-LM commit ID: 8a5521ac4226fbefeeb2a102ebecac32a01d4852
  • PyTorch version: 2.5.1
  • CUDA version: 12.4
  • NCCL version: 2.21.5

Proposed fix N/A

Additional context N/A

i4never, Mar 03 '25 09:03

This is a known issue. We are currently fixing it. Thanks for reporting.

Victarry, Mar 04 '25 02:03

Can you also please look into this issue posted here: https://github.com/NVIDIA/Megatron-LM/issues/1462#issuecomment-2732642584, as part of this?

arjun-choudhry, Mar 18 '25 10:03

Marking as stale. No activity in 60 days.

github-actions[bot], May 17 '25 18:05

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot], Jul 28 '25 02:07