[BUG] reduce_aux_losses_tracker_across_ranks hangs if first pipeline stage has no MoE layers
Describe the bug
https://github.com/NVIDIA/Megatron-LM/blob/8a5521ac4226fbefeeb2a102ebecac32a01d4852/megatron/core/transformer/moe/moe_utils.py#L586-L588
reduce_aux_losses_tracker_across_ranks performs an all_reduce across _PIPELINE_MODEL_PARALLEL_GROUP. If some pipeline stage has no MoE layers, the all_reduce hangs.
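For context, here is a standalone sketch (hypothetical names, not Megatron's actual code) of why mismatched collective participation deadlocks: the rank whose tracker is empty never issues the all_reduce, so the other pipeline ranks block forever.

```python
# Toy illustration of the deadlock pattern; run with e.g.
#   torchrun --nproc_per_node=2 hang_demo.py
import torch
import torch.distributed as dist

def reduce_tracker(tracker: dict, group=None):
    # Assumed pattern: iterate over locally recorded aux losses and
    # all_reduce each one. Ranks with an empty tracker issue no collectives.
    for _, entry in tracker.items():
        dist.all_reduce(entry["values"], group=group)

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    if rank == 0:
        tracker = {}  # stage with no MoE layers records nothing
    else:
        tracker = {"load_balancing_loss": {"values": torch.ones(13, device="cuda")}}

    reduce_tracker(tracker)  # ranks != 0 block here; rank 0 skips it -> hang
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```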
To Reproduce Run training with:
--tensor-model-parallel-size 1
--pipeline-model-parallel-size 8
--expert-model-parallel-size 1
--expert-tensor-parallel-size 1
--num-layers 16
--moe-layer-freq "([0]*3+[1]*13)"
will hang because the first PP stage has no tracker info while the other stages have tracker info like {'load_balancing_loss': {'values': tensor([...])}} (see the sketch below for why only the first stage is affected).
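A quick sketch of the layer-to-stage arithmetic for this config (assuming a uniform split of 16 layers over 8 stages, i.e. 2 layers per stage) shows that only the first stage ends up with no MoE layers:

```python
# --moe-layer-freq "([0]*3+[1]*13)": first 3 layers dense, remaining 13 are MoE.
num_layers = 16
pp_size = 8
moe_pattern = eval("([0]*3+[1]*13)")

layers_per_stage = num_layers // pp_size
for stage in range(pp_size):
    layers = range(stage * layers_per_stage, (stage + 1) * layers_per_stage)
    moe_layers = [l for l in layers if moe_pattern[l] == 1]
    print(f"pp stage {stage}: moe layers = {moe_layers}")
# pp stage 0: moe layers = []   <- never populates the aux loss tracker
# pp stage 1: moe layers = [3]
# pp stage 2: moe layers = [4, 5]
# ...
```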
Expected behavior The first PP stage should contribute zero-padded values so the all_reduce matches across all pipeline ranks.
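A hedged sketch of one possible fix (the helper name, signature, and tensor shape are assumptions, not Megatron's API): agree on the union of loss names across the pipeline group, then pad missing entries with zeros before reducing.

```python
import torch
import torch.distributed as dist

def reduce_aux_losses_tracker_padded(tracker: dict, num_layers: int, pp_group):
    # Gather the locally recorded loss names from every pipeline rank.
    local_names = sorted(tracker.keys())
    gathered = [None] * dist.get_world_size(group=pp_group)
    dist.all_gather_object(gathered, local_names, group=pp_group)
    union_names = sorted({n for names in gathered for n in names})

    for name in union_names:
        if name not in tracker:
            # A stage with no MoE layers contributes zeros so every rank
            # issues the same sequence of collectives.
            tracker[name] = {"values": torch.zeros(num_layers, device="cuda")}
        dist.all_reduce(tracker[name]["values"], group=pp_group)
```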
Stack trace/logs N/A
Environment (please complete the following information):
- Megatron-LM commit ID: 8a5521ac4226fbefeeb2a102ebecac32a01d4852
- PyTorch version: 2.5.1
- CUDA version: 12.4
- NCCL version: 2.21.5
Proposed fix N/A
Additional context N/A
This is a known issue. We are currently fixing it. Thanks for reporting.
Can you also please look into this issue posted here: https://github.com/NVIDIA/Megatron-LM/issues/1462#issuecomment-2732642584, as part of this?
Marking as stale. No activity in 60 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.