[BUG] Deadlock when using GQA and MCore
Describe the bug
I ran a NeMo training job and it gets stuck when pipeline_model_parallel_size > 1 on 8 GPUs.
I run the job with mainline NeMo and Megatron-LM code, and enable mcore_gpt: True and num_query_groups: 4.
I also installed the latest Transformer Engine, and disabled flash attention and fused attention by prepending NVTE_FLASH_ATTN=0 NVTE_FUSED_ATTN=0 to the command (python ...py).
To Reproduce: run mainline NeMo and Megatron-LM with the configuration above.
Expected behavior: the job trains normally instead of hanging.
Stack trace/logs: I printed the tensor shapes sent and received in forward_backward_pipelining_without_interleaving:

rank: 0 recv_tensor_shapes: [(2048, 1, 768)] send_tensor_shapes: [(2048, 1, 768)]
rank: 1 recv_tensor_shapes: [(2048, 1, 768)] send_tensor_shapes: [(2048, 1, 768)]
rank: 0 input_tensor [None] rank: 0 output_tensor[0] torch.Size([128, 1, 768])

The schedule expects to send and receive tensors of shape (2048, 1, 768), but the stage actually produces an output of torch.Size([128, 1, 768]). Also, input_tensor = recv_forward(recv_tensor_shapes, config) returns None — is that expected?
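For context (my own analysis, not from the Megatron-LM code): a mismatch like this typically surfaces as a hang rather than a crash, because the downstream rank posts a receive for 2048 * 1 * 768 elements while the upstream rank sends only 128 * 1 * 768, so the receive blocks forever. A hypothetical sanity check (check_pipeline_shapes is my own helper, not a Megatron-LM API) that could be called right before the forward send to turn the silent deadlock into an immediate error:

```python
def check_pipeline_shapes(send_tensor_shapes, output_tensors):
    """Raise instead of deadlocking when the tensors about to be sent do not
    match the shapes the next pipeline stage expects to receive."""
    for expected, tensor in zip(send_tensor_shapes, output_tensors):
        # Accept torch tensors (which have .shape) or plain shape tuples.
        actual = tuple(tensor.shape) if hasattr(tensor, "shape") else tuple(tensor)
        if tuple(expected) != actual:
            raise RuntimeError(
                f"Pipeline send/recv shape mismatch: schedule expects "
                f"{tuple(expected)} but the stage produced {actual}; the "
                f"matching recv on the next rank would block forever."
            )
```

With a check like this, the run above would fail fast and report the mismatched shapes (2048, 1, 768) vs (128, 1, 768) instead of stalling inside NCCL point-to-point communication.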
Environment (please complete the following information):
- Megatron-LM commit ID: 2ea701c78ed03924069846fcfd445f3415be7b56
- Transformer Engine version: 0.12.0.dev0+630a131
- PyTorch version: torch 2.0.0a0+1767026
- CUDA version: 12.1
- NCCL version: 2.17.1
Marking as stale. No activity in 60 days.
Hi, have you solved this problem?