[BUG] Deadlock when using GQA and MCore
Describe the bug
I ran a NeMo training job and it gets stuck when pipeline_model_parallel_size > 1 on 8 GPUs.
I run the job with mainline NeMo and Megatron-LM code, and enable mcore_gpt: True and num_query_groups: 4.
I also installed the latest Transformer Engine, and disabled flash attention and fused attention by prepending NVTE_FLASH_ATTN=0 NVTE_FUSED_ATTN=0 to the command (python ...py).
To Reproduce: run mainline NeMo and Megatron-LM with the configuration above.
Expected behavior: the job trains normally instead of hanging.
Stack trace/logs: I printed the tensor shapes sent and received in forward_backward_pipelining_without_interleaving:

rank: 0 recv_tensor_shapes: [(2048, 1, 768)] send_tensor_shapes: [(2048, 1, 768)]
rank: 1 recv_tensor_shapes: [(2048, 1, 768)] send_tensor_shapes: [(2048, 1, 768)]
rank: 0 input_tensor [None] rank: 0 output_tensor[0] torch.Size([128, 1, 768])

The schedule expects to send and receive tensors of shape (2048, 1, 768), but the stage actually produces an output of torch.Size([128, 1, 768]). Also, input_tensor = recv_forward(recv_tensor_shapes, config) returns None — is that expected?
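For context (my own analysis, not from the Megatron-LM code): a mismatch like this typically surfaces as a hang rather than a crash, because the downstream rank posts a receive for 2048 * 1 * 768 elements while the upstream rank sends only 128 * 1 * 768, so the receive blocks forever. A hypothetical sanity check (check_pipeline_shapes is my own helper, not a Megatron-LM API) that could be called right before the forward send to turn the silent deadlock into an immediate error:

```python
def check_pipeline_shapes(send_tensor_shapes, output_tensors):
    """Raise instead of deadlocking when the tensors about to be sent do not
    match the shapes the next pipeline stage expects to receive."""
    for expected, tensor in zip(send_tensor_shapes, output_tensors):
        # Accept torch tensors (which have .shape) or plain shape tuples.
        actual = tuple(tensor.shape) if hasattr(tensor, "shape") else tuple(tensor)
        if tuple(expected) != actual:
            raise RuntimeError(
                f"Pipeline send/recv shape mismatch: schedule expects "
                f"{tuple(expected)} but the stage produced {actual}; the "
                f"matching recv on the next rank would block forever."
            )
```

With a check like this, the run above would fail fast and report the mismatched shapes (2048, 1, 768) vs (128, 1, 768) instead of stalling inside NCCL point-to-point communication.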
Environment (please complete the following information):
- Megatron-LM commit ID: 2ea701c78ed03924069846fcfd445f3415be7b56
- Transformer Engine version: 0.12.0.dev0+630a131
- PyTorch version: torch 2.0.0a0+1767026
- CUDA version: 12.1
- NCCL version: 2.17.1
Marking as stale. No activity in 60 days.
Hi, have you solved this problem?