[BUG] Crash when enabling --tp-comm-overlap

Open zhang662817 opened this issue 2 years ago • 7 comments

Describe the bug

Crash when enabling --tp-comm-overlap in examples/pretrain_gpt_distributed_with_mp.sh.

[image]

To Reproduce

[image]

Environment (please complete the following information):

  • Megatron-LM commit ID: 9290c730d04b482be8fae92a4186fe4ff0c95270
  • PyTorch Docker: nvcr.io/nvidia/pytorch 23.10-py3

zhang662817 avatar Nov 28 '23 07:11 zhang662817

Also, how do I configure --tp-comm-overlap-cfg?

zhang662817 avatar Nov 28 '23 08:11 zhang662817

Same problem.

bisunny avatar Dec 15 '23 07:12 bisunny

@zhang662817 Hello, can you run it successfully now?

bisunny avatar Dec 19 '23 03:12 bisunny

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Feb 17 '24 18:02 github-actions[bot]

I have the same problem. Have you solved it?

ZhongYFeng avatar Mar 06 '24 04:03 ZhongYFeng

I have the same problem

zhangyuqin1998 avatar Apr 16 '24 06:04 zhangyuqin1998

I have the same problem. :(

1926627357 avatar May 11 '24 06:05 1926627357

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Jul 10 '24 18:07 github-actions[bot]

@erhoo82 Can you take a look into this?

shanmugamr1992 avatar Jul 10 '24 18:07 shanmugamr1992

Can you try --mpi=pmix in your srun script? Tensor-parallel communication overlap currently uses MPI bootstrapping. We are working on moving to NCCL.

erhoo82 avatar Jul 10 '24 18:07 erhoo82
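
For anyone trying the suggestion above, here is a minimal sketch of where --mpi=pmix would go in an srun launch line. It is not taken from the Megatron-LM repository: the node and GPU counts, the tensor-parallel size, the DRY_RUN switch, and the pretrain_gpt.py path are illustrative assumptions, and the remaining training arguments (model size, data paths, rank and master-address setup) are left out. It also assumes your Slurm installation was built with PMIx support.

```shell
#!/usr/bin/env bash
# Sketch only: shows where --mpi=pmix goes when launching through srun.
# All values below are illustrative assumptions, not settings from this issue.
set -euo pipefail

NNODES="${NNODES:-1}"                # assumed node count
GPUS_PER_NODE="${GPUS_PER_NODE:-8}"  # assumed: one srun task per GPU
TP_SIZE="${TP_SIZE:-2}"              # assumed tensor-parallel size
DRY_RUN="${DRY_RUN:-1}"              # 1 = print the command instead of running it

# Build the launch command as an array so each argument stays a separate word.
launch_cmd=(
  srun --mpi=pmix
  --nodes="$NNODES"
  --ntasks-per-node="$GPUS_PER_NODE"
  python pretrain_gpt.py
  --tensor-model-parallel-size "$TP_SIZE"
  --tp-comm-overlap
)

if [[ "$DRY_RUN" == "1" ]]; then
  # Print the command for inspection without contacting Slurm.
  printf '%s ' "${launch_cmd[@]}"
  echo
else
  "${launch_cmd[@]}"
fi
```

With the default DRY_RUN=1 the script only prints the command, which makes it easy to check the flags before submitting. Running `srun --mpi=list` on your cluster shows whether pmix is among the available MPI plugin types.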