[BUG] Crash when enabling --tp-comm-overlap
Describe the bug
Training crashes when --tp-comm-overlap is enabled in examples/pretrain_gpt_distributed_with_mp.sh.
To Reproduce
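Roughly how the flag was enabled (a minimal sketch; the GPT_ARGS/DISTRIBUTED_ARGS variable names follow the example script's convention, and other arguments are omitted):

```bash
# In examples/pretrain_gpt_distributed_with_mp.sh, append the flag to the
# training arguments (sketch; remaining arguments unchanged):
GPT_ARGS="$GPT_ARGS --tp-comm-overlap"

torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS
```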
Environment (please complete the following information):
- Megatron-LM commit ID: 9290c730d04b482be8fae92a4186fe4ff0c95270
- PyTorch Docker: nvcr.io/nvidia/pytorch 23.10-py3
Also, how should --tp-comm-overlap-cfg be configured?
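A hypothetical sketch of what such a config file might look like, assuming per-GEMM sections with Transformer Engine userbuffers-style options (the section and key names here are assumptions, not confirmed for this commit):

```bash
# Hypothetical tp-comm-overlap config (section/key names assumed from
# Transformer Engine's userbuffers options; not confirmed for this commit):
cat > tp_comm_overlap.yaml <<'EOF'
qkv_fprop:
  method: ring_exchange   # overlap method, e.g. ring_exchange / pipeline / bulk
proj_dgrad:
  method: pipeline
  num_splits: 4           # number of chunks the GEMM is split into
EOF
# Then pass it alongside the overlap flag:
#   --tp-comm-overlap --tp-comm-overlap-cfg tp_comm_overlap.yaml
```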
Same problem here.
@zhang662817 Hello, were you able to run it successfully?
Marking as stale. No activity in 60 days.
I have the same problem. Have you solved it?
I have the same problem
I have the same problem. :(
@erhoo82 Can you take a look into this?
Can you try --mpi=pmix in your srun script?
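For example (a minimal sketch; node/task counts and the launch target are placeholders):

```bash
# Pass the PMIx plugin to srun so the MPI runtime that --tp-comm-overlap
# bootstraps against is initialized (counts and paths are placeholders):
srun --mpi=pmix -N $NNODES --ntasks-per-node=$GPUS_PER_NODE \
    python pretrain_gpt.py $GPT_ARGS --tp-comm-overlap
```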
Tensor-parallel communication overlap uses MPI bootstrapping; we are working on moving to NCCL-based bootstrapping.
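Since the bootstrap happens over MPI, the MPI stack inside the container must come up under srun. A quick sanity check, assuming mpi4py is available in the container (typical for nvcr.io PyTorch images, but not guaranteed):

```bash
# Verify that MPI initializes under srun --mpi=pmix before running training
# (assumes mpi4py is installed in the container):
srun --mpi=pmix -N 1 -n 2 \
    python -c "from mpi4py import MPI; c = MPI.COMM_WORLD; print(c.Get_rank(), '/', c.Get_size())"
```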