2 comments of JIAMING LIU
> Hi, I had the same problem while running the training code. It seems that there is a deadlock with NCCL 2.7.8 [(check here)](https://github.com/pytorch/pytorch/issues/47885). Try using `export NCCL_P2P_DISABLE=1` before using...
One way to tackle this is to load the checkpoint and optimizer state before wrapping the model in DDP, as suggested in https://github.com/pytorch/pytorch/issues/23138. If there are other workarounds, please leave a comment here. Thanks.
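For anyone who wants to try the ordering described above, here is a minimal sketch, not the exact code from the linked issue. The helper names `build_and_restore` and `run_demo`, the checkpoint keys `"model"` and `"optimizer"`, and the tiny `nn.Linear` model are all assumptions made for illustration. The demo runs as a single CPU process on the `gloo` backend only so it can be checked without GPUs; a real job would use its own launcher, backend, and model.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def build_and_restore(ckpt_path, device):
    """Restore model and optimizer state first, then wrap the model in DDP."""
    model = nn.Linear(4, 2)  # hypothetical stand-in for the real model
    # Load onto CPU so ranks do not all map tensors to the device they were saved from.
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state["model"])  # "model" key is an assumed layout
    model.to(device)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    optimizer.load_state_dict(state["optimizer"])  # "optimizer" key is assumed

    # Wrap only after both state dicts have been loaded.
    device_ids = [device.index] if device.type == "cuda" else None
    ddp_model = DDP(model, device_ids=device_ids)
    return ddp_model, optimizer


def run_demo():
    """Single-process CPU demo: save a checkpoint, then restore it before DDP."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29512")  # arbitrary free port, assumed
    dist.init_process_group("gloo", rank=0, world_size=1)
    try:
        source = nn.Linear(4, 2)
        source_opt = torch.optim.SGD(source.parameters(), lr=0.1, momentum=0.9)
        reference = {k: v.clone() for k, v in source.state_dict().items()}
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "ckpt.pt")
            torch.save(
                {"model": source.state_dict(), "optimizer": source_opt.state_dict()},
                path,
            )
            ddp_model, optimizer = build_and_restore(path, torch.device("cpu"))
        return ddp_model, optimizer, reference
    finally:
        dist.destroy_process_group()
```

In a multi-GPU run, `device` would typically be `torch.device("cuda", local_rank)` and the process group would be initialized by the launcher's environment rather than by hand as in `run_demo`.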