JIAMING LIU

Results: 2 comments by JIAMING LIU

> Hi, I had the same problem while running the training code. It seems that there is a deadlock with NCCL 2.7.8 [(check here)](https://github.com/pytorch/pytorch/issues/47885). Try using `export NCCL_P2P_DISABLE=1` before using...

One way to tackle this is to load the checkpoint and optimizer state before wrapping the model in DDP, as suggested in https://github.com/pytorch/pytorch/issues/23138. If there are other workarounds, please leave a comment here. Thanks.
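For anyone landing here, this is roughly what "load first, wrap second" looks like. It is only a minimal sketch, not the code from the linked issue: the helper name `restore_then_wrap` and the checkpoint layout (a dict with `"model"` and `"optimizer"` keys) are assumptions for illustration, and it expects `torch.distributed` to be initialised already and the model to be on its target device.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def restore_then_wrap(model, optimizer, ckpt_path, device, device_ids=None):
    """Load model/optimizer state first, then wrap the model in DDP.

    Hypothetical helper. Assumes the process group is already initialised,
    `model` already lives on `device`, and the checkpoint is a dict with
    "model" and "optimizer" keys (an assumed layout, adjust to your own).
    """
    # Without map_location, torch.load restores tensors to the device they
    # were saved from, so every rank could end up allocating on the same GPU.
    state = torch.load(ckpt_path, map_location=device)

    # Restore into the plain (unwrapped) module, so the state_dict keys
    # carry no "module." prefix.
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])

    # Wrap only after the weights are in place.
    return DDP(model, device_ids=device_ids)
```

On a multi-GPU run you would typically pass something like `device=torch.device("cuda", local_rank)` and `device_ids=[local_rank]`, where `local_rank` is whatever your launcher provides. On CPU with the gloo backend, leave `device_ids` as `None`.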