ERROR
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:891, internal error, NCCL version 21.0.3 ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 36621) of binary: /home/xxx/anaconda3/envs/bev/bin/python
Run again with NCCL_DEBUG=INFO set and post the log. It should show what went wrong and point to the likely reason for the crash.
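If it is easier than exporting the variable in your shell, here is a minimal sketch of setting it from Python. It assumes NCCL reads NCCL_DEBUG when it initializes, so the variable has to be set before the process group is created. The helper name with_nccl_debug, the script name train.py and the torchrun arguments are placeholders, not something from your setup.

```python
import os


def with_nccl_debug(env=None, level="INFO"):
    """Return a copy of the environment with NCCL_DEBUG set to `level`."""
    new_env = dict(os.environ if env is None else env)
    new_env["NCCL_DEBUG"] = level
    return new_env


# Option 1: set it in the current process, before the first NCCL communicator
# is created (in PyTorch, before torch.distributed.init_process_group).
os.environ["NCCL_DEBUG"] = "INFO"

# Option 2: hand the modified environment to a launcher subprocess, e.g.
#   subprocess.run(["torchrun", "--nproc_per_node=1", "train.py"],
#                  env=with_nccl_debug())
# (train.py and the torchrun arguments are placeholders.)
print("NCCL_DEBUG =", os.environ["NCCL_DEBUG"])
```

Either way, the NCCL INFO lines end up in the job's normal output, so capture that output in full when you post it.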
If you use Docker containers, note that they default to limited shared and pinned memory resources. When using NCCL inside a container, it is recommended that you increase these resources by issuing:
--shm-size=32g --ulimit memlock=-1
in the command line to nvidia-docker run.
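To check what the container actually got, something like the following can be run inside it. This is only a sketch: it assumes a Linux container where shared memory is mounted at /dev/shm and where the pinned-memory limit corresponds to RLIMIT_MEMLOCK. The helper names are made up for illustration.

```python
import os
import resource
import shutil


def shm_total_gib(path="/dev/shm"):
    """Total size of the filesystem mounted at `path`, in GiB."""
    return shutil.disk_usage(path).total / 2**30


def format_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


# Shared memory available to the container (assumed to live at /dev/shm).
if os.path.isdir("/dev/shm"):
    print(f"/dev/shm size: {shm_total_gib():.2f} GiB")

# Locked (pinned) memory limit for the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print(f"memlock soft limit: {format_limit(soft)}")
print(f"memlock hard limit: {format_limit(hard)}")
```

With the flags above applied, /dev/shm should report roughly 32 GiB and the memlock limits should show as unlimited. If they do not, the options probably did not reach the container.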