ERROR

Open sdc-sdd opened this issue 2 years ago • 2 comments

RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:891, internal error, NCCL version 21.0.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 36621) of binary: /home/xxx/anaconda3/envs/bev/bin/python

Mar 18 '24 09:03 sdc-sdd

Run again with NCCL_DEBUG=INFO and post the log. That should show what went wrong and point to the cause of the crash.
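If prefixing the launch command with the variable is awkward, it can also be set from inside the training script. This is only a rough sketch: `enable_nccl_debug` is a made-up helper name, and it assumes the call happens before the process group (and therefore NCCL) is initialized. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are the real NCCL environment variables.

```python
import os


def enable_nccl_debug(level="INFO", subsystems=None):
    """Hypothetical helper: turn on NCCL logging for this process.

    Assumption: this runs before NCCL is initialized (for example before
    torch.distributed.init_process_group); otherwise it may have no effect.
    """
    os.environ["NCCL_DEBUG"] = level
    if subsystems:
        # Optional filter for the log, e.g. "INIT,SHM"
        os.environ["NCCL_DEBUG_SUBSYS"] = subsystems
    # Return what was set so it can be logged next to the crash output
    return {k: v for k, v in os.environ.items() if k.startswith("NCCL_DEBUG")}


if __name__ == "__main__":
    print(enable_nccl_debug("INFO"))
```

Call it at the very top of the entry script, then rerun and attach the full output.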

May 09 '24 00:05 lix19937

If you run inside a Docker container, note that containers default to limited shared and pinned memory resources. When using NCCL inside a container, it is recommended that you increase these resources by adding:

--shm-size=32g --ulimit memlock=-1

to the nvidia-docker run command line.
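To confirm whether the container actually picked up larger limits, something like the following could be run inside it. It is a sketch using only the Python standard library on Linux. The function name `container_memory_limits` is invented for illustration, and it assumes shared memory is mounted at `/dev/shm`.

```python
import os
import resource
import shutil


def container_memory_limits(shm_path="/dev/shm"):
    """Report shared-memory size and the locked-memory limit for this process.

    Assumption: shared memory is mounted at shm_path (the usual Linux location).
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    shm_total = shutil.disk_usage(shm_path).total if os.path.isdir(shm_path) else None
    return {
        "shm_total_bytes": shm_total,  # None if the mount point is missing
        "memlock_soft": soft,          # soft limit as reported by getrlimit
        "memlock_unlimited": soft == resource.RLIM_INFINITY,
    }


if __name__ == "__main__":
    print(container_memory_limits())
```

A Docker container started without the shm-size option typically reports about 64 MB for `/dev/shm`. With memlock=-1 applied, `memlock_unlimited` should come back as `True`.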

May 09 '24 00:05 lix19937