BEVFormer

NCCL Error on WSL2

Open samueleruffino99 opened this issue 11 months ago • 3 comments

When I run either training or testing of the model on a single GPU (for example, ./tools/fp16/dist_train.sh ./projects/configs/bevformer_fp16/bevformer_tiny_fp16.py 1), I get this error:

RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 439853) of binary

Do you know how to fix it? PS: I am running it on WSL2
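
For anyone trying to reproduce this, here is a minimal sketch of relaunching the same command with NCCL's verbose logging enabled, which normally reports which system call failed. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are documented NCCL environment variables. The helper names `nccl_debug_env` and `run_with_nccl_debug` are hypothetical and not part of BEVFormer. This only gathers diagnostics and is not a confirmed fix for WSL2.

```python
import os
import subprocess

# Command taken verbatim from the report above (single GPU).
TRAIN_CMD = [
    "./tools/fp16/dist_train.sh",
    "./projects/configs/bevformer_fp16/bevformer_tiny_fp16.py",
    "1",
]


def nccl_debug_env(base_env=None):
    """Return a copy of the environment with verbose NCCL logging enabled.

    Hypothetical helper for illustration; it does not change NCCL behaviour,
    it only makes NCCL print more detail about what it is doing.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"             # log NCCL initialisation and error details
    env["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # restrict output to init and network setup
    return env


def run_with_nccl_debug(cmd=TRAIN_CMD):
    """Run the training script with NCCL debug logging and return its exit code."""
    return subprocess.run(cmd, env=nccl_debug_env()).returncode
```

Calling `run_with_nccl_debug()` from the repository root should reproduce the failure with additional NCCL log lines before the `ncclSystemError`, which can then be attached to the issue.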

samueleruffino99, Mar 07 '24 10:03