NCCL watchdog thread terminated with exception
Hello,
I have two H100 devices and am running an application via DeepSpeedChat. I have trained LLama2-Chat-hf 3 or 4 times before, and the training finished. Now the training either starts and crashes partway through, or does not start at all and throws the error I share below. I really couldn't solve this problem; what should I do?
The error occurs with a single GPU as well as with multiple GPUs.
Do I need to upgrade or downgrade the CUDA versions? I would appreciate your help.
#ERROR
[E ProcessGroupNCCL.cpp:915] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd75cf92617 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fd75cf4d98d in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fd806cea9f8 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7fd6e8500af0 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fd6e8504918 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x24b (0x7fd6e851b15b in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7fd6e851b468 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7fd72d0dbbf4 in /usr/anaconda3/envs/train/bin/../lib/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fd80a894ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126a40 (0x7fd80a926a40 in /lib/x86_64-linux-gnu/libc.so.6)
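
The error text itself suggests passing CUDA_LAUNCH_BLOCKING=1. A minimal sketch of how that could be set from Python before the training code runs is shown here. The helper name `apply_debug_env` is hypothetical, and adding NCCL_DEBUG=INFO (NCCL's logging variable) is an assumption on top of what the message recommends, not something the error asks for.

```python
import os

# Debug settings: CUDA_LAUNCH_BLOCKING comes from the error message;
# NCCL_DEBUG is an extra assumption to get more detail from NCCL.
DEBUG_ENV = {
    "CUDA_LAUNCH_BLOCKING": "1",  # synchronous kernel launches, so the trace points nearer the failing call
    "NCCL_DEBUG": "INFO",         # ask NCCL to log initialization and error details
}


def apply_debug_env(env=None):
    """Copy the debug settings into `env` (default: os.environ) and return it."""
    target = os.environ if env is None else env
    target.update(DEBUG_ENV)
    return target


if __name__ == "__main__":
    apply_debug_env()
    # Import torch / start training only after this point,
    # so the variables are already set when CUDA is initialized.
```

The variables have to be in place before CUDA is initialized, and each worker process started by the launcher needs to see them, so exporting them in the shell before launching may be the simpler route.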
Versions ENV List:
Python 3.10.12
CUDA version = 12.1.1
cuDNN version = 8.9.2.26
NCCL version = 2.18.1
absl-py==2.0.0
accelerate==0.24.1
aiohttp==3.8.6
aiosignal==1.3.1
async-timeout==4.0.3
attrs==23.1.0
cachetools==5.3.2
certifi==2023.7.22
charset-normalizer==3.3.1
datasets==2.14.6
deepspeed==0.11.1
dill==0.3.7
filelock==3.13.1
frozenlist==1.4.0
fsspec==2023.10.0
google-auth==2.23.3
google-auth-oauthlib==1.1.0
grpcio==1.59.2
hjson==3.1.0
huggingface-hub==0.17.3
idna==3.4
Jinja2==3.1.2
Markdown==3.5
MarkupSafe==2.1.3
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.15
networkx==3.2.1
ninja==1.11.1.1
numpy==1.26.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.52
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
packaging==23.2
pandas==2.1.2
Pillow==10.1.0
protobuf==3.20.3
psutil==5.9.6
py-cpuinfo==9.0.0
pyarrow==13.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pydantic==1.10.13
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.10.3
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
safetensors==0.4.0
sentencepiece==0.1.99
six==1.16.0
sympy==1.12
tensorboard==2.15.0
tensorboard-data-server==0.7.2
tokenizers==0.14.1
torch==2.1.0
torchaudio==2.1.0
torchvision==0.16.0
tqdm==4.66.1
transformers==4.35.0
triton==2.1.0
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.0.7
Werkzeug==3.0.1
xxhash==3.4.1
yarl==1.9.2
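
To cross-check the versions listed above against what PyTorch itself reports at runtime, something like the following sketch may help. The function name `report_versions` is hypothetical; it only reads version attributes and returns a partial result when torch is not installed or no GPU is visible.

```python
def report_versions():
    """Collect the CUDA-related versions that the installed torch build reports."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment; nothing more to report
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,           # CUDA version torch was built against
        "cudnn": torch.backends.cudnn.version(),    # None if cuDNN is unavailable
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["nccl"] = torch.cuda.nccl.version()    # NCCL version bundled with torch
        info["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return info


if __name__ == "__main__":
    for key, value in report_versions().items():
        print(f"{key}: {value}")
```

If the values printed here differ from the system-wide CUDA, cuDNN, or NCCL versions, that difference would be worth including in the report.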