[Bug]: NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Your current environment
vLLM 0.4.0.post1 docker image
How it was run:
docker run -d \
--runtime=nvidia \
--gpus '"device=0,1"' \
--shm-size=10.24gb \
-p 5002:5002 \
-e NCCL_IGNORE_DISABLED_P2P=1 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u`:`id -g` \
-v "${HOME}"/.cache:/home/ubuntu/.cache/ -v "${HOME}"/.config:/home/ubuntu/.config/ -v "${HOME}"/.config:/home/ubuntu/.triton/ \
--network host \
vllm/vllm-openai:latest \
--port=5002 \
--host=0.0.0.0 \
--model=mistralai/Mixtral-8x7B-Instruct-v0.1 \
--seed 1234 \
--trust-remote-code \
--tensor-parallel-size=2 \
--dtype auto \
--max-num-batched-tokens 131072 \
--max-log-len=100 \
--download-dir=/home/ubuntu/.cache/huggingface/hub &>> logs.vllm_server.2gpus.mixtral.txt
On:
Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-97-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.3.107
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100 80GB PCIe
GPU 1: NVIDIA A100 80GB PCIe
GPU 2: NVIDIA A100 80GB PCIe
GPU 3: NVIDIA A100 80GB PCIe
Nvidia driver version: 535.161.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 126
On-line CPU(s) list: 0-125
Thread(s) per core: 1
Core(s) per socket: 126
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7763 64-Core Processor
Stepping: 1
CPU MHz: 2445.406
BogoMIPS: 4890.81
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 7.9 MiB
L1i cache: 7.9 MiB
L2 cache: 63 MiB
L3 cache: 16 MiB
NUMA node0 CPU(s): 0-125
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip pku ospke vaes vpclmulqdq rdpid fsrm arch_capabilities
Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] numpy 1.26.3 pypi_0 pypi
[conda] torch 2.1.2 pypi_0 pypi
[conda] triton 2.1.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.2.7
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PHB PHB PHB 0-125 0 N/A
GPU1 PHB X PHB PHB 0-125 0 N/A
GPU2 PHB PHB X PHB 0-125 0 N/A
GPU3 PHB PHB PHB X 0-125 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
After about 5 days of uptime, the server eventually hit this error. Note the endpoint was heavily used for all 5 days; nothing special was happening, apart from maybe more guided_json requests than usual today.
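For context, a guided_json request against this server looks roughly like the hedged sketch below (the schema, prompt, and client usage are illustrative; only the port and model name come from the docker command above):

```python
# Hypothetical example of the kind of guided_json request the endpoint was serving.
# Assumes the openai Python client and vLLM's guided decoding via extra_body.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5002/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

completion = client.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    prompt="Extract the person as JSON: John is 42 years old.",
    max_tokens=128,
    extra_body={"guided_json": schema},  # vLLM-specific guided decoding field
)
print(completion.choices[0].text)
```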
INFO 04-16 00:03:23 metrics.py:218] Avg prompt throughput: 2824.3 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 10.2%, CPU KV cache usage: 0.0%
[36m(RayWorkerVllm pid=7046)[0m [E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
[36m(RayWorkerVllm pid=7046)[0m CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[36m(RayWorkerVllm pid=7046)[0m For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[36m(RayWorkerVllm pid=7046)[0m Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[36m(RayWorkerVllm pid=7046)[0m
[36m(RayWorkerVllm pid=7046)[0m Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
[36m(RayWorkerVllm pid=7046)[0m frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5144192617 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
[36m(RayWorkerVllm pid=7046)[0m frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f514414d98d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
[36m(RayWorkerVllm pid=7046)[0m frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f5144530128 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f4da9f2f250 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f4da9f33078 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x250 (0x7f4da9f49910 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7f4da9f49c18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #7: <unknown function> + 0xdc253 (0x7f5149847253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
[36m(RayWorkerVllm pid=7046)[0m frame #8: <unknown function> + 0x94ac3 (0x7f514b686ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[36m(RayWorkerVllm pid=7046)[0m frame #9: clone + 0x44 (0x7f514b717a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[36m(RayWorkerVllm pid=7046)[0m
[36m(RayWorkerVllm pid=7046)[0m [2024-04-16 00:03:25,069 E 7046 7269] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 1] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
[36m(RayWorkerVllm pid=7046)[0m CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[36m(RayWorkerVllm pid=7046)[0m For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[36m(RayWorkerVllm pid=7046)[0m Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[36m(RayWorkerVllm pid=7046)[0m
[36m(RayWorkerVllm pid=7046)[0m Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
[36m(RayWorkerVllm pid=7046)[0m frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5144192617 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
[36m(RayWorkerVllm pid=7046)[0m frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f514414d98d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
[36m(RayWorkerVllm pid=7046)[0m frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f5144530128 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f4da9f2f250 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f4da9f33078 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x250 (0x7f4da9f49910 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7f4da9f49c18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #7: <unknown function> + 0xdc253 (0x7f5149847253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
[36m(RayWorkerVllm pid=7046)[0m frame #8: <unknown function> + 0x94ac3 (0x7f514b686ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[36m(RayWorkerVllm pid=7046)[0m frame #9: clone + 0x44 (0x7f514b717a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[36m(RayWorkerVllm pid=7046)[0m
[36m(RayWorkerVllm pid=7046)[0m [E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
[36m(RayWorkerVllm pid=7046)[0m CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[36m(RayWorkerVllm pid=7046)[0m For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[36m(RayWorkerVllm pid=7046)[0m Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[36m(RayWorkerVllm pid=7046)[0m
[36m(RayWorkerVllm pid=7046)[0m Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
[36m(RayWorkerVllm pid=7046)[0m frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5144192617 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
[36m(RayWorkerVllm pid=7046)[0m frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f514414d98d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
[36m(RayWorkerVllm pid=7046)[0m frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f5144530128 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f4da9f2f250 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f4da9f33078 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x250 (0x7f4da9f49910 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7f4da9f49c18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #7: <unknown function> + 0xdc253 (0x7f5149847253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
[36m(RayWorkerVllm pid=7046)[0m frame #8: <unknown function> + 0x94ac3 (0x7f514b686ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[36m(RayWorkerVllm pid=7046)[0m frame #9: clone + 0x44 (0x7f514b717a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[36m(RayWorkerVllm pid=7046)[0m
[36m(RayWorkerVllm pid=7046)[0m [2024-04-16 00:03:25,071 E 7046 7284] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 1] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
[36m(RayWorkerVllm pid=7046)[0m CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[36m(RayWorkerVllm pid=7046)[0m For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[36m(RayWorkerVllm pid=7046)[0m Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[36m(RayWorkerVllm pid=7046)[0m
[36m(RayWorkerVllm pid=7046)[0m Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
[36m(RayWorkerVllm pid=7046)[0m frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5144192617 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
[36m(RayWorkerVllm pid=7046)[0m frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f514414d98d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
[36m(RayWorkerVllm pid=7046)[0m frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f5144530128 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f4da9f2f250 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f4da9f33078 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x250 (0x7f4da9f49910 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7f4da9f49c18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[36m(RayWorkerVllm pid=7046)[0m frame #7: <unknown function> + 0xdc253 (0x7f5149847253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
[36m(RayWorkerVllm pid=7046)[0m frame #8: <unknown function> + 0x94ac3 (0x7f514b686ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[36m(RayWorkerVllm pid=7046)[0m frame #9: clone + 0x44 (0x7f514b717a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[36m(RayWorkerVllm pid=7046)[0m
[36m(RayWorkerVllm pid=7046)[0m [2024-04-16 00:03:25,080 E 7046 7269] logging.cc:104: Stack trace:
[36m(RayWorkerVllm pid=7046)[0m /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xfe543a) [0x7f514a97c43a] ray::operator<<()
[36m(RayWorkerVllm pid=7046)[0m /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xfe7b78) [0x7f514a97eb78] ray::TerminateHandler()
[36m(RayWorkerVllm pid=7046)[0m /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f514981920c]
[36m(RayWorkerVllm pid=7046)[0m /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7f5149819277]
[36m(RayWorkerVllm pid=7046)[0m /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae1fe) [0x7f51498191fe]
[36m(RayWorkerVllm pid=7046)[0m /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xc86f5b) [0x7f4da9cb4f5b] c10d::ProcessGroupNCCL::ncclCommWatchdog()
[36m(RayWorkerVllm pid=7046)[0m /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f5149847253]
[36m(RayWorkerVllm pid=7046)[0m /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f514b686ac3]
[36m(RayWorkerVllm pid=7046)[0m /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7f514b717a04] __clone
[36m(RayWorkerVllm pid=7046)[0m
[36m(RayWorkerVllm pid=7046)[0m *** SIGABRT received at time=1713225805 on cpu 21 ***
[36m(RayWorkerVllm pid=7046)[0m PC: @ 0x7f514b6889fc (unknown) pthread_kill
[36m(RayWorkerVllm pid=7046)[0m @ 0x7f514b634520 (unknown) (unknown)
[36m(RayWorkerVllm pid=7046)[0m [2024-04-16 00:03:25,080 E 7046 7269] logging.cc:361: *** SIGABRT received at time=1713225805 on cpu 21 ***
[36m(RayWorkerVllm pid=7046)[0m [2024-04-16 00:03:25,080 E 7046 7269] logging.cc:361: PC: @ 0x7f514b6889fc (unknown) pthread_kill
[36m(RayWorkerVllm pid=7046)[0m [2024-04-16 00:03:25,080 E 7046 7269] logging.cc:361: @ 0x7f514b634520 (unknown) (unknown)
[36m(RayWorkerVllm pid=7046)[0m Fatal Python error: Aborted
[36m(RayWorkerVllm pid=7046)[0m
[36m(RayWorkerVllm pid=7046)[0m
[36m(RayWorkerVllm pid=7046)[0m Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, simplejson._speedups, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, sentencepiece._sentencepiece, pyarrow.lib, pyarrow._hdfsio, pyarrow._json, PIL._imaging, __triton_launcher, cuda_utils (total: 37)
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48b76de617 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f48b769998d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f48b779a128 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f48432c5250 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f48432c9078 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x250 (0x7f48432df910 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7f48432dfc18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7f4887ab0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f48c3b30ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7f48c3bc1a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[2024-04-16 00:03:25,191 E 1 7270] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48b76de617 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f48b769998d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f48b779a128 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f48432c5250 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f48432c9078 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x250 (0x7f48432df910 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7f48432dfc18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7f4887ab0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f48c3b30ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7f48c3bc1a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48b76de617 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f48b769998d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f48b779a128 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f48432c5250 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f48432c9078 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x250 (0x7f48432df910 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7f48432dfc18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7f4887ab0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f48c3b30ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7f48c3bc1a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[2024-04-16 00:03:25,207 E 1 7285] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f48b76de617 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f48b769998d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f48b779a128 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f48432c5250 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f48432c9078 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x250 (0x7f48432df910 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7f48432dfc18 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7f4887ab0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f48c3b30ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7f48c3bc1a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
ERROR 04-16 00:03:25 async_llm_engine.py:43] Engine background task failed
ERROR 04-16 00:03:25 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/workspace/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
ERROR 04-16 00:03:25 async_llm_engine.py:43] task.result()
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/workspace/vllm/engine/async_llm_engine.py", line 479, in run_engine_loop
ERROR 04-16 00:03:25 async_llm_engine.py:43] has_requests_in_progress = await asyncio.wait_for(
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
ERROR 04-16 00:03:25 async_llm_engine.py:43] return fut.result()
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/workspace/vllm/engine/async_llm_engine.py", line 453, in engine_step
ERROR 04-16 00:03:25 async_llm_engine.py:43] request_outputs = await self.engine.step_async()
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/workspace/vllm/engine/async_llm_engine.py", line 213, in step_async
ERROR 04-16 00:03:25 async_llm_engine.py:43] output = await self.model_executor.execute_model_async(
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/workspace/vllm/executor/ray_gpu_executor.py", line 422, in execute_model_async
ERROR 04-16 00:03:25 async_llm_engine.py:43] all_outputs = await self._run_workers_async(
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/workspace/vllm/executor/ray_gpu_executor.py", line 412, in _run_workers_async
ERROR 04-16 00:03:25 async_llm_engine.py:43] all_outputs = await asyncio.gather(*coros)
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 04-16 00:03:25 async_llm_engine.py:43] result = self.fn(*self.args, **self.kwargs)
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 04-16 00:03:25 async_llm_engine.py:43] return func(*args, **kwargs)
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/workspace/vllm/worker/worker.py", line 221, in execute_model
ERROR 04-16 00:03:25 async_llm_engine.py:43] output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 04-16 00:03:25 async_llm_engine.py:43] return func(*args, **kwargs)
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/workspace/vllm/worker/model_runner.py", line 673, in execute_model
ERROR 04-16 00:03:25 async_llm_engine.py:43] output = self.model.sample(
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/workspace/vllm/model_executor/models/mixtral.py", line 394, in sample
ERROR 04-16 00:03:25 async_llm_engine.py:43] next_tokens = self.sampler(logits, sampling_metadata)
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
ERROR 04-16 00:03:25 async_llm_engine.py:43] return self._call_impl(*args, **kwargs)
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
ERROR 04-16 00:03:25 async_llm_engine.py:43] return forward_call(*args, **kwargs)
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/workspace/vllm/model_executor/layers/sampler.py", line 76, in forward
ERROR 04-16 00:03:25 async_llm_engine.py:43] sample_results = _sample(probs, logprobs, sampling_metadata,
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/workspace/vllm/model_executor/layers/sampler.py", line 502, in _sample
ERROR 04-16 00:03:25 async_llm_engine.py:43] return _sample_with_torch(probs, logprobs, sampling_metadata)
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/workspace/vllm/model_executor/layers/sampler.py", line 399, in _sample_with_torch
ERROR 04-16 00:03:25 async_llm_engine.py:43] sample_results = _greedy_sample(seq_groups, greedy_samples)
ERROR 04-16 00:03:25 async_llm_engine.py:43] File "/workspace/vllm/model_executor/layers/sampler.py", line 214, in _greedy_sample
ERROR 04-16 00:03:25 async_llm_engine.py:43] samples = samples.tolist()
ERROR 04-16 00:03:25 async_llm_engine.py:43] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 04-16 00:03:25 async_llm_engine.py:43] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 04-16 00:03:25 async_llm_engine.py:43] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR 04-16 00:03:25 async_llm_engine.py:43] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 04-16 00:03:25 async_llm_engine.py:43]
ERROR:asyncio:Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f47751041f0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f479b4fc910>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f47751041f0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f479b4fc910>>)>
Traceback (most recent call last):
File "/workspace/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
task.result()
File "/workspace/vllm/engine/async_llm_engine.py", line 479, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/workspace/vllm/engine/async_llm_engine.py", line 453, in engine_step
request_outputs = await self.engine.step_async()
File "/workspace/vllm/engine/async_llm_engine.py", line 213, in step_async
output = await self.model_executor.execute_model_async(
File "/workspace/vllm/executor/ray_gpu_executor.py", line 422, in execute_model_async
all_outputs = await self._run_workers_async(
File "/workspace/vllm/executor/ray_gpu_executor.py", line 412, in _run_workers_async
all_outputs = await asyncio.gather(*coros)
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/vllm/worker/worker.py", line 221, in execute_model
output = self.model_runner.execute_model(seq_group_metadata_list,
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/vllm/worker/model_runner.py", line 673, in execute_model
output = self.model.sample(
File "/workspace/vllm/model_executor/models/mixtral.py", line 394, in sample
next_tokens = self.sampler(logits, sampling_metadata)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/vllm/model_executor/layers/sampler.py", line 76, in forward
sample_results = _sample(probs, logprobs, sampling_metadata,
File "/workspace/vllm/model_executor/layers/sampler.py", line 502, in _sample
return _sample_with_torch(probs, logprobs, sampling_metadata)
File "/workspace/vllm/model_executor/layers/sampler.py", line 399, in _sample_with_torch
sample_results = _greedy_sample(seq_groups, greedy_samples)
File "/workspace/vllm/model_executor/layers/sampler.py", line 214, in _greedy_sample
samples = samples.tolist()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/workspace/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 04-16 00:03:25 async_llm_engine.py:154] Aborted request cmpl-dfc7112541c14e93b9996e354d51fe7e-0.
INFO 04-16 00:03:25 async_llm_engine.py:154] Aborted request cmpl-7198ad674747410698402a13a1000014-0.
INFO 04-16 00:03:25 async_llm_engine.py:154] Aborted request cmpl-7f59bec40147410fbd3598b48c7c3d09-0.
INFO 04-16 00:03:25 async_llm_engine.py:154] Aborted request cmpl-834ce903dfa0491b9bc94e76acc1bb02-0.
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
await wrap(partial(self.listen_for_disconnect, receive))
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
message = await receive()
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 568, in receive
await self.message_event.wait()
File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f37e4599390
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
await wrap(partial(self.listen_for_disconnect, receive))
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
message = await receive()
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 568, in receive
await self.message_event.wait()
File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f37dc1f5570
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
await wrap(partial(self.listen_for_disconnect, receive))
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
message = await receive()
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 568, in receive
await self.message_event.wait()
File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f37dc1f5960
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
await wrap(partial(self.listen_for_disconnect, receive))
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
message = await receive()
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 568, in receive
await self.message_event.wait()
File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f37e4703190
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
[2024-04-16 00:03:25,217 E 1 7270] logging.cc:104: Stack trace:
/usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xfe543a) [0x7f47749a243a] ray::operator<<()
/usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xfe7b78) [0x7f47749a4b78] ray::TerminateHandler()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f4887a8220c]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7f4887a82277]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae1fe) [0x7f4887a821fe]
/usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xc86f5b) [0x7f484304af5b] c10d::ProcessGroupNCCL::ncclCommWatchdog()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f4887ab0253]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f48c3b30ac3]
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7f48c3bc1a04] __clone
*** SIGABRT received at time=1713225805 on cpu 77 ***
PC: @ 0x7f48c3b329fc (unknown) pthread_kill
@ 0x7f48c3ade520 (unknown) (unknown)
[2024-04-16 00:03:25,217 E 1 7270] logging.cc:361: *** SIGABRT received at time=1713225805 on cpu 77 ***
[2024-04-16 00:03:25,217 E 1 7270] logging.cc:361: PC: @ 0x7f48c3b329fc (unknown) pthread_kill
[2024-04-16 00:03:25,217 E 1 7270] logging.cc:361: @ 0x7f48c3ade520 (unknown) (unknown)
Fatal Python error: Aborted
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, simplejson._speedups, yaml._yaml, sentencepiece._sentencepiece, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, markupsafe._speedups, pyarrow.lib, pyarrow._hdfsio, pyarrow._json, PIL._imaging, __triton_launcher, cuda_utils, httptools.parser.parser, httptools.parser.url_parser, websockets.speedups, _cffi_backend, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering (total: 77)
[failure_signal_handler.cc : 332] RAW: Signal 11 raised at PC=0x7f48c3ac4898 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1713225805 on cpu 77 ***
PC: @ 0x7f48c3ac4898 (unknown) abort
@ 0x7f48c3ade520 (unknown) (unknown)
[2024-04-16 00:03:25,219 E 1 7285] logging.cc:104: Stack trace:
/usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xfe543a) [0x7f47749a243a] ray::operator<<()
/usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xfe7b78) [0x7f47749a4b78] ray::TerminateHandler()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f4887a8220c]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7f4887a82277]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae1fe) [0x7f4887a821fe]
/usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xc86f5b) [0x7f484304af5b] c10d::ProcessGroupNCCL::ncclCommWatchdog()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f4887ab0253]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f48c3b30ac3]
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7f48c3bc1a04] __clone
@ 0x7f46e4c14640 (unknown) (unknown)
[2024-04-16 00:03:25,221 E 1 7270] logging.cc:361: *** SIGSEGV received at time=1713225805 on cpu 77 ***
[2024-04-16 00:03:25,221 E 1 7270] logging.cc:361: PC: @ 0x7f48c3ac4898 (unknown) abort
[2024-04-16 00:03:25,221 E 1 7270] logging.cc:361: @ 0x7f48c3ade520 (unknown) (unknown)
[2024-04-16 00:03:25,223 E 1 7270] logging.cc:361: @ 0x7f46e4c14640 (unknown) (unknown)
Fatal Python error: Segmentation fault
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, simplejson._speedups, yaml._yaml, sentencepiece._sentencepiece, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, markupsafe._speedups, pyarrow.lib, pyarrow._hdfsio, pyarrow._json, PIL._imaging, __triton_launcher, cuda_utils, httptools.parser.parser, httptools.parser.url_parser, websockets.speedups, _cffi_backend, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering (total: 77)
Experiencing the same when using LoRA requests...
Hi! I'm seeing the same issue when using LoRA. Do you have a solution?
When I load the Llama model, some GPUs hit this while others are fine.
I'm also seeing the same issue on a clean server installation in GCP. My steps to reproduce were (a minimal Python sketch follows the list):
- run an instance in GCP using the c0-deeplearning-common-cu121-v20240417-debian-11 image and 2x A100 40GB GPUs
- log in; it asks to install the NVIDIA driver; accept
- check nvidia-smi: the driver installed successfully
- now I'm in a clean environment with conda (base)
- pip install vllm
- optionally: pip install flash-attn
- run the vLLM OpenAI API server (I used a Code Llama model)
- Got:
CUDA error: an illegal memory access was encountered
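For reference, here is a minimal Python sketch that exercises roughly the same engine path via vLLM's offline API (the model name, parallelism, and prompt are assumptions for illustration; the actual repro used the OpenAI API server):

```python
# Minimal sketch only: model name and settings are placeholders, not the exact repro.
from vllm import LLM, SamplingParams

llm = LLM(
    model="codellama/CodeLlama-7b-Instruct-hf",  # assumed stand-in for "a Code Llama model"
    tensor_parallel_size=2,                      # matches the 2x A100 40GB instance
)
params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["def fibonacci(n):"], params)
print(outputs[0].outputs[0].text)
```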
Still seeing this on Mixtral
INFO: 172.16.0.88:2118 - "POST /v1/completions HTTP/1.1" 200 OK
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7c5ec7d7a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7c5ec7d2ab25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7c5ec818b718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7c5e7ba4ae36 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7c5e7ba4ef38 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7c5e7ba545ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7c5e7ba5531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7c5ec74b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7c5ec8a92ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7c5ec8b23a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7c5ec7d7a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7c5ec7d2ab25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7c5ec818b718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7c5e7ba4ae36 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7c5e7ba4ef38 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7c5e7ba545ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7c5e7ba5531c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7c5ec74b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7c5ec8a92ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7c5ec8b23a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[2024-05-17 07:35:09,516 E 1 6539] logging.cc:101: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 1 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
:
Still seeing this on a totally different H100 system.
Same problem here with H100s and the latest vllm==0.4.2.
@pseudotensor I have discovered an integer overflow in the fused_moe_kernel, a Triton kernel called by MoE models. The overflow will sometimes cause CUDA illegal memory access issues. I don't know if this overflow is the cause of your failure, but since you are using the Mixtral model (a MoE), you might be affected. If you'd like to check, you can add the following assertion here:
tl.device_assert(off_experts * stride_be >= 0, "off_experts * stride_be overflows!")
and then rerun your program with the environment variables CUDA_LAUNCH_BLOCKING=1 and TRITON_DEBUG=1 set (inside the docker container), and with the flag --enforce-eager passed to the docker entrypoint?
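For anyone unsure why that product can go negative, here is a small self-contained illustration of the suspected 32-bit overflow (the numeric values are made up; in the real kernel off_experts and stride_be are integer values inside the Triton kernel):

```python
# Illustration of the suspected int32 overflow; the values are hypothetical.
import numpy as np

off_experts = np.array(5, dtype=np.int32)           # expert index chosen for a block
stride_be = np.array(700_000_000, dtype=np.int32)   # stride of a large expert weight tensor

offset = off_experts * stride_be  # 3_500_000_000 does not fit in int32 and wraps around
print(int(offset))                # -794967296: a negative, hence illegal, memory offset

# This is exactly the condition the suggested on-device check would catch:
# tl.device_assert(off_experts * stride_be >= 0, "off_experts * stride_be overflows!")
```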
Same problem here when running Llama-7B with input_len >= 4096 and tensor_parallel_size > 1, on 8x A800. Did anyone solve it?
@pseudotensor I have discovered an integer overflow in the fused_moe_kernel, a Triton kernel called by MoE models. The overflow will sometimes cause CUDA illegal memory access issues. I don't know if this overflow is the cause of your failure, but since you are using the Mixtral model (a MoE), you might be affected. If you'd like to check, you can add the following assertion here: tl.device_assert(off_experts * stride_be >= 0, "off_experts * stride_be overflows!") and then rerun your program with the environment variables CUDA_LAUNCH_BLOCKING=1 and TRITON_DEBUG=1 set (inside the docker container), and with the flag --enforce-eager passed to the docker entrypoint?
The same error occurs for me. Did you solve it, or is there a way to work around it?
Still seeing this, but only when using LoRA. I am currently using Llama-3-8B with tensor_parallel_size=8 and max_model_len=1250. The same run without LoRA works flawlessly.
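For reference, the failing LoRA path corresponds roughly to this kind of call; this is a hedged sketch using vLLM's LoRA API, where the exact base model checkpoint, adapter name, and adapter path are placeholders:

```python
# Sketch of a LoRA-enabled vLLM run; checkpoint and adapter path are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_lora=True,
    tensor_parallel_size=8,
    max_model_len=1250,
)
outputs = llm.generate(
    ["Write a haiku about GPUs."],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```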
This might be related: https://stackoverflow.com/questions/68106457/pytorch-cuda-error-an-illegal-memory-access-was-encountered
The root problem could be an OOM caused by prefix caching. The suggested fix in the post above is to call torch.cuda.empty_cache(), so that would make sense.
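If someone wants to try that workaround, here is a minimal sketch of the generic PyTorch call from the linked post (it releases PyTorch's cached, unused GPU memory; it is not a vLLM-specific fix and will not undo an already-raised illegal access):

```python
# Workaround sketch from the linked post: release cached CUDA memory blocks.
import torch

def release_cached_gpu_memory() -> None:
    """Return cached, unused GPU memory to the driver (does not free live tensors)."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # let pending kernels finish first
        torch.cuda.empty_cache()   # drop PyTorch's cached allocator blocks

release_cached_gpu_memory()
```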
Closing, since the original Mixtral model no longer hits this on vLLM 0.4.3 and later.