RuntimeError: CUDA error: device-side assert triggered when running
[2024-03-10 10:31:21,586] [ ERROR] model_rpc.py:178 - Exception in ModelRpcClient:
Traceback (most recent call last):
File "/venv/lib/python3.9/site-packages/sglang/srt/managers/router/model_rpc.py", line 176, in exposed_step
self.forward_step()
File "/venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/venv/lib/python3.9/site-packages/sglang/srt/managers/router/model_rpc.py", line 187, in forward_step
new_batch = self.get_new_fill_batch()
File "/venv/lib/python3.9/site-packages/sglang/srt/managers/router/model_rpc.py", line 285, in get_new_fill_batch
prefix_indices, last_node = self.tree_cache.match_prefix(req.input_ids)
File "/venv/lib/python3.9/site-packages/sglang/srt/managers/router/radix_cache.py", line 52, in match_prefix
value = torch.concat(value)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
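As the trace itself notes, the device-side assert is reported asynchronously, so the Python frame shown (torch.concat in radix_cache) may not be the real culprit. One way to narrow it down is to rerun with synchronous kernel launches. A minimal sketch, assuming the server is launched from Python and that Runtime forwards mem_fraction_static to the server args (model path is a placeholder):

```python
import os

# Force synchronous CUDA kernel launches so the assert is raised at the
# kernel's actual call site instead of at a later torch.concat / torch.sum.
# Set it before the server process starts; child processes inherit the env.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import sglang as sgl

# Illustrative launch; the GPTQ 8-bit LLaVA checkpoint path is a placeholder.
runtime = sgl.Runtime(
    model_path="path/to/llava-1.6-7b-gptq-8bit",
    mem_fraction_static=0.6,
)
sgl.set_default_backend(runtime)
```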
This is not an out-of-VRAM issue: I set mem_fraction to 0.6 and only 10/16 GB is in use on my V100. I'm on torch 2.2.1 (with vLLM custom-built against torch 2.2.1), running LLaVA 1.6 7B with GPTQ 8-bit (yes, vLLM recently merged GPTQ 8-bit support).
The error appears at a random point after running 100-300 samples. Because it is completely random, it doesn't look like a data issue.
I can't get rid of the error no matter what I do.
It only happens when I use run_batch with an effective batch size > 1, so it looks like a race condition somewhere.
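For reference, a minimal sketch of the kind of concurrent run_batch invocation described above (the prompt function, images, and questions are placeholders, not the actual workload; a backend is assumed to be set as in the earlier sketch):

```python
import sglang as sgl

@sgl.function
def describe(s, image_path, question):
    # Placeholder multimodal prompt; the real images and questions differ.
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))

batch_args = [
    {"image_path": f"sample_{i}.jpg", "question": "Describe the image."}
    for i in range(300)
]

# run_batch submits requests concurrently; with num_threads > 1, multiple
# prefill/decode batches are in flight at once, which is when the assert appears.
states = describe.run_batch(batch_args, num_threads=8)
```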
Bump on this. We're running into it too. Would appreciate guidance.
Same problem here, and there are several other places where this happens.
I think it is due to a misconfiguration of the maximum context length in SGLang:
https://github.com/sgl-project/sglang/issues/461#issuecomment-2123974167
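If that is the cause, one possible mitigation is to pin the context length the server assumes instead of relying on the value derived from the checkpoint config. A minimal sketch, assuming Runtime forwards a context_length override to the server args (model path and value are illustrative):

```python
import sglang as sgl

# Cap the context window explicitly so scheduler/radix-cache indices can
# never address KV-cache slots beyond what was actually allocated.
runtime = sgl.Runtime(
    model_path="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative
    context_length=8192,
)
sgl.set_default_backend(runtime)
```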
@m0g1cian this is a different issue. I am hitting the assert at value = torch.concat(value), not in req_to_token, and the model I'm using doesn't have the context-length mismatch issue described there.
I think I had a similar issue. My symptom involved multiple CUDA errors, and I found those errors were fairly consistently tied to extra-long prompts. I had no issues running Qwen 1.5 (32K context length), but once I switched to Llama 3 (8K context length), these errors started showing up.
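One way to test that theory is to count prompt tokens against the model's window before submitting anything. A minimal sketch using the Hugging Face tokenizer (model name and generation headroom are illustrative assumptions):

```python
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative
MAX_CONTEXT = 8192                                # Llama 3 window in this report

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def over_limit(prompt: str, reserve_for_output: int = 512) -> bool:
    # Flag prompts that, together with the generation budget, exceed the window;
    # these are the requests most likely to trip the device-side assert.
    n_tokens = len(tokenizer(prompt).input_ids)
    return n_tokens + reserve_for_output > MAX_CONTEXT
```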
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [4,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [4,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
Exception in ModelRpcClient:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 196, in exposed_step
self.forward_step()
File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 226, in forward_step
self.forward_decode_batch(self.running_batch)
File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 551, in forward_decode_batch
) = self.model_runner.forward(batch, ForwardMode.DECODE)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 452, in forward
return self.forward_decode(batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 407, in forward_decode
input_metadata = InputMetadata.create(
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 191, in create
total_num_tokens = int(torch.sum(seq_lens))
^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception in ModelRpcClient:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 196, in exposed_step
self.forward_step()
File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 226, in forward_step
self.forward_decode_batch(self.running_batch)
File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 551, in forward_decode_batch
) = self.model_runner.forward(batch, ForwardMode.DECODE)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 452, in forward
return self.forward_decode(batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 407, in forward_decode
input_metadata = InputMetadata.create(
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 191, in create
total_num_tokens = int(torch.sum(seq_lens))
^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
INFO: 127.0.0.1:37816 - "POST /generate HTTP/1.1" 200 OK
65%|██████▌ | 13/20 [01:07<00:07, 1.03s/it]
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 1] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f371746d897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f371741db25 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f3717545718 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f3718743e36 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f3718747f38 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f371874d5ac in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f371874e31c in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x1c220 (0x7f3762de7220 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x7ea5 (0x7f376f997ea5 in /usr/lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x7f376efb79fd in /usr/lib64/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 1 Rank 1] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f371746d897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f371741db25 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f3717545718 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f3718743e36 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f3718747f38 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f371874d5ac in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f371874e31c in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x1c220 (0x7f3762de7220 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x7ea5 (0x7f376f997ea5 in /usr/lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x7f376efb79fd in /usr/lib64/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f371746d897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32e33 (0x7f37183d0e33 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x1c220 (0x7f3762de7220 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x7ea5 (0x7f376f997ea5 in /usr/lib64/libpthread.so.0)
frame #4: clone + 0x6d (0x7f376efb79fd in /usr/lib64/libc.so.6)
Exception in ModelRpcClient:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 196, in exposed_step
self.forward_step()
File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 207, in forward_step
new_batch = self.get_new_fill_batch()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 313, in get_new_fill_batch
prefix_indices, last_node = self.tree_cache.match_prefix(req.input_ids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/sglang/srt/managers/router/radix_cache.py", line 54, in match_prefix
value = torch.concat(value)
^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f371746d897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f371741db25 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f3717545718 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f3718743e36 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f3718747f38 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f371874d5ac in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f371874e31c in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x1c220 (0x7f3762de7220 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x7ea5 (0x7f376f997ea5 in /usr/lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x7f376efb79fd in /usr/lib64/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 1 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f371746d897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f371741db25 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f3717545718 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f3718743e36 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f3718747f38 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f371874d5ac in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f371874e31c in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x1c220 (0x7f3762de7220 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x7ea5 (0x7f376f997ea5 in /usr/lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x7f376efb79fd in /usr/lib64/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f371746d897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe32e33 (0x7f37183d0e33 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x1c220 (0x7f3762de7220 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x7ea5 (0x7f376f997ea5 in /usr/lib64/libpthread.so.0)
frame #4: clone + 0x6d (0x7f376efb79fd in /usr/lib64/libc.so.6)
I'm using it at 4x the context length.
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.