[Bug]: Llama-3.2-11B-Vision-Instruct server crashes when asked guided generation
Your current environment
The output of `python collect_env.py`
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.1
Libc version: glibc-2.31
Python version: 3.8.10 (default, Jul 29 2024, 17:02:10) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 550.90.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7763 64-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 1500.000
CPU max MHz: 3529.0520
CPU min MHz: 1500.0000
BogoMIPS: 4899.71
Virtualization: AMD-V
L1d cache: 4 MiB
L1i cache: 4 MiB
L2 cache: 64 MiB
L3 cache: 512 MiB
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.1
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.dev28+gb0298aa8
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-63,128-191 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Model Input Dumps
No response
🐛 Describe the bug
I am serving Llama-3.2-11B-Vision-Instruct on a single A100 80GB GPU with the following command:
nohup vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --port 8000 --api-key qwen2-4e1fbc5e56f7fbE1 --gpu-memory-utilization 0.9 --download_dir /workspace/vllm_models/ --cpu-offload-gb 5000 --swap-space 50 --max-model-len 4096 --max_num_seqs=32 --enforce_eager > llama_vision-output_240930.log 2>&1 &
The server crashes whenever I add the response_format or guided_json parameter to my client.chat.completions.create() call.
D0 Inference
If structured output is not used, there is no crash; with it, the server crashes after roughly 40 seconds.
import asyncio

import nest_asyncio
from openai import AsyncOpenAI

# Allow nested event loops
nest_asyncio.apply()


async def run_batch_requests(img_urls, image_info_prompt):
    client = AsyncOpenAI(
        base_url=llama_v_url,
        api_key=llama_v_api_key,
    )
    model = llama_v_model

    async def single_request(img_url, image_info_prompt):
        messages = [
            # {"role": "system", "content": d0_system_prompt},  # Error: Prompting with images is incompatible with system messages.
            {"role": "user", "content": [
                {"type": "text", "text": image_info_prompt},
                {"type": "image_url", "image_url": {"url": img_url}},
            ]},
        ]
        try:
            response = await client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=1000,
                # response_format=dict(type="json_object"),  # Crashes
                extra_body=dict(guided_json=d0_schema),  # Crashes when applied to the llama model
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"Error: {str(e)}"

    tasks = [single_request(url, image_info_prompt) for url in img_urls]
    results = await asyncio.gather(*tasks)
    return results


# Example usage
img_urls = images

# Run the asynchronous function synchronously
results = asyncio.get_event_loop().run_until_complete(run_batch_requests(img_urls, image_info_prompt))
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Can you provide a runnable client-side script? This code is missing a lot of variables.
@heheda12345 The following is the script I run. If I uncomment "response_format", the server crashes.
import asyncio

import nest_asyncio
from openai import AsyncOpenAI

nest_asyncio.apply()


async def run_batch_requests(img_urls, user_prompt_template):
    client = AsyncOpenAI(
        base_url=[url],
        api_key=[api_key],
    )
    model = "meta-llama/Llama-3.2-11B-Vision-Instruct"

    async def single_request(img_url):
        messages = [
            {"role": "user", "content": [
                {"type": "text", "text": user_prompt_template},
                {"type": "image_url", "image_url": {"url": img_url}},
            ]},
        ]
        try:
            response = await client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=1000,
                temperature=0.1,
                # response_format=dict(type="json_object")
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"Error while processing image {img_url}: {str(e)}"

    tasks = [single_request(url) for url in img_urls]
    results = await asyncio.gather(*tasks)
    return results


# Example usage
img_urls = [
"https://unsplash.com/photos/8manzosDSGM/download?force=true",
"https://unsplash.com/photos/yC-Yzbqy7PY/download?force=true",
"https://unsplash.com/photos/82TpEld0_e4/download?force=true",
"https://unsplash.com/photos/wawEfYdpkag/download?force=true",
"https://unsplash.com/photos/xMSxY4WWQkE/download?force=true",
"https://unsplash.com/photos/hpjSkU2UYSU/download?force=true",
"https://unsplash.com/photos/TkXJoA_sn1w/download?force=true",
"https://unsplash.com/photos/q54Oxq44MZs/download?force=true",
"https://unsplash.com/photos/8mikJ83LmSQ/download?force=true",
"https://unsplash.com/photos/CAm0Ht0rBMw/download?force=true"
]
user_prompt_template = """
[INSTRUCTIONS]
1. **Describe the given image in detail.**
- Provide a comprehensive description of the image, including all relevant details such as objects, scenes, actions, colors, textures, and any other notable elements.
- Be as specific as possible to capture the essence of the image.
2. **Indicate the types and counts of objects appearing in the given image in JSON format.**
- Use the following schema for the JSON output:
```json
{
"image_info": "<Detailed description of the image>",
"object": {
"object_1": count,
"object_2": count,
...
}
}
```
- Replace `<Detailed description of the image>` with the description from step 1.
- List each object type as a key in the `"object"` dictionary, with the corresponding count as the value.
- Ensure that object names are clear and consistent (e.g., "tree", "person", "car").
**[NOTE]**
- Make sure the JSON output is properly formatted and valid.
- Only include objects that are clearly visible and identifiable in the image.
- The counts should be integers representing the number of times each object appears in the image.
- Do not include any additional text outside of the JSON output.
- If unsure about an object, you may include it with a note in the description but avoid listing uncertain objects in the JSON.
**Example Output:**
```json
{
"image_info": "A serene beach scene at sunset with two palm trees silhouetted against the orange sky, gentle waves lapping at the shore, and a small boat anchored near the horizon.",
"object": {
"palm tree": 2,
"boat": 1,
"wave": 5
}
}
```
"""
results = asyncio.get_event_loop().run_until_complete(run_batch_requests(img_urls, user_prompt_template))
results
So the issue is that the Llama-3.2-Vision models have an extra token, <|image|>, with idx 128256 (0-indexed). The scores tensor only covers 128256 entries, i.e. it is one token short of the full vocab. The actual error is an index error here: https://github.com/vllm-project/vllm/blob/22f5851b807376a836eb3551903c7fc6c81eaa9b/vllm/model_executor/guided_decoding/outlines_logits_processors.py#L82
You have a tensor shaped (128256,), but your allowed tokens may include this last, out-of-range token (which I don't believe is intended for generation anyway).
Since I host my own model, I went ahead and filtered out the disallowed token on my side to rectify it. But there is definitely an inconsistency between the Llama config's vocab size and the actual tokenizer vocab, and/or broken behaviour for this special token.
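To make the failure mode concrete, here is a minimal standalone repro (plain PyTorch, not vLLM code; the token ids are illustrative):

```python
import torch

# Logits cover ids 0..128255, but the tokenizer's vocab also contains
# <|image|> with id 128256, which the FSM may list as "allowed".
scores = torch.zeros(128256)
allowed_tokens = [11, 128256]  # 128256 is one past the last valid index

mask = torch.full_like(scores, float("-inf"))
try:
    mask[allowed_tokens] = 0  # IndexError on CPU, device-side assert on CUDA
except IndexError as e:
    print(e)  # index 128256 is out of bounds for dimension 0 with size 128256

# Dropping ids that fall outside the logits tensor avoids the crash:
allowed_tokens = [t for t in allowed_tokens if t < scores.shape[-1]]
mask[allowed_tokens] = 0
```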
@pavlo-ruban
You have a tensor shaped (128256,), but your allowed tokens may include this last illegal token (since I don't believe it is intended for generation). I went on and added disallowed tokens on my side to rectify it, hosting my own model. But there's definitely inconsistency between the llama config and the actual vocab size and/or broken behaviour for this special token.
Thanks to your advice, I added the following lines:
allowed_tokens = [token for token in allowed_tokens if token != 128256]
mask[allowed_tokens] = 0
Is this the correct way to do it? The server no longer crashes, but I still couldn't get structured output.
@miridih-jhkim11 I went with allowed_tokens = [t for t in allowed_tokens if t < scores.shape[-1]]. Comparing against the token id the way you do ran into a CUDA graph problem for me; something failed around self.cuda_graph.capture_end(). Do you mean you are getting a response, but not a structured one, or a structured response with empty values?
So the issue is that llama-3.2-vision models have this extra token <|image|> with idx 128256 (0-indexed). The scores are generated for 128256 (1 token short). The actual error is index error here: https://github.com/vllm-project/vllm/blob/22f5851b807376a836eb3551903c7fc6c81eaa9b/vllm/model_executor/guided_decoding/outlines_logits_processors.py#L82
You have a tensor shaped (128256,), but your allowed tokens may include this last illegal token (since I don't believe it is intended for generation). I went on and added disallowed tokens on my side to rectify it, hosting my own model. But there's definitely inconsistency between the llama config and the actual vocab size and/or broken behaviour for this special token.
Thanks for this, it worked with the following patch:
allowed_tokens = [t for t in allowed_tokens if t < scores.shape[-1]]
mask[allowed_tokens] = 0
I think you need to run with --enforce-eager so that CUDA graphs are not compiled.
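For reference, this is roughly where the workaround slots in. It is a sketch only, paraphrased from the linked outlines_logits_processors.py; the helper name is invented for illustration and is not vLLM's actual API:

```python
import math
from typing import List

import torch


class PatchedGuidedLogitsProcessor:
    """Sketch of a guided-decoding logits processor with the out-of-range filter."""

    def __call__(self, input_ids: List[int], scores: torch.Tensor) -> torch.Tensor:
        # `_allowed_token_ids` stands in for however the FSM/guide exposes the
        # tokens permitted in the current state (name is illustrative).
        allowed_tokens = self._allowed_token_ids(input_ids)

        # Workaround: drop ids that fall outside the logits tensor,
        # e.g. <|image|> (128256) on Llama-3.2-Vision.
        allowed_tokens = [t for t in allowed_tokens if t < scores.shape[-1]]

        mask = torch.full_like(scores, -math.inf)
        mask[allowed_tokens] = 0
        return scores.add_(mask)
```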
I'm having a similar-ish problem with JSON output. When I add "response_format": {"type": "json_object"} to my request, vLLM crashes with this stack trace:
INFO 10-16 13:24:32 engine.py:288] Added request chat-bc566e62dd194175af7fe161cad40248.
Compiling FSM index for all state transitions: 100%|██████████| 3/3 [00:00<00:00, 10.86it/s]
INFO 10-16 13:24:37 metrics.py:351] Avg prompt throughput: 240.2 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.5%, CPU KV cache usage: 0.0%.
Compiling FSM index for all state transitions: 100%|██████████| 7/7 [00:00<00:00, 12.84it/s]
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [7,0,0], thread: [123,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
CRITICAL 10-16 13:24:39 launcher.py:72] AsyncLLMEngine has failed, terminating server process
INFO: 100.81.49.94:64400 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR 10-16 13:24:39 engine.py:157] RuntimeError('CUDA error: device-side assert triggered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n')
ERROR 10-16 13:24:39 engine.py:157] Traceback (most recent call last):
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 155, in start
ERROR 10-16 13:24:39 engine.py:157] self.run_engine_loop()
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 218, in run_engine_loop
ERROR 10-16 13:24:39 engine.py:157] request_outputs = self.engine_step()
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 236, in engine_step
ERROR 10-16 13:24:39 engine.py:157] raise e
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 227, in engine_step
ERROR 10-16 13:24:39 engine.py:157] return self.engine.step()
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1387, in step
ERROR 10-16 13:24:39 engine.py:157] outputs = self.model_executor.execute_model(
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 82, in execute_model
ERROR 10-16 13:24:39 engine.py:157] driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 155, in _driver_execute_model
ERROR 10-16 13:24:39 engine.py:157] return self.driver_worker.execute_model(execute_model_req)
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 10-16 13:24:39 engine.py:157] output = self.model_runner.execute_model(
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-16 13:24:39 engine.py:157] return func(*args, **kwargs)
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/worker/enc_dec_model_runner.py", line 225, in execute_model
ERROR 10-16 13:24:39 engine.py:157] output: SamplerOutput = self.model.sample(
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/models/mllama.py", line 940, in sample
ERROR 10-16 13:24:39 engine.py:157] next_tokens = self.sampler(logits, sampling_metadata)
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-16 13:24:39 engine.py:157] return self._call_impl(*args, **kwargs)
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-16 13:24:39 engine.py:157] return forward_call(*args, **kwargs)
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 274, in forward
ERROR 10-16 13:24:39 engine.py:157] maybe_deferred_sample_results, maybe_sampled_tokens_tensor = _sample(
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 879, in _sample
ERROR 10-16 13:24:39 engine.py:157] return _sample_with_torch(
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 848, in _sample_with_torch
ERROR 10-16 13:24:39 engine.py:157] return get_pythonized_sample_results(
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 713, in get_pythonized_sample_results
ERROR 10-16 13:24:39 engine.py:157] sample_results = _random_sample(seq_groups,
ERROR 10-16 13:24:39 engine.py:157] File "/mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 512, in _random_sample
ERROR 10-16 13:24:39 engine.py:157] random_samples = random_samples.cpu()
ERROR 10-16 13:24:39 engine.py:157] RuntimeError: CUDA error: device-side assert triggered
ERROR 10-16 13:24:39 engine.py:157] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 10-16 13:24:39 engine.py:157] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 10-16 13:24:39 engine.py:157] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 10-16 13:24:39 engine.py:157]
[rank0]:[E1016 13:24:39.286342262 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x14b5e674bf86 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x14b5e66fad10 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x14b5e6826f08 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x14b5e7a433e6 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x14b5e7a48600 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x14b5e7a4f2ba in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x14b5e7a516fc in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3b55 (0x14b6351f9b55 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x14b6368d2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x14b63669d353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x14b5e674bf86 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x14b5e66fad10 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x14b5e6826f08 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x14b5e7a433e6 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x14b5e7a48600 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x14b5e7a4f2ba in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x14b5e7a516fc in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3b55 (0x14b6351f9b55 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x14b6368d2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x14b63669d353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x14b5e674bf86 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x14b5e76daa84 in /mnt/localssd/milad/uv_cache/archive-v0/Vnhs10v8CWvRMhV-nlfmC/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3b55 (0x14b6351f9b55 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x14b6368d2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x14b63669d353 in /lib/x86_64-linux-gnu/libc.so.6)
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [2860211]
Does this ring any bells?
I'm using vllm-0.6.3.dev155+gf3a507f1.d20241010-cp38-abi3-manylinux1_x86_64.whl btw
Thanks @pavlo-ruban for finding the root cause and providing a fix! I just created a quick PR with your fix so that we can get this working without manual patching: https://github.com/vllm-project/vllm/pull/9631.
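Once the fix lands, a minimal smoke test along these lines should return valid JSON instead of killing the server (the endpoint, API key, and image URL below are placeholders, not values from this issue):

```python
from openai import OpenAI

# Placeholders: point these at your own vLLM server and a reachable image.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Describe this image and return the result as JSON."},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
    ]}],
    max_tokens=256,
    response_format={"type": "json_object"},  # the parameter that previously crashed the server
)
print(resp.choices[0].message.content)
```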