[Bug]: Docker vLLM 0.9.1 CUDA error: an illegal memory access, sampled_token_ids.tolist()
Your current environment
Docker on 4 x A100 SXM. BTW: vLLM 0.8.4 ran stably with the same setup. 0.9.0.1 was already unstable (it restarted a few times a day); 0.9.1 is even worse.
services:
  vllm-qwen25-72b:
    image: vllm/vllm-openai:v0.9.1
    container_name: vllm-qwen25-72b
    environment:
      ...
      - HF_TOKEN=$HF_TOKEN
      - VLLM_NO_USAGE_STATS=1
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '1', '2', '3']
              capabilities: [ gpu ]
    network_mode: host
    volumes:
      - /mnt/sda/huggingface:/root/.cache/huggingface
      - .:/opt/vllm
    command:
      - --port=8000
      - --disable-log-requests
      - --model=Qwen/Qwen2.5-72B-Instruct
      # - --served-model-name=Qwen/Qwen2.5-72B-Instruct
      # - --max-model-len=32768
      - --tensor-parallel-size=4
      - --gpu-memory-utilization=0.90
      - --swap-space=5
    restart: unless-stopped
🐛 Describe the bug
See the full log below.
vLLM 0.9.1 crashes frequently with Qwen 2.5 on 4x A100 SXM.
(0.9.0.1 also crashed with "CUDA error: an illegal memory access was encountered", but much less frequently and without a clear hint as to what went wrong. 0.8.4 ran stably.)
I don't have a single example request that triggers it; we send a mix of normal and guided-JSON sampling requests.
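For context, the guided requests are ordinary chat completions against the OpenAI-compatible endpoint with a JSON schema passed through vLLM's guided decoding. A rough, illustrative sketch only (placeholder prompt and schema, not a request known to trigger the crash):

# Illustrative request shape only; prompt and schema are placeholders,
# not a request that is known to reproduce the crash.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[{"role": "user", "content": "Answer as JSON."}],
    extra_body={"guided_json": schema},  # guided JSON sampling via vLLM
)
print(resp.choices[0].message.content)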
Could this be the main problem?
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] valid_sampled_token_ids = sampled_token_ids.tolist()
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] RuntimeError: CUDA error: an illegal memory access was encountered
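(Side note on the excerpt above: `sampled_token_ids.tolist()` copies from GPU to host and therefore synchronizes the device, so it is usually just the point where an earlier asynchronous kernel error surfaces rather than the faulting op itself; the log's own hint about asynchronous reporting says the same. A tiny standalone sketch of that synchronization behaviour, unrelated to vLLM internals and assuming a CUDA-capable machine:)

# Standalone sketch: .tolist() forces a GPU sync, so asynchronous kernel
# errors from earlier launches are reported at this call.
import time
import torch

x = torch.randn(8192, 8192, device="cuda")

t0 = time.time()
y = x @ x                      # kernel is launched asynchronously
ids = torch.argmax(y, dim=-1)  # still asynchronous
print(f"launches returned after {time.time() - t0:.4f} s")

t0 = time.time()
ids.tolist()                   # device-to-host copy waits for all prior GPU work
print(f".tolist() returned after {time.time() - t0:.4f} s")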
Full log:
[rank0]:[E611 01:51:09.940883637 ProcessGroupNCCL.cpp:1896] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fa563f785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7fa563f0d4a2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7fa564365422 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fa4f3c8b456 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7fa4f3c9b6f0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7fa4f3c9d282 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fa4f3c9ee8d in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7fa4e3fb3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7fa564c42ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fa564cd3a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] WorkerProc hit an exception.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] Traceback (most recent call last):
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 522, in worker_busy_loop
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] output = func(*args, **kwargs)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] return func(*args, **kwargs)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 293, in execute_model
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] return func(*args, **kwargs)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1374, in execute_model
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] valid_sampled_token_ids = sampled_token_ids.tolist()
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] RuntimeError: CUDA error: an illegal memory access was encountered
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] Traceback (most recent call last):
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 522, in worker_busy_loop
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] output = func(*args, **kwargs)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] return func(*args, **kwargs)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 293, in execute_model
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] output = self.model_runner.execute_model(scheduler_output,
what(): [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] return func(*args, **kwargs)
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1374, in execute_model
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] valid_sampled_token_ids = sampled_token_ids.tolist()
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fa563f785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] RuntimeError: CUDA error: an illegal memory access was encountered
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7fa563f0d4a2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7fa564365422 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fa4f3c8b456 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7fa4f3c9b6f0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7fa4f3c9d282 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fa4f3c9ee8d in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7fa4e3fb3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7fa564c42ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fa564cd3a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1902 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fa563f785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xcc7a4e (0x7fa4f3c6da4e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x9165ed (0x7fa4f38bc5ed in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdc253 (0x7fa4e3fb3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: <unknown function> + 0x94ac3 (0x7fa564c42ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: clone + 0x44 (0x7fa564cd3a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]
ERROR 06-11 01:51:09 [dump_input.py:69] Dumping input data
ERROR 06-11 01:51:09 [dump_input.py:71] V1 LLM engine (v0.9.1) with config: model='/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-72B-Instruct/snapshots/d3d951150c1e5848237cd6a7ad11df4836aee842/', speculative_config=None, tokenizer='/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-72B-Instruct/snapshots/d3d951150c1e5848237cd6a7ad11df4836aee842/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2.5-72B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null},
ERROR 06-11 01:51:09 [dump_input.py:79] Dumping scheduler output for model execution:
ERROR 06-11 01:51:09 [dump_input.py:80] SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=[CachedRequestData(req_id='chatcmpl-f8adb07fdf2e41e69e9be99f4f9cc7eb', resumed_from_preemption=false, new_token_ids=[374], new_block_ids=[[]], num_computed_tokens=187), CachedRequestData(req_id='chatcmpl-a0b407784af14747b4a9af20d4d69829', resumed_from_preemption=false, new_token_ids=[330], new_block_ids=[[]], num_computed_tokens=2184), CachedRequestData(req_id='chatcmpl-d52cc47002544eaa97785872789929c8', resumed_from_preemption=false, new_token_ids=[330], new_block_ids=[[]], num_computed_tokens=9828), CachedRequestData(req_id='chatcmpl-6505bbb4a369474fb64b00f9e8e36de7', resumed_from_preemption=false, new_token_ids=[1008], new_block_ids=[[]], num_computed_tokens=66), CachedRequestData(req_id='chatcmpl-835ba60f60fe4171b7cc74141ca68a31', resumed_from_preemption=false, new_token_ids=[1008], new_block_ids=[[]], num_computed_tokens=66)], num_scheduled_tokens={chatcmpl-835ba60f60fe4171b7cc74141ca68a31: 1, chatcmpl-6505bbb4a369474fb64b00f9e8e36de7: 1, chatcmpl-a0b407784af14747b4a9af20d4d69829: 1, chatcmpl-f8adb07fdf2e41e69e9be99f4f9cc7eb: 1, chatcmpl-d52cc47002544eaa97785872789929c8: 1}, total_num_scheduled_tokens=5, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0], finished_req_ids=[], free_encoder_input_ids=[], structured_output_request_ids={chatcmpl-a0b407784af14747b4a9af20d4d69829: 1, chatcmpl-d52cc47002544eaa97785872789929c8: 2}, grammar_bitmask=array([[ 0, 0, 2, ..., 0, 0, 0],
ERROR 06-11 01:51:09 [dump_input.py:80] [ 0, 1507336, 0, ..., 0, 0, 0]],
ERROR 06-11 01:51:09 [dump_input.py:80] shape=(2, 4752), dtype=int32), kv_connector_metadata=null)
ERROR 06-11 01:51:09 [dump_input.py:82] SchedulerStats(num_running_reqs=5, num_waiting_reqs=0, gpu_cache_usage=0.026913812964708295, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0), spec_decoding_stats=None)
ERROR 06-11 01:51:09 [core.py:517] EngineCore encountered a fatal error.
ERROR 06-11 01:51:09 [core.py:517] Traceback (most recent call last):
ERROR 06-11 01:51:09 [core.py:517] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 508, in run_engine_core
ERROR 06-11 01:51:09 [core.py:517] engine_core.run_busy_loop()
ERROR 06-11 01:51:09 [core.py:517] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 535, in run_busy_loop
ERROR 06-11 01:51:09 [core.py:517] self._process_engine_step()
ERROR 06-11 01:51:09 [core.py:517] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 560, in _process_engine_step
ERROR 06-11 01:51:09 [core.py:517] outputs, model_executed = self.step_fn()
ERROR 06-11 01:51:09 [core.py:517] ^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [core.py:517] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 231, in step
ERROR 06-11 01:51:09 [core.py:517] model_output = self.execute_model(scheduler_output)
ERROR 06-11 01:51:09 [core.py:517] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [core.py:517] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 217, in execute_model
ERROR 06-11 01:51:09 [core.py:517] raise err
ERROR 06-11 01:51:09 [core.py:517] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 211, in execute_model
ERROR 06-11 01:51:09 [core.py:517] return self.model_executor.execute_model(scheduler_output)
ERROR 06-11 01:51:09 [core.py:517] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [core.py:517] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 163, in execute_model
ERROR 06-11 01:51:09 [core.py:517] (output, ) = self.collective_rpc("execute_model",
ERROR 06-11 01:51:09 [core.py:517] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [core.py:517] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 220, in collective_rpc
ERROR 06-11 01:51:09 [core.py:517] result = get_response(w, dequeue_timeout)
ERROR 06-11 01:51:09 [core.py:517] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [core.py:517] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 207, in get_response
ERROR 06-11 01:51:09 [core.py:517] raise RuntimeError(
ERROR 06-11 01:51:09 [core.py:517] RuntimeError: Worker failed with error 'CUDA error: an illegal memory access was encountered
ERROR 06-11 01:51:09 [core.py:517] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 06-11 01:51:09 [core.py:517] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 06-11 01:51:09 [core.py:517] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 06-11 01:51:09 [core.py:517] ', please check the stack trace above for the root cause
ERROR 06-11 01:51:09 [async_llm.py:420] AsyncLLM output_handler failed.
ERROR 06-11 01:51:09 [async_llm.py:420] Traceback (most recent call last):
ERROR 06-11 01:51:09 [async_llm.py:420] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 379, in output_handler
ERROR 06-11 01:51:09 [async_llm.py:420] outputs = await engine_core.get_output_async()
ERROR 06-11 01:51:09 [async_llm.py:420] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [async_llm.py:420] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 790, in get_output_async
ERROR 06-11 01:51:09 [async_llm.py:420] raise self._format_exception(outputs) from None
ERROR 06-11 01:51:09 [async_llm.py:420] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
ERROR 06-11 01:51:09 [serving_chat.py:911] Error in chat completion stream generator.
ERROR 06-11 01:51:09 [serving_chat.py:911] Traceback (most recent call last):
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 481, in chat_completion_stream_generator
ERROR 06-11 01:51:09 [serving_chat.py:911] async for res in result_generator:
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 327, in generate
ERROR 06-11 01:51:09 [serving_chat.py:911] out = q.get_nowait() or await q.get()
ERROR 06-11 01:51:09 [serving_chat.py:911] ^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 52, in get
ERROR 06-11 01:51:09 [serving_chat.py:911] raise output
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/utils.py", line 100, in wrapper
ERROR 06-11 01:51:09 [serving_chat.py:911] return await func(*args, **kwargs)
ERROR 06-11 01:51:09 [serving_chat.py:911] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 554, in create_chat_completion
ERROR 06-11 01:51:09 [serving_chat.py:911] generator = await handler.create_chat_completion(request, raw_request)
ERROR 06-11 01:51:09 [serving_chat.py:911] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 268, in create_chat_completion
ERROR 06-11 01:51:09 [serving_chat.py:911] return await self.chat_completion_full_generator(
ERROR 06-11 01:51:09 [serving_chat.py:911] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 932, in chat_completion_full_generator
ERROR 06-11 01:51:09 [serving_chat.py:911] async for res in result_generator:
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 327, in generate
ERROR 06-11 01:51:09 [serving_chat.py:911] out = q.get_nowait() or await q.get()
ERROR 06-11 01:51:09 [serving_chat.py:911] ^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 52, in get
ERROR 06-11 01:51:09 [serving_chat.py:911] raise output
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 379, in output_handler
ERROR 06-11 01:51:09 [serving_chat.py:911] outputs = await engine_core.get_output_async()
ERROR 06-11 01:51:09 [serving_chat.py:911] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 790, in get_output_async
ERROR 06-11 01:51:09 [serving_chat.py:911] raise self._format_exception(outputs) from None
ERROR 06-11 01:51:09 [serving_chat.py:911] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
ERROR 06-11 01:51:09 [serving_chat.py:911] Error in chat completion stream generator.
ERROR 06-11 01:51:09 [serving_chat.py:911] Traceback (most recent call last):
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 481, in chat_completion_stream_generator
ERROR 06-11 01:51:09 [serving_chat.py:911] async for res in result_generator:
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 327, in generate
ERROR 06-11 01:51:09 [serving_chat.py:911] out = q.get_nowait() or await q.get()
ERROR 06-11 01:51:09 [serving_chat.py:911] ^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 52, in get
ERROR 06-11 01:51:09 [serving_chat.py:911] raise output
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 481, in chat_completion_stream_generator
ERROR 06-11 01:51:09 [serving_chat.py:911] async for res in result_generator:
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 327, in generate
ERROR 06-11 01:51:09 [serving_chat.py:911] out = q.get_nowait() or await q.get()
ERROR 06-11 01:51:09 [serving_chat.py:911] ^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 52, in get
ERROR 06-11 01:51:09 [serving_chat.py:911] raise output
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/utils.py", line 100, in wrapper
ERROR 06-11 01:51:09 [serving_chat.py:911] return await func(*args, **kwargs)
ERROR 06-11 01:51:09 [serving_chat.py:911] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 554, in create_chat_completion
ERROR 06-11 01:51:09 [serving_chat.py:911] generator = await handler.create_chat_completion(request, raw_request)
ERROR 06-11 01:51:09 [serving_chat.py:911] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 268, in create_chat_completion
ERROR 06-11 01:51:09 [serving_chat.py:911] return await self.chat_completion_full_generator(
ERROR 06-11 01:51:09 [serving_chat.py:911] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 932, in chat_completion_full_generator
ERROR 06-11 01:51:09 [serving_chat.py:911] async for res in result_generator:
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 327, in generate
ERROR 06-11 01:51:09 [serving_chat.py:911] out = q.get_nowait() or await q.get()
ERROR 06-11 01:51:09 [serving_chat.py:911] ^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 52, in get
ERROR 06-11 01:51:09 [serving_chat.py:911] raise output
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 379, in output_handler
ERROR 06-11 01:51:09 [serving_chat.py:911] outputs = await engine_core.get_output_async()
ERROR 06-11 01:51:09 [serving_chat.py:911] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 790, in get_output_async
ERROR 06-11 01:51:09 [serving_chat.py:911] raise self._format_exception(outputs) from None
ERROR 06-11 01:51:09 [serving_chat.py:911] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO: 127.0.0.1:46320 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: 172.19.103.111:36678 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: 172.19.103.111:57278 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [1]
[rank2]:[W611 01:51:09.369133081 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=86, addr=[localhost]:37972, remote=[localhost]:59835): failed to recv, got 0 bytes
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f86bc1785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x7f86a023cafe in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baae40 (0x7f86a023ee40 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5bab74a (0x7f86a023f74a in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x2a9 (0x7f86a02391a9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x379 (0x7f864be99989 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7f863c1b3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7f86bce6cac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7f86bcefda04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[W611 01:51:09.374069347 ProcessGroupNCCL.cpp:1659] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
[rank3]:[W611 01:51:09.424776342 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=86, addr=[localhost]:37988, remote=[localhost]:59835): failed to recv, got 0 bytes
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fc675f785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x7fc65a03cafe in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baae40 (0x7fc65a03ee40 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5bab74a (0x7fc65a03f74a in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x2a9 (0x7fc65a0391a9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x379 (0x7fc605c99989 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7fc5f5fb3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7fc676ce9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7fc676d7aa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank3]:[W611 01:51:09.429163290 ProcessGroupNCCL.cpp:1659] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
[rank1]:[W611 01:51:09.436189340 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=86, addr=[localhost]:38002, remote=[localhost]:59835): failed to recv, got 0 bytes
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7efce171e5e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x7efd3683cafe in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baae40 (0x7efd3683ee40 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5bab74a (0x7efd3683f74a in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x2a9 (0x7efd368391a9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x379 (0x7efce2499989 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7efcd27b3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7efd53341ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7efd533d2a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[W611 01:51:09.440752408 ProcessGroupNCCL.cpp:1659] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
nanobind: leaked 4 instances!
- leaked instance 0x7efc44398258 of type "xgrammar.xgrammar_bindings.GrammarMatcher"
- leaked instance 0x7efc442df2e8 of type "xgrammar.xgrammar_bindings.CompiledGrammar"
- leaked instance 0x7efc4438b798 of type "xgrammar.xgrammar_bindings.CompiledGrammar"
- leaked instance 0x7efc44396718 of type "xgrammar.xgrammar_bindings.GrammarMatcher"
nanobind: leaked 2 types!
- leaked type "xgrammar.xgrammar_bindings.GrammarMatcher"
- leaked type "xgrammar.xgrammar_bindings.CompiledGrammar"
nanobind: leaked 13 functions!
- leaked function "fill_next_token_bitmask"
- leaked function "rollback"
- leaked function "__init__"
- leaked function ""
- leaked function ""
- leaked function ""
- leaked function ""
- leaked function ""
- leaked function "find_jump_forward_string"
- leaked function "reset"
- leaked function "_debug_accept_string"
- leaked function "is_terminated"
- leaked function "accept_token"
nanobind: this is likely caused by a reference counting issue in the binding code.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Could you try to reproduce the failure with `export CUDA_LAUNCH_BLOCKING=1`?
We set the suggested environment variable (`CUDA_LAUNCH_BLOCKING=1`) in the container environment. Here is the resulting log:
INFO 06-12 01:21:11 [logger.py:43] Received request chatcmpl-7027be498a4646fbbacc26d98d1c3047: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nThere are 9 birds in the tree, the hunter shoots one, how many birds are left in the tree?<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.2, repetition_penalty=1.05, temperature=0.2, top_p=0.1, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32716, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-12 01:21:11 [async_llm.py:271] Added request chatcmpl-7027be498a4646fbbacc26d98d1c3047.
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] WorkerProc hit an exception.
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] Traceback (most recent call last):
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 522, in worker_busy_loop
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] output = func(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return func(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 293, in execute_model
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return func(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1309, in execute_model
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] sampler_output = self.sampler(
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return self._call_impl(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return forward_call(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/sampler.py", line 52, in forward
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] sampled = self.sample(logits, sampling_metadata)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/sampler.py", line 118, in sample
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] random_sampled = self.topk_topp_sampler(
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return self._call_impl(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return forward_call(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 104, in forward_cuda
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return flashinfer_sample(logits, k, p, generators)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 290, in flashinfer_sample
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] next_token_ids = flashinfer.sampling.top_k_top_p_sampling_from_logits(
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/flashinfer/sampling.py", line 902, in top_k_top_p_sampling_from_logits
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return top_p_sampling_from_probs(
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/flashinfer/sampling.py", line 642, in top_p_sampling_from_probs
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return get_sampling_module().top_p_sampling_from_probs(
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/flashinfer/sampling.py", line 130, in top_p_sampling_from_probs
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] module.top_p_sampling_from_probs.default(
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 756, in __call__
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return self._op(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] RuntimeError: TopPSamplingFromProbs failed with error code an illegal memory access was encountered
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] Traceback (most recent call last):
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 522, in worker_busy_loop
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] output = func(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return func(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 293, in execute_model
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return func(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1309, in execute_model
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] sampler_output = self.sampler(
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return self._call_impl(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return forward_call(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/sampler.py", line 52, in forward
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] sampled = self.sample(logits, sampling_metadata)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/sampler.py", line 118, in sample
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] random_sampled = self.topk_topp_sampler(
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return self._call_impl(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return forward_call(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 104, in forward_cuda
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return flashinfer_sample(logits, k, p, generators)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 290, in flashinfer_sample
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] next_token_ids = flashinfer.sampling.top_k_top_p_sampling_from_logits(
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/flashinfer/sampling.py", line 902, in top_k_top_p_sampling_from_logits
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return top_p_sampling_from_probs(
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/flashinfer/sampling.py", line 642, in top_p_sampling_from_probs
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return get_sampling_module().top_p_sampling_from_probs(
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/flashinfer/sampling.py", line 130, in top_p_sampling_from_probs
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] module.top_p_sampling_from_probs.default(
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 756, in __call__
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return self._op(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] RuntimeError: TopPSamplingFromProbs failed with error code an illegal memory access was encountered
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527]
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] WorkerProc hit an exception.
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] Traceback (most recent call last):
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 522, in worker_busy_loop
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] output = func(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return func(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 293, in execute_model
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return func(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1187, in execute_model
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] self._prepare_inputs(scheduler_output))
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 563, in _prepare_inputs
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] self.input_batch.block_table.commit(num_reqs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/block_table.py", line 134, in commit
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] block_table.commit(num_reqs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/block_table.py", line 83, in commit
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] self.block_table[:num_reqs].copy_(self.block_table_cpu[:num_reqs],
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] RuntimeError: CUDA error: an illegal memory access was encountered
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527]
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] Traceback (most recent call last):
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 522, in worker_busy_loop
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] output = func(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return func(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 293, in execute_model
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] return func(*args, **kwargs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1187, in execute_model
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] self._prepare_inputs(scheduler_output))
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 563, in _prepare_inputs
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] self.input_batch.block_table.commit(num_reqs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/block_table.py", line 134, in commit
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] block_table.commit(num_reqs)
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/block_table.py", line 83, in commit
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] self.block_table[:num_reqs].copy_(self.block_table_cpu[:num_reqs],
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] RuntimeError: CUDA error: an illegal memory access was encountered
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527]
(VllmWorker rank=3 pid=229) ERROR 06-12 01:21:13 [multiproc_executor.py:527]
INFO 06-12 01:21:16 [async_llm.py:432] Aborted request chatcmpl-7027be498a4646fbbacc26d98d1c3047.
INFO 06-12 01:21:16 [async_llm.py:340] Request chatcmpl-7027be498a4646fbbacc26d98d1c3047 aborted.
INFO 06-12 01:21:17 [loggers.py:118] Engine 000: Avg prompt throughput: 5.2 tokens/s, Avg generation throughput: 13.0 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.5%, Prefix cache hit rate: 32.3%
INFO 06-12 01:21:27 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.5%, Prefix cache hit rate: 32.3%
INFO 06-12 01:21:29 [logger.py:43] Received request chatcmpl-b56e06a213a748de884b13b5511e6c41: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nThere are 9 birds in the tree, the hunter shoots one, how many birds are left in the tree?<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.2, repetition_penalty=1.05, temperature=0.2, top_p=0.1, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32716, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-12 01:21:29 [async_llm.py:271] Added request chatcmpl-b56e06a213a748de884b13b5511e6c41.
INFO: 192.168.230.103:54274 - GET /metrics HTTP/1.1 200 OK
INFO 06-12 01:21:34 [async_llm.py:432] Aborted request chatcmpl-b56e06a213a748de884b13b5511e6c41.
INFO 06-12 01:21:34 [async_llm.py:340] Request chatcmpl-b56e06a213a748de884b13b5511e6c41 aborted.
INFO 06-12 01:21:45 [logger.py:43] Received request chatcmpl-d6f5c98645d24f44b84f720e3f68df23: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nhealth check<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32737, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-12 01:21:45 [async_llm.py:271] Added request chatcmpl-d6f5c98645d24f44b84f720e3f68df23.
INFO: 192.168.230.103:54300 - GET /metrics HTTP/1.1 200 OK
INFO 06-12 01:22:16 [logger.py:43] Received request chatcmpl-083b573d914647cd93dcdf9f9b943bb0: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nThere are 9 birds in the tree, the hunter shoots one, how many birds are left in the tree?<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.2, repetition_penalty=1.05, temperature=0.2, top_p=0.1, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32716, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-12 01:22:16 [async_llm.py:271] Added request chatcmpl-083b573d914647cd93dcdf9f9b943bb0.
INFO 06-12 01:22:21 [async_llm.py:432] Aborted request chatcmpl-083b573d914647cd93dcdf9f9b943bb0.
INFO 06-12 01:22:21 [async_llm.py:340] Request chatcmpl-083b573d914647cd93dcdf9f9b943bb0 aborted.
INFO: 192.168.230.103:54328 - GET /metrics HTTP/1.1 200 OK
INFO: 192.168.230.103:54354 - GET /metrics HTTP/1.1 200 OK
INFO 06-12 01:23:13 [async_llm.py:432] Aborted request chatcmpl-cba5dce3fdac4061890e5b38eeaf43d6.
INFO 06-12 01:23:13 [async_llm.py:340] Request chatcmpl-cba5dce3fdac4061890e5b38eeaf43d6 aborted.
INFO 06-12 01:23:15 [logger.py:43] Received request chatcmpl-963b7813933f4b428e4d971e5a07c64f: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nhealth check<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32737, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-12 01:23:15 [async_llm.py:271] Added request chatcmpl-963b7813933f4b428e4d971e5a07c64f.
INFO 06-12 01:23:21 [logger.py:43] Received request chatcmpl-532c8809339f416b9e92c59a476bbf44: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nThere are 9 birds in the tree, the hunter shoots one, how many birds are left in the tree?<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.2, repetition_penalty=1.05, temperature=0.2, top_p=0.1, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32716, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-12 01:23:21 [async_llm.py:271] Added request chatcmpl-532c8809339f416b9e92c59a476bbf44.
INFO 06-12 01:23:26 [async_llm.py:432] Aborted request chatcmpl-532c8809339f416b9e92c59a476bbf44.
INFO 06-12 01:23:26 [async_llm.py:340] Request chatcmpl-532c8809339f416b9e92c59a476bbf44 aborted.
INFO: 192.168.230.103:54398 - GET /metrics HTTP/1.1 200 OK
INFO 06-12 01:23:35 [logger.py:43] Received request chatcmpl-d8021d610503400f8660451018cff61b: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nThere are 9 birds in the tree, the hunter shoots one, how many birds are left in the tree?<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.2, repetition_penalty=1.05, temperature=0.2, top_p=0.1, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32716, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-12 01:23:35 [async_llm.py:271] Added request chatcmpl-d8021d610503400f8660451018cff61b.
INFO 06-12 01:23:40 [async_llm.py:432] Aborted request chatcmpl-d8021d610503400f8660451018cff61b.
INFO 06-12 01:23:40 [async_llm.py:340] Request chatcmpl-d8021d610503400f8660451018cff61b aborted.
INFO: 192.168.230.103:54434 - GET /metrics HTTP/1.1 200 OK
INFO 06-12 01:24:01 [launcher.py:80] Shutting down FastAPI HTTP server.
ERROR 06-12 01:24:01 [dump_input.py:69] Dumping input data
ERROR 06-12 01:24:01 [dump_input.py:71] V1 LLM engine (v0.9.1) with config: model='/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-72B-Instruct/snapshots/d3d951150c1e5848237cd6a7ad11df4836aee842/', speculative_config=None, tokenizer='/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-72B-Instruct/snapshots/d3d951150c1e5848237cd6a7ad11df4836aee842/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2.5-72B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={level:3,debug_dump_path:,cache_dir:,backend:,custom_ops:[none],splitting_ops:[vllm.unified_attention,vllm.unified_attention_with_output],use_inductor:true,compile_sizes:[],inductor_compile_config:{enable_auto_functionalized_v2:false},inductor_passes:{},use_cudagraph:true,cudagraph_num_of_warmups:1,cudagraph_capture_sizes:[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],cudagraph_copy_inputs:false,full_cuda_graph:false,max_capture_size:512,local_cache_dir:null},
ERROR 06-12 01:24:01 [dump_input.py:79] Dumping scheduler output for model execution:
ERROR 06-12 01:24:01 [dump_input.py:80] SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=[CachedRequestData(req_id='chatcmpl-cba5dce3fdac4061890e5b38eeaf43d6', resumed_from_preemption=false, new_token_ids=[330], new_block_ids=[[]], num_computed_tokens=6700), CachedRequestData(req_id='chatcmpl-7027be498a4646fbbacc26d98d1c3047', resumed_from_preemption=false, new_token_ids=[19654], new_block_ids=[[]], num_computed_tokens=84)], num_scheduled_tokens={chatcmpl-cba5dce3fdac4061890e5b38eeaf43d6: 1, chatcmpl-7027be498a4646fbbacc26d98d1c3047: 1}, total_num_scheduled_tokens=2, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0], finished_req_ids=[], free_encoder_input_ids=[], structured_output_request_ids={chatcmpl-cba5dce3fdac4061890e5b38eeaf43d6: 0}, grammar_bitmask=array([[ 0, 1507336, 0, ..., 0, 0, 0]],
ERROR 06-12 01:24:01 [dump_input.py:80] shape=(1, 4752), dtype=int32), kv_connector_metadata=null)
ERROR 06-12 01:24:01 [dump_input.py:82] SchedulerStats(num_running_reqs=2, num_waiting_reqs=0, gpu_cache_usage=0.014870148003351069, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0), spec_decoding_stats=None)
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:594 'an illegal memory access was encountered'
INFO: Shutting down
INFO: Waiting for connections to close. (CTRL+C to force quit)
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 4 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
INFO 06-12 01:24:15 [__init__.py:244] Automatically detected platform cuda.
INFO 06-12 01:24:17 [api_server.py:1287] vLLM API server version 0.9.1
INFO 06-12 01:24:17 [cli_args.py:309] non-default args: {'model': '/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-72B-Instruct/snapshots/d3d951150c1e5848237cd6a7ad11df4836aee842/', 'served_model_name': ['Qwen/Qwen2.5-72B-Instruct'], 'tensor_parallel_size': 4, 'swap_space': 10.0}
INFO 06-12 01:24:23 [config.py:823] This model supports multiple tasks: {'generate', 'score', 'embed', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 06-12 01:24:23 [config.py:1946] Defaulting to use mp for distributed inference
INFO 06-12 01:24:23 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 06-12 01:24:25 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 06-12 01:24:27 [__init__.py:244] Automatically detected platform cuda.
INFO 06-12 01:24:29 [core.py:455] Waiting for init message from front-end.
INFO 06-12 01:24:29 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-72B-Instruct/snapshots/d3d951150c1e5848237cd6a7ad11df4836aee842/', speculative_config=None, tokenizer='/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-72B-Instruct/snapshots/d3d951150c1e5848237cd6a7ad11df4836aee842/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2.5-72B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={level:3,debug_dump_path:,cache_dir:,backend:,custom_ops:[none],splitting_ops:[vllm.unified_attention,vllm.unified_attention_with_output],use_inductor:true,compile_sizes:[],inductor_compile_config:{enable_auto_functionalized_v2:false},inductor_passes:{},use_cudagraph:true,cudagraph_num_of_warmups:1,cudagraph_capture_sizes:[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],cudagraph_copy_inputs:false,full_cuda_graph:false,max_capture_size:512,local_cache_dir:null}
WARNING 06-12 01:24:29 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 06-12 01:24:29 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_25936080'), local_subscribe_addr='ipc:///tmp/9e8862f9-69b3-41ed-92d9-fed62d83a90c', remote_subscribe_addr=None, remote_addr_ipv6=False)
WARNING 06-12 01:24:31 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
WARNING 06-12 01:24:31 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
WARNING 06-12 01:24:31 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
WARNING 06-12 01:24:31 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 06-12 01:24:32 [__init__.py:244] Automatically detected platform cuda.
INFO 06-12 01:24:32 [__init__.py:244] Automatically detected platform cuda.
INFO 06-12 01:24:32 [__init__.py:244] Automatically detected platform cuda.
INFO 06-12 01:24:33 [__init__.py:244] Automatically detected platform cuda.
WARNING 06-12 01:24:37 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f9c44a1bec0>
WARNING 06-12 01:24:37 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f46386e5760>
WARNING 06-12 01:24:37 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f2f92a2a750>
WARNING 06-12 01:24:37 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f14d0d9a450>
(VllmWorker rank=1 pid=227) INFO 06-12 01:24:37 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_390834ee'), local_subscribe_addr='ipc:///tmp/f680ecf8-5942-490d-91b4-c9f5d3c56d6b', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=3 pid=229) INFO 06-12 01:24:37 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_08ae41d3'), local_subscribe_addr='ipc:///tmp/499d45d3-e1c8-45af-a60c-caf57f59cd83', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=226) INFO 06-12 01:24:37 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_1c69cebe'), local_subscribe_addr='ipc:///tmp/97b6cd64-858b-4c0b-8eca-d7da0208c85a', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=2 pid=228) INFO 06-12 01:24:37 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_92c1fb48'), local_subscribe_addr='ipc:///tmp/9ecb42fd-7fa5-4255-bd96-da13173eb1a3', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=227) INFO 06-12 01:24:38 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=229) INFO 06-12 01:24:38 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=227) INFO 06-12 01:24:38 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=3 pid=229) INFO 06-12 01:24:38 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=2 pid=228) INFO 06-12 01:24:38 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=226) INFO 06-12 01:24:38 [utils.py:1126] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=228) INFO 06-12 01:24:38 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=0 pid=226) INFO 06-12 01:24:38 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=2 pid=228) INFO 06-12 01:24:38 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=3 pid=229) INFO 06-12 01:24:38 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=1 pid=227) INFO 06-12 01:24:38 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=0 pid=226) INFO 06-12 01:24:38 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
(VllmWorker rank=0 pid=226) INFO 06-12 01:24:38 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_fcb5e00a'), local_subscribe_addr='ipc:///tmp/d222a8a0-a702-4fae-bb44-25a8c661e4e4', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=227) INFO 06-12 01:24:38 [parallel_state.py:1065] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker rank=3 pid=229) INFO 06-12 01:24:38 [parallel_state.py:1065] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
(VllmWorker rank=0 pid=226) INFO 06-12 01:24:38 [parallel_state.py:1065] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker rank=2 pid=228) INFO 06-12 01:24:38 [parallel_state.py:1065] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
(VllmWorker rank=1 pid=227) INFO 06-12 01:24:38 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker rank=0 pid=226) INFO 06-12 01:24:38 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker rank=3 pid=229) INFO 06-12 01:24:38 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker rank=2 pid=228) INFO 06-12 01:24:38 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker rank=0 pid=226) INFO 06-12 01:24:38 [gpu_model_runner.py:1595] Starting to load model /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-72B-Instruct/snapshots/d3d951150c1e5848237cd6a7ad11df4836aee842/...
(VllmWorker rank=1 pid=227) INFO 06-12 01:24:38 [gpu_model_runner.py:1595] Starting to load model /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-72B-Instruct/snapshots/d3d951150c1e5848237cd6a7ad11df4836aee842/...
(VllmWorker rank=2 pid=228) INFO 06-12 01:24:38 [gpu_model_runner.py:1595] Starting to load model /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-72B-Instruct/snapshots/d3d951150c1e5848237cd6a7ad11df4836aee842/...
(VllmWorker rank=3 pid=229) INFO 06-12 01:24:38 [gpu_model_runner.py:1595] Starting to load model /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-72B-Instruct/snapshots/d3d951150c1e5848237cd6a7ad11df4836aee842/...
(VllmWorker rank=0 pid=226) INFO 06-12 01:24:38 [gpu_model_runner.py:1600] Loading model from scratch...
(VllmWorker rank=1 pid=227) INFO 06-12 01:24:38 [gpu_model_runner.py:1600] Loading model from scratch...
(VllmWorker rank=2 pid=228) INFO 06-12 01:24:38 [gpu_model_runner.py:1600] Loading model from scratch...
(VllmWorker rank=3 pid=229) INFO 06-12 01:24:38 [gpu_model_runner.py:1600] Loading model from scratch...
(VllmWorker rank=0 pid=226) INFO 06-12 01:24:38 [cuda.py:252] Using Flash Attention backend on V1 engine.
(VllmWorker rank=2 pid=228) INFO 06-12 01:24:38 [cuda.py:252] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=227) INFO 06-12 01:24:38 [cuda.py:252] Using Flash Attention backend on V1 engine.
(VllmWorker rank=3 pid=229) INFO 06-12 01:24:38 [cuda.py:252] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=226)
It looks like the error came from flashinfer's sampling kernels; you might want to consider trying again with
export VLLM_USE_FLASHINFER_SAMPLER=0
Thank you very much @xinli-centml . That seems to have solved the issue.
While we're glad the issue is resolved with the suggested setting, we don’t consider the matter fully closed. In our view, it’s problematic that the default Docker container doesn’t work out of the box on A100 GPUs — which are not particularly exotic — and requires us to manually set a parameter that’s rather obscure for most regular users.
I am curious if next build would fix your issue with this PR: https://github.com/vllm-project/vllm/pull/19297
Otherwise, it's possible that the error still exists on the main branch of https://github.com/flashinfer-ai/flashinfer and you might want to file a bug there.
Thanks for the follow-up. We're more on the "average Joe end-user" side of the VLLM ecosystem — we usually rely on the official Docker images rather than building from source or integrating bleeding-edge patches ourselves. So we’ll definitely test again once a new image is available that includes this fix.
On a broader note, since we've had a rough time with recent VLLM releases: this feels like a strategic crossroads for the project. Is VLLM aiming to be a polished product for end users, or more of a tinkering framework for ML experts?
If the goal is to be production-ready and broadly adopted, then ensuring the default Docker image works out of the box on common setups — especially widely-used platforms like A100 GPUs — is crucial. Right now, even getting a stable release of VLLM that supports core features like bigger Qwen models (including VL/image inputs), guided sampling, or OpenAI-compatible APIs with usage metrics can be a challenge. These aren’t edge cases — they’re part of a very common real-world deployment need. Something is always not working.
Our suggestion: Add automated tests for A100 SXM systems with larger models, include the necessary flags in the default image if they’re required for stability, and monitor in tests until the upstream fix is solid. We understand the complexity involved, especially with VLLM acting as an integration point for many moving parts — but in the end, users like us can't realistically track down and report bugs across multiple upstream repos when something fails deep inside the stack.
Same issue here on A100 setups - after just 12 hours it turned out vLLM 0.9.1 had crashed, or at least hit illegal memory access errors, with multiple models, even embedding/reranker models. We reverted to 0.8.5, which is bulletproof.
@andrePankraz I think the image might have passed (short) tests on A100, because the issue can take a while to surface. Also, I suppose, you didn't pay a dollar for what you get from the project, so instead you sometimes have to pay with tinkering and reporting/fixing bugs yourself.
Thank you @andrePankraz and @Ithanil for the feedback, I'll take the initiative to try to get this fixed and tracked.
I got the same error after a while running Llama 4 Maverick FP8 on a 4x H200 HGX system with v0.9.1 (bare-metal, self-compiled with CUDA 12.8 and with flashinfer).
We tried building a docker image with flashinfer 0.2.6.post1, which didn't solve this issue.
@andrePankraz @FWao @leavelet do you have a specific request / benchmarking script that reproduces it? I'm currently running vllm/vllm-openai:v0.9.1 with benchmarking scripts with both top_p and top_k specified, concurrently with guided decoding tests using xgrammar, but the server seems to behave fine.
I'm currently using a 1x RTX 3090 (Ampere) workstation before moving on to the 4x setup described above; I wonder if you have client code that can consistently reproduce the issue?
I got the same error trying to use guided choice from a list of '1', '2', '3', but this error does not happen every time. I was using qwen2.5-72b-awq and qwen3-32b-fp8.
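For illustration, here is a minimal sketch of that kind of request (my reconstruction, not the commenter's actual client code), assuming a vLLM OpenAI-compatible server on localhost:8000; the model name is a placeholder:
import requests

# Constrained decoding via vLLM's guided_choice extension: the completion is
# forced to be exactly one of the listed strings.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "qwen2.5-72b-awq",  # placeholder; use your served model name
        "messages": [{"role": "user", "content": "Answer with 1, 2 or 3: which option is best?"}],
        "max_tokens": 5,
        "temperature": 0.2,
        "guided_choice": ["1", "2", "3"],  # vLLM-specific extra parameter
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])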
I have the same issue for Qwen32B
I tested Llama 4 Maverick, Qwen 2 and 3, and DeepSeek R1 with vLLM 0.9.1 in a newly built container from commit e28533a16f73a4eae01c2b7b1b4ddf3fc1beedab on H200 GPUs. I’m consistently seeing the same error across all models. This issue tends to show up more frequently when the server is handling heavier loads with varying input sequence lengths.
Additionally, I tried setting VLLM_USE_FLASHINFER_SAMPLER=0, but that didn’t resolve the issue either.
Sorry, I have no benchmark that triggers this problem reliably - this issue happens in real live-load scenarios, as the previous commenter said.
What I can say: for us, VLLM_USE_FLASHINFER_SAMPLER=0 did help and the crashes don't happen anymore. Maybe there are also other problems that trigger "CUDA error: an illegal memory access", and not everyone did the CUDA_LAUNCH_BLOCKING=1 analysis.
Hi @andrePankraz, I tried Qwen/Qwen3-32B-FP8 on 4x A100 with a mix of guided and non-guided requests under stress; still no luck, the server handles requests fine.
Could you let me know your CUDA version on the host? I'm using Driver Version: 550.127.05 CUDA Version: 12.8
We had it with both:
- NVIDIA-SMI 575.51.03, Driver Version: 575.51.03, CUDA Version: 12.9
- NVIDIA-SMI 550.163.01, Driver Version: 550.163.01, CUDA Version: 12.4
We have the problem with Qwen2.5-72B on 4 x A100 SXM (see first post).
We don't have this problem on 4 x L40S (at least it wasn't triggered there, but on this test system we never have super high load).
I feel you - it's not nice to have such hard-to-reproduce stuff. I really cannot start throwing benchmarks at the problem to trigger it, etc. As software users, we don't have that much free server capacity or time for such in-depth analysis.
@andrePankraz thanks for the info, I started with Qwen/Qwen3-32B-FP8 since Qwen2.5-72B didn't quite fit on 4x A100 40GB; I'll poke around a bit more later in the week.
Don't worry about reproducibility, reproducing an issue is basically 90% of the progress towards solving it :)
Hello, just in case this might help, I'm having a similar issue with mistralai/Mistral-Small-3.1-24B-Instruct-2503, vllm 0.9.1 and 1 H100 80GB.
I'm testing the fix mentioned above and will report back.
Update 2: I got the same error again today. Update: I re-built vLLM again and since then it has been stable.
I can confirm that this still happens on vllm 0.9.2 with flashinfer-python 0.2.7.post1 using Llama4 Maverick FP8.
I tried to write scripts to replicate this issue, but did not succeed. The Llama4 instance is used by multiple people, mainly for heavy parallel agent tasks with a lot of tool calling and structured outputs (which might or might not be related to the issue). When using flashinfer, the crash happens again and again, but it is unclear to me what triggers it. I only saw this problem with Llama 4 Maverick FP8 (did not try the bf16 version). On another instance with Qwen3-235B-A22B-FP8, however, I never saw this issue (but that might be related to how it is used).
This is how I run Llama 4 Maverick on H200 GPUs:
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /mnt/test/meta-llama_Llama-4-Maverick-17B-128E-Instruct-FP8 --tensor-parallel-size 4 --trust-remote-code --served-model-name Llama-4-Maverick-17B-128E-Instruct-FP8 --max-model-len 131072 --port 8001 --gpu-memory-utilization 0.95 --enable-auto-tool-choice --tool-call-parser llama4_pythonic --chat-template vllm/examples/tool_chat_template_llama4_pythonic.jinja --enable-chunked-prefill --enable-prefix-caching --guided-decoding-backend xgrammar --guided-decoding-disable-any-whitespace --override-generation-config='{"attn_temperature_tuning": true}'
Running on Ubuntu 24.04 with NVIDIA Driver 575.57.08
Could you try to reproduce the failure with
export CUDA_LAUNCH_BLOCKING=1
Hello! It seems to fix the bug that I have (same as @andrePankraz). We have a similar environment: an 8 x A800 SXM machine with Docker vLLM 0.9.1, running Qwen3-32B and Qwen3-14B. I also use some features like FP8 and multi-LoRA.
Same for me, this issue still occurs with VLLM_USE_FLASHINFER_SAMPLER=0. Is there any PR to fix this issue?
- model: llama-3.3-70b-instruct-fp-dynamic
- vLLM configuration
- args:
- --model=/data/models/Llama-3.3-70B-Instruct-FP8-DYNAMIC
- --served-model-name=llm
- --tensor-parallel-size=2
- --pipeline-parallel-size=1
- --load-format=auto
- --prefix-caching-hash-algo=sha256
- --max-seq-len-to-capture=32768
- --max-model-len=32768
- --gpu-memory-utilization=0.9
- --num-scheduler-steps=1
- --disable-log-requests
- --trust-remote-code
- --uvicorn-log-level=warning
- --kv-cache-dtype=fp8
env:
- name: VLLM_USE_V1
  value: "1"
- name: OMP_NUM_THREADS
  value: "8"
- name: VLLM_USE_FLASHINFER_SAMPLER
  value: "0"
vLLM version is v0.9.2
GPU: 2*H100
Driver Version: 535.183.06
CUDA Version: 12.8
The issue was reproduced every time the following load was applied (a request sketch reproducing it is included below).
- Number of input tokens = 1000
- Prefix caching hit ratio: 0% (information injected at the beginning of the prompt changes every time)
- max_tokens = 200
- ignore_eos = True
- temperature: 0.6
- top_p: 0.95
- Repetition penalty: 1.0
The situation was slightly different in vLLM v0.9.0.
- vLLM v0.9.0 + VLLM_USE_FLASHINFER_SAMPLER=0: OK
- vLLM v0.9.0 + VLLM_USE_FLASHINFER_SAMPLER=1: same issue
In vLLM v0.8.4, it worked properly regardless of the VLLM_USE_FLASHINFER_SAMPLER value.
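For reference, a minimal sketch of a single request matching the load described above (my assumption of what such a load generator could look like, not the reporter's actual script), assuming the server from the config above is reachable on localhost:8000 with served model name "llm"; sending many of these concurrently with no shared prefix approximates the load:
import uuid
import requests

def one_request() -> str:
    # A random prefix per request keeps the prefix-cache hit ratio near 0%,
    # and the filler text brings the prompt to roughly 1000 tokens.
    prompt = f"[session {uuid.uuid4()}] " + "data point alpha beta gamma " * 200
    resp = requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": "llm",
            "prompt": prompt,
            "max_tokens": 200,
            "temperature": 0.6,
            "top_p": 0.95,
            "repetition_penalty": 1.0,  # vLLM-specific extra parameter
            "ignore_eos": True,         # vLLM-specific extra parameter
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

if __name__ == "__main__":
    print(one_request()[:200])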
I'm having a similar issue with leon-se/gemma-3-27b-it-FP8-Dynamic, vllm 0.9.2 and 1 H100 80GB.
Similar issue with Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 on H200, using docker image vllm/vllm-openai:v0.10.0.
env:
  VLLM_USE_FLASHINFER_SAMPLER: 0
  DEBUG: "true"
  TRANSFORMERS_OFFLINE: 1
args:
- "--gpu-memory-utilization"
- "0.98"
- "--model"
- "Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8"
- "--enable-auto-tool-choice"
- "--tool-call-parser"
- "qwen3_coder"
- "--tensor-parallel-size"
- "4"
- "--kv-cache-dtype"
- "fp8"
- "--rope-scaling"
- '{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":262144}'
- "--max-model-len"
- "524288"
Should we add torch.cuda.synchronize() before valid_sampled_token_ids = sampled_token_ids.tolist()? Or use valid_sampled_token_ids = sampled_token_ids.cpu().pin_memory().tolist()?
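To make that suggestion concrete, a hedged sketch of the first variant (my illustration, not a vetted vLLM patch); note this mainly makes the asynchronous error surface at the synchronize call for easier debugging rather than fixing the underlying illegal access:
import torch

def sampled_ids_to_list(sampled_token_ids: torch.Tensor) -> list:
    # Hypothetical helper, not the actual vLLM code path: flush all pending CUDA
    # work so an asynchronous illegal-memory-access error is raised here, at a
    # known location, instead of inside the following .tolist() call.
    torch.cuda.synchronize()
    # .tolist() blocks on a device-to-host copy and returns nested Python lists.
    return sampled_token_ids.tolist()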
@andrePankraz Have you solved it yet?
@hnt2601 No - and we will not patch the code ourselves or anything like that; we'll just wait for a working Docker container.
For us, this worked to stabilize the container: VLLM_USE_FLASHINFER_SAMPLER: 0
For now we are fine with that, even though it might not be optimal for performance. But without this setting, VLLM is unusable for us on the described infrastructure.
It's not true. In v0.9.2, the issue occurred even though VLLM_USE_FLASHINFER_SAMPLER was 0. Please refer to the test results I posted above.
My interpretation is that disabling FlashInfer fixes it only for some setups/workloads, because for us it has also been enough to keep vLLM stable for weeks (A100 / mostly FP8 regular chat completions).