[Bug]: RuntimeError: CUDA error: an illegal memory access was encountered. Qwen2.5-VL
Your current environment
The output of `python collect_env.py`
INFO 04-28 08:33:40 [__init__.py:239] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.31.4
Libc version: glibc-2.35
Python version: 3.11.11 | packaged by conda-forge | (main, Dec 5 2024, 14:17:24) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-204-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-PCIE-40GB
GPU 1: NVIDIA A100-PCIE-40GB
GPU 2: NVIDIA A100-PCIE-40GB
Nvidia driver version: 570.86.10
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7352 24-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
Stepping: 0
BogoMIPS: 4591.72
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization: AMD-V
L1d cache: 1.5 MiB (48 instances)
L1i cache: 1.5 MiB (48 instances)
L2 cache: 24 MiB (48 instances)
L3 cache: 256 MiB (16 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Vulnerable
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] flashinfer-python==0.2.5+cu124torch2.6
[pip3] numpy==2.2.2
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] optree==0.14.0
[pip3] pyzmq==26.4.0
[pip3] torch==2.6.0+cu124
[pip3] torchaudio==2.6.0+cu124
[pip3] torchcodec==0.3.0
[pip3] torchelastic==0.2.2
[pip3] torchvision==0.21.0+cu124
[pip3] transformers==4.51.3
[pip3] triton==3.2.0
[conda] flashinfer-python 0.2.5+cu124torch2.6 pypi_0 pypi
[conda] numpy 2.2.2 py311h5d046bc_0 conda-forge
[conda] nvidia-cublas-cu12 12.4.5.8 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.2.1.3 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.5.147 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.6.1.9 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.3.1.170 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.6.2 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.21.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.4.127 pypi_0 pypi
[conda] optree 0.14.0 pypi_0 pypi
[conda] pyzmq 26.4.0 pypi_0 pypi
[conda] torch 2.6.0+cu124 pypi_0 pypi
[conda] torchaudio 2.6.0+cu124 pypi_0 pypi
[conda] torchcodec 0.3.0 pypi_0 pypi
[conda] torchelastic 0.2.2 pypi_0 pypi
[conda] torchvision 0.21.0+cu124 pypi_0 pypi
[conda] transformers 4.51.3 pypi_0 pypi
[conda] triton 3.2.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.5.dev293+gaec9674db (git sha: aec9674db)
vLLM Build Flags:
CUDA Archs: 3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0 X SYS SYS 0-23,48-71 0 N/A
GPU1 SYS X NODE 24-47,72-95 1 N/A
GPU2 SYS NODE X 24-47,72-95 1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536
TORCH_CUDA_ARCH_LIST=3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX
NCCL_VERSION=2.21.5-1
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
CUDA_VERSION=12.4.1
PYTORCH_VERSION=2.6.0
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
CUDA_HOME=/usr/local/cuda-12.4/
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
I am trying to use vLLM to run inference with Qwen/Qwen2.5-VL-7B-Instruct using bitsandbytes quantization. Since there is no way to pass an already instantiated Hugging Face model to vLLM, I load the model separately and export it to a local path from which vLLM can load it. Here are the download and inference scripts:
download.py:
import logging
import os
import shutil
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig

os.environ["HF_HOME"] = "/vmdata/manan/.cache/huggingface"
CACHE_DIR = "/vmdata/manan/.cache/huggingface"

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

MODEL_DIR_BASE = "/vmdata/manan/"


def download_model(model_name: str, quantization: str = None):
    """Download and save model files under a subdirectory named after the given model name.

    Args:
        model_name: The model name or path to download from
        quantization: The quantization type, e.g., "4bnb" for 4-bit bitsandbytes
    """
    # Adjust target directory based on quantization
    target_suffix = f"-{quantization}" if quantization else ""
    target_dir = os.path.join(MODEL_DIR_BASE, model_name.replace("/", "--") + target_suffix)
    try:
        # Create target directory structure
        os.makedirs(target_dir, exist_ok=True)

        logger.info(f"Downloading {model_name} processor configuration...")
        processor = AutoProcessor.from_pretrained(
            model_name,
            cache_dir=CACHE_DIR,
            local_files_only=False,
            # token=True,  # Use HF token from ~/.huggingface/token if required
        )
        processor.save_pretrained(target_dir)

        logger.info(f"Downloading {model_name} model files{' with quantization: ' + quantization if quantization else ''}...")

        # Configure quantization if specified
        quantization_config = None
        if quantization == "4bnb":
            logger.info("Using 4-bit bitsandbytes double quantization...")
            quantization_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_use_double_quant=True
            )

        with torch.inference_mode():
            model_temp = Qwen2_5_VLForConditionalGeneration.from_pretrained(
                model_name,
                torch_dtype="auto",
                device_map="auto",
                offload_folder="offload",
                offload_state_dict=True,
                low_cpu_mem_usage=True,
                quantization_config=quantization_config,
                attn_implementation="flash_attention_2",
                cache_dir=CACHE_DIR,
                local_files_only=False,
                # token=True,  # Use HF token from ~/.huggingface/token if available
            )

        logger.info(f"Saving model to {target_dir}...")
        model_temp.save_pretrained(
            target_dir,
            safe_serialization=True,
            # max_shard_size="2GB"
        )

        # Cleanup temporary files
        if os.path.exists("offload"):
            shutil.rmtree("offload")
    except Exception as e:
        logger.error(f"Model download failed: {str(e)}", exc_info=True)
        raise RuntimeError(f"Failed to save model '{model_name}'") from e


if __name__ == "__main__":
    # Example usage - can be modified as needed
    # For regular model download:
    # download_model("Qwen/Qwen2.5-VL-7B-Instruct-AWQ")
    # For 4-bit bitsandbytes model:
    download_model("Qwen/Qwen2.5-VL-7B-Instruct", quantization="4bnb")
vllm_inference.py
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info
import os

os.environ["HF_HOME"] = "/vmdata/manan/.cache/huggingface"

MODEL_PATH = "/vmdata/manan/Qwen--Qwen2.5-VL-7B-Instruct-4bnb"

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 0, "video": 1},
    # enforce_eager=True,  # need this if facing Triton kernel issues
)

sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    repetition_penalty=1.05,
    max_tokens=256,
    stop_token_ids=[],
)

# For video input, you can pass the following values instead:
# "type": "video",
# "video": "<video URL>",
video_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            # Prompt (Chinese): "Please summarize the product features shown in the video in a table."
            {"type": "text", "text": "请用表格总结一下视频中的商品特点"},
            {
                "type": "video",
                "video": "https://duguang-labelling.oss-cn-shanghai.aliyuncs.com/qiansun/video_ocr/videos/50221078283.mp4",
                "total_pixels": 20480 * 28 * 28,
                "min_pixels": 16 * 28 * 28,
            },
        ],
    },
]

# Here we use video messages as a demonstration
messages = video_messages

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages, return_video_kwargs=True
)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
    "mm_processor_kwargs": video_kwargs,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
The code example is taken from the Qwen repo. However, when I run it, I get the following error:
root@11a7becaa88c:/workspace/vlm# python vllm_inference.py > error.txt
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a
slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 5.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.38it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.55it/s]
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 5.19it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.21it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.37it/s]
Process EngineCore_0:
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 400, in run_engine_core
raise e
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 329, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 71, in __init__
self._initialize_kv_caches(vllm_config)
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 129, in _initialize_kv_caches
available_gpu_memory = self.model_executor.determine_available_memory()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 75, in determine_available_memory
output = self.collective_rpc("determine_available_memory")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/utils.py", line 2456, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 183, in determine_available_memory
self.model_runner.profile_run()
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1651, in profile_run [107/1818]
hidden_states = self._dummy_run(self.max_num_tokens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1497, in _dummy_run
outputs = model(
^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1106, in forward
hidden_states = self.language_model.model(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 238, in __call__
output = self.compiled_callable(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 574, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 325, in forward
def forward(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/fx/graph_module.py", line 400, in __call__
raise e
File "/opt/conda/lib/python3.11/site-packages/torch/fx/graph_module.py", line 387, in __call__ [67/1818]
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<eval_with_key>.58", line 317, in forward
submod_0 = self.submod_0(l_inputs_embeds_, s0, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_bnb_shard_offsets, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_bias_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_, l_positions_, s2); l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_bnb_shard_offsets = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_bias_ = None
File "/opt/conda/lib/python3.11/site-packages/vllm/compilation/backends.py", line 612, in __call__
return self.compiled_graph_for_general_shape(*args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 1184, in forward
return compiled_fn(full_args)
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 323, in runtime_wrapper
all_outs = call_func_at_runtime_with_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
out = normalize_as_list(f(args))
^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 672, in inner_fn
outs = compiled_fn(args)
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 490, in wrapper
return compiled_fn(runtime_args)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/output_code.py", line 466, in __call__
return self.current_callable(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/utils.py", line 2128, in run
return model(new_inputs)
^^^^^^^^^^^^^^^^^
File "/root/.cache/vllm/torch_compile_cache/ae8e75b5c6/rank_0_0/inductor_cache/yt/cytecl5v5exjoeoz4hjoydpall6r5744eiqhyvs7endbjvcigoab.py", line 662, in call
triton_poi_fused_add_4.run(buf1, arg5_1, buf13, triton_poi_fused_add_4_xnumel, grid=grid(triton_poi_fused_add_4_xnumel), stream=stream0)
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1034, in run
self.autotune_to_one_config(*args, grid=grid, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 911, in autotune_to_one_config
timings = self.benchmark_all_configs(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 885, in benchmark_all_configs
timings = {
^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 886, in <dictcomp>
launcher: self.bench(launcher, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 787, in bench
return benchmarker.benchmark_gpu(kernel_call, rep=40)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/runtime/benchmarking.py", line 66, in wrapper
return fn(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/runtime/benchmarking.py", line 202, in benchmark_gpu
return self.triton_do_bench(_callable, **kwargs, return_mode="median")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/triton/testing.py", line 118, in do_bench
di.synchronize()
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 985, in synchronize
return torch._C._cuda_synchronize()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
The issue goes away when I use enforce_eager=True, but I believe this may reduce performance. How can this be fixed?
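For reference, a minimal sketch of the workaround (it just enables the enforce_eager flag that is commented out in the inference script above, skipping the torch.compile/Inductor path at some cost to throughput):

from vllm import LLM

MODEL_PATH = "/vmdata/manan/Qwen--Qwen2.5-VL-7B-Instruct-4bnb"

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 0, "video": 1},
    enforce_eager=True,  # run eagerly; avoids the Inductor-compiled kernel that crashes
)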
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
I'll try to reproduce and address this issue.
Getting the exact same issue with the Qwen2-VL-7B-Instruct AWQ version on the latest main branch. No issues with 0.8.4.
I'm using a simple: vllm serve modelpath
Happens with both 1 GPU and 2 GPUs (TP=2).
Encountering exactly the same issue with vLLM 0.8.5.
Can you try #17370? It should fix this issue.
Could you try https://github.com/vllm-project/vllm/pull/17435? Please rebuild from source.
The issue still persists. I am using the vLLM Docker image and upgrading vLLM to the latest nightly build.
Dockerfile:
FROM vllm/vllm-openai:latest
RUN uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
If PyTorch 2.7.0 is used, this problem should not be encountered.
@jeejeelee and I root caused it in https://github.com/vllm-project/vllm/pull/17435#issuecomment-2842210558.
The problem is that in PyTorch 2.6, Inductor sometimes erroneously changes the strides of input tensors to Triton kernels. This can and does lead to CUDA asserts, as seen in this issue. @eellison fixed this in PyTorch 2.7. We should be releasing vLLM binaries compatible with PyTorch 2.7 in the near future, so we can close this issue then.
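For intuition on why a stride mismatch can cause illegal memory accesses (an illustrative sketch only, not the actual failing kernel): a kernel compiled for one memory layout computes addresses from the strides it was built with, so handing it a tensor with different strides makes it read the wrong, possibly out-of-bounds, locations.

import torch

x = torch.arange(12).reshape(3, 4)
print(x.stride())                     # (4, 1): contiguous row-major layout
y = x.t()                             # transposed view: same storage, strides (1, 4)
print(y.stride(), y.is_contiguous())  # (1, 4) False
# A kernel that hard-codes the (4, 1) strides would compute incorrect (and
# potentially out-of-bounds) addresses if it were handed y's layout instead.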
Great to see that the root cause is known. When can we expect a release built with PyTorch 2.7?
Latest main branch has PyTorch 2.7 and I can confirm it fixes the issue for me.
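If you want to verify locally, a quick check (a sketch; it assumes you have reinstalled vLLM from a PyTorch 2.7-based build, e.g. current main or a nightly wheel):

import torch
import vllm

# Both should report a PyTorch 2.7-based stack once the fix is picked up
print(torch.__version__)  # expect 2.7.x
print(vllm.__version__)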
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
vLLM uses PyTorch 2.7 now, so this can be closed.