[Bug]: RuntimeError: CUDA error: an illegal memory access was encountered. Qwen2.5-VL
Your current environment
The output of `python collect_env.py`
INFO 04-28 08:33:40 [__init__.py:239] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.31.4
Libc version: glibc-2.35
Python version: 3.11.11 | packaged by conda-forge | (main, Dec 5 2024, 14:17:24) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-204-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-PCIE-40GB
GPU 1: NVIDIA A100-PCIE-40GB
GPU 2: NVIDIA A100-PCIE-40GB
Nvidia driver version: 570.86.10
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7352 24-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
Stepping: 0
BogoMIPS: 4591.72
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization: AMD-V
L1d cache: 1.5 MiB (48 instances)
L1i cache: 1.5 MiB (48 instances)
L2 cache: 24 MiB (48 instances)
L3 cache: 256 MiB (16 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Vulnerable
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] flashinfer-python==0.2.5+cu124torch2.6
[pip3] numpy==2.2.2
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] optree==0.14.0
[pip3] pyzmq==26.4.0
[pip3] torch==2.6.0+cu124
[pip3] torchaudio==2.6.0+cu124
[pip3] torchcodec==0.3.0
[pip3] torchelastic==0.2.2
[pip3] torchvision==0.21.0+cu124
[pip3] transformers==4.51.3
[pip3] triton==3.2.0
[conda] flashinfer-python 0.2.5+cu124torch2.6 pypi_0 pypi
[conda] numpy 2.2.2 py311h5d046bc_0 conda-forge
[conda] nvidia-cublas-cu12 12.4.5.8 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.2.1.3 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.5.147 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.6.1.9 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.3.1.170 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.6.2 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.21.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.4.127 pypi_0 pypi
[conda] optree 0.14.0 pypi_0 pypi
[conda] pyzmq 26.4.0 pypi_0 pypi
[conda] torch 2.6.0+cu124 pypi_0 pypi
[conda] torchaudio 2.6.0+cu124 pypi_0 pypi
[conda] torchcodec 0.3.0 pypi_0 pypi
[conda] torchelastic 0.2.2 pypi_0 pypi
[conda] torchvision 0.21.0+cu124 pypi_0 pypi
[conda] transformers 4.51.3 pypi_0 pypi
[conda] triton 3.2.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.5.dev293+gaec9674db (git sha: aec9674db)
vLLM Build Flags:
CUDA Archs: 3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0 X SYS SYS 0-23,48-71 0 N/A
GPU1 SYS X NODE 24-47,72-95 1 N/A
GPU2 SYS NODE X 24-47,72-95 1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536
TORCH_CUDA_ARCH_LIST=3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX
NCCL_VERSION=2.21.5-1
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
CUDA_VERSION=12.4.1
PYTORCH_VERSION=2.6.0
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
CUDA_HOME=/usr/local/cuda-12.4/
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
I am trying to use vLLM to run inference with Qwen/Qwen2.5-VL-7B-Instruct using bitsandbytes quantization. Since there is no way to pass an already instantiated Hugging Face model to vLLM, I load the model separately and export it to a local path from which vLLM can load it. Here are the download and inference scripts:
download.py:
import logging
import os
import shutil
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig

os.environ["HF_HOME"] = "/vmdata/manan/.cache/huggingface"
CACHE_DIR = "/vmdata/manan/.cache/huggingface"

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

MODEL_DIR_BASE = "/vmdata/manan/"


def download_model(model_name: str, quantization: str = None):
    """Download and save model files under a subdirectory named after the given model name.

    Args:
        model_name: The model name or path to download from
        quantization: The quantization type, e.g., "4bnb" for 4-bit bitsandbytes
    """
    # Adjust target directory based on quantization
    target_suffix = f"-{quantization}" if quantization else ""
    target_dir = os.path.join(MODEL_DIR_BASE, model_name.replace("/", "--") + target_suffix)
    try:
        # Create target directory structure
        os.makedirs(target_dir, exist_ok=True)

        logger.info(f"Downloading {model_name} processor configuration...")
        processor = AutoProcessor.from_pretrained(
            model_name,
            cache_dir=CACHE_DIR,
            local_files_only=False,
            # token=True,  # Use HF token from ~/.huggingface/token if required
        )
        processor.save_pretrained(target_dir)

        logger.info(f"Downloading {model_name} model files{' with quantization: ' + quantization if quantization else ''}...")

        # Configure quantization if specified
        quantization_config = None
        if quantization == "4bnb":
            logger.info("Using 4-bit bitsandbytes double quantization...")
            quantization_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_use_double_quant=True
            )

        with torch.inference_mode():
            model_temp = Qwen2_5_VLForConditionalGeneration.from_pretrained(
                model_name,
                torch_dtype="auto",
                device_map="auto",
                offload_folder="offload",
                offload_state_dict=True,
                low_cpu_mem_usage=True,
                quantization_config=quantization_config,
                attn_implementation="flash_attention_2",
                cache_dir=CACHE_DIR,
                local_files_only=False,
                # token=True,  # Use HF token from ~/.huggingface/token if available
            )

        logger.info(f"Saving model to {target_dir}...")
        model_temp.save_pretrained(
            target_dir,
            safe_serialization=True,
            # max_shard_size="2GB"
        )

        # Cleanup temporary files
        if os.path.exists("offload"):
            shutil.rmtree("offload")
    except Exception as e:
        logger.error(f"Model download failed: {str(e)}", exc_info=True)
        raise RuntimeError(f"Failed to save model '{model_name}'") from e


if __name__ == "__main__":
    # Example usage - can be modified as needed
    # For regular model download:
    # download_model("Qwen/Qwen2.5-VL-7B-Instruct-AWQ")
    # For 4-bit bitsandbytes model:
    download_model("Qwen/Qwen2.5-VL-7B-Instruct", quantization="4bnb")
vllm_inference.py
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info
import os

os.environ["HF_HOME"] = "/vmdata/manan/.cache/huggingface"

MODEL_PATH = "/vmdata/manan/Qwen--Qwen2.5-VL-7B-Instruct-4bnb"

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 0, "video": 1},
    # enforce_eager=True,  # need this if facing Triton kernel issues
)

sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    repetition_penalty=1.05,
    max_tokens=256,
    stop_token_ids=[],
)

# For video input, you can pass the following values instead:
# "type": "video",
# "video": "<video URL>",
video_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            # Prompt (Chinese): "Please summarize the product features shown in the video in a table."
            {"type": "text", "text": "请用表格总结一下视频中的商品特点"},
            {
                "type": "video",
                "video": "https://duguang-labelling.oss-cn-shanghai.aliyuncs.com/qiansun/video_ocr/videos/50221078283.mp4",
                "total_pixels": 20480 * 28 * 28,
                "min_pixels": 16 * 28 * 28,
            },
        ],
    },
]

# Here we use video messages as a demonstration
messages = video_messages

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages, return_video_kwargs=True
)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
    "mm_processor_kwargs": video_kwargs,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
The code example is taken from the Qwen repo. However, when I run it, I get the following error:
root@11a7becaa88c:/workspace/vlm# python vllm_inference.py > error.txt
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a
slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 5.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.38it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.55it/s]
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 5.19it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.21it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.37it/s]
Process EngineCore_0:
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 400, in run_engine_core
raise e
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 329, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 71, in __init__
self._initialize_kv_caches(vllm_config)
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 129, in _initialize_kv_caches
available_gpu_memory = self.model_executor.determine_available_memory()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 75, in determine_available_memory
output = self.collective_rpc("determine_available_memory")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/utils.py", line 2456, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 183, in determine_available_memory
self.model_runner.profile_run()
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1651, in profile_run [107/1818]
hidden_states = self._dummy_run(self.max_num_tokens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1497, in _dummy_run
outputs = model(
^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1106, in forward
hidden_states = self.language_model.model(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 238, in __call__
output = self.compiled_callable(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 574, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 325, in forward
def forward(
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/fx/graph_module.py", line 400, in __call__
raise e
File "/opt/conda/lib/python3.11/site-packages/torch/fx/graph_module.py", line 387, in __call__ [67/1818]
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<eval_with_key>.58", line 317, in forward
submod_0 = self.submod_0(l_inputs_embeds_, s0, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_bnb_shard_offsets, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_bias_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_, l_positions_, s2); l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_bnb_shard_offsets = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_bias_ = None
File "/opt/conda/lib/python3.11/site-packages/vllm/compilation/backends.py", line 612, in __call__
return self.compiled_graph_for_general_shape(*args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 1184, in forward
return compiled_fn(full_args)
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 323, in runtime_wrapper
all_outs = call_func_at_runtime_with_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
out = normalize_as_list(f(args))
^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 672, in inner_fn
outs = compiled_fn(args)
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 490, in wrapper
return compiled_fn(runtime_args)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/output_code.py", line 466, in __call__
return self.current_callable(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/utils.py", line 2128, in run
return model(new_inputs)
^^^^^^^^^^^^^^^^^
File "/root/.cache/vllm/torch_compile_cache/ae8e75b5c6/rank_0_0/inductor_cache/yt/cytecl5v5exjoeoz4hjoydpall6r5744eiqhyvs7endbjvcigoab.py", line 662, in call
triton_poi_fused_add_4.run(buf1, arg5_1, buf13, triton_poi_fused_add_4_xnumel, grid=grid(triton_poi_fused_add_4_xnumel), stream=stream0)
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1034, in run
self.autotune_to_one_config(*args, grid=grid, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 911, in autotune_to_one_config
timings = self.benchmark_all_configs(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 885, in benchmark_all_configs
timings = {
^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 886, in <dictcomp>
launcher: self.bench(launcher, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 787, in bench
return benchmarker.benchmark_gpu(kernel_call, rep=40)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/runtime/benchmarking.py", line 66, in wrapper
return fn(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/runtime/benchmarking.py", line 202, in benchmark_gpu
return self.triton_do_bench(_callable, **kwargs, return_mode="median")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/triton/testing.py", line 118, in do_bench
di.synchronize()
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 985, in synchronize
return torch._C._cuda_synchronize()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
The issue goes away when I use enforce_eager=True, but I believe this may reduce performance. How can this be fixed?
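For reference, a minimal sketch of the workaround (it just enables the enforce_eager flag that is commented out in the inference script above, skipping the torch.compile/Inductor path at some cost to throughput):

from vllm import LLM

MODEL_PATH = "/vmdata/manan/Qwen--Qwen2.5-VL-7B-Instruct-4bnb"

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 0, "video": 1},
    enforce_eager=True,  # run eagerly; avoids the Inductor-compiled kernel that crashes
)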
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
I'll try to reproduce and address this issue.
Getting the exact same issue with the Qwen2-VL-7B-Instruct AWQ version on the latest main branch. No issues with 0.8.4.
I'm using a simple: vllm serve modelpath
Happens with both 1 GPU and 2 GPUs (TP=2).
Encountering exactly the same issue with vLLM 0.8.5.
Can you try #17370? It should fix this issue.
Could you try https://github.com/vllm-project/vllm/pull/17435? Please rebuild from source.
The issue still persists. I am using the vLLM Docker image and upgrading vLLM to the latest nightly build.
Dockerfile:
FROM vllm/vllm-openai:latest
RUN uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
If PyTorch 2.7.0 is used, this problem should not be encountered.
@jeejeelee and I root caused it in https://github.com/vllm-project/vllm/pull/17435#issuecomment-2842210558.
The problem is that in PyTorch 2.6, Inductor sometimes erroneously changes the strides of input tensors to Triton kernels. This can and does lead to CUDA asserts, as seen in this issue. @eellison fixed this in PyTorch 2.7. We should be releasing vLLM binaries compatible with PyTorch 2.7 in the near future, so we can close this issue then.
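For intuition on why a stride mismatch can cause illegal memory accesses (an illustrative sketch only, not the actual failing kernel): a kernel compiled for one memory layout computes addresses from the strides it was built with, so handing it a tensor with different strides makes it read the wrong, possibly out-of-bounds, locations.

import torch

x = torch.arange(12).reshape(3, 4)
print(x.stride())                     # (4, 1): contiguous row-major layout
y = x.t()                             # transposed view: same storage, strides (1, 4)
print(y.stride(), y.is_contiguous())  # (1, 4) False
# A kernel that hard-codes the (4, 1) strides would compute incorrect (and
# potentially out-of-bounds) addresses if it were handed y's layout instead.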
Great to see that the root cause is known. When can we expect a release built with PyTorch 2.7?
Latest main branch has PyTorch 2.7 and I can confirm it fixes the issue for me.
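If you want to verify locally, a quick check (a sketch; it assumes you have reinstalled vLLM from a PyTorch 2.7-based build, e.g. current main or a nightly wheel):

import torch
import vllm

# Both should report a PyTorch 2.7-based stack once the fix is picked up
print(torch.__version__)  # expect 2.7.x
print(vllm.__version__)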
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
vLLM uses PyTorch 2.7 now, so this can be closed.