[Bug]: I am trying to run unsloth/phi-4-bnb-4bit but I always get the same error: ValidationError: 1 validation error for ModelConfig: infer_schema(func): Parameter block_size has unsupported type list[int]
Your current environment
vllm serve "model_path" --quantization bitsandbytes
--load-format bitsandbytes
--dtype half
--block-size 32 \
--max-model-len 10k
Describe the bug
I am trying to run unsloth/phi-4-bnb-4bit but I always get the same error: ValidationError: 1 validation error for ModelConfig: infer_schema(func): Parameter block_size has unsupported type list[int]. The valid types are: dict_keys([<class 'torch.Tensor'>, typing.Optional[torch.Tensor], ... I have used --block_size 32 and I have even changed block_size=32 in vllm/vllm/config.py.
They used unsloth/tinyllama-bnb-4bit in the docs, but I can't see the difference between that model and unsloth/phi-4-bnb-4bit.
Could someone help me run a quantized model in vLLM?
Thanks.
Before submitting a new issue...
- [x] #19629
Same issue here.
Just tested a bit more and this only seems to happen on ROCm. It worked fine on my NVIDIA 4090 machine but failed on the 7900 XTX machine with the same error.
I am using an RTX 5080 and a 5090, and both show the same error.
Same here for Qwen AWQ models using awq_marlin.
Same issue with FP8.
PyTorch 2.5 environment built from source; got the same issue while serving medgemma-27b-text-it-unsloth-bnb-4bit.
Debugging in vllm-0.9.1/vllm/utils.py, the function direct_register_custom_op fails at this call; the reason seems to be that torch.library.infer_schema cannot recognize list[int]:
if hasattr(torch.library, "infer_schema"):
    schema_str = torch.library.infer_schema(op_func,
                                            mutates_args=mutates_args)
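To see why this call fails, the behaviour can be reproduced outside vLLM with a minimal sketch like the one below (assuming a PyTorch build whose infer_schema allow-list does not yet include the builtin list[int] annotation; the op functions here are made up for illustration):

from typing import List

import torch
from torch.library import infer_schema


# Hypothetical op annotated with the builtin generic, as in fp8_utils.py.
def my_op_builtin(x: torch.Tensor, block_size: list[int]) -> torch.Tensor:
    return x


# Same op annotated with typing.List, which is on infer_schema's allow-list.
def my_op_typing(x: torch.Tensor, block_size: List[int]) -> torch.Tensor:
    return x


# On the affected torch builds this raises the same ValueError as in this thread:
#   infer_schema(func): Parameter block_size has unsupported type list[int]
try:
    infer_schema(my_op_builtin, mutates_args=[])
except ValueError as exc:
    print("builtin list[int] rejected:", exc)

# This succeeds and prints the inferred schema string.
print(infer_schema(my_op_typing, mutates_args=[]))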
In file vllm-0.9.1/vllm/model_executor/layers/quantization/utils/fp8_utils.py:

direct_register_custom_op(
    op_name="apply_w8a8_block_fp8_linear",
    op_func=apply_w8a8_block_fp8_linear,
    mutates_args=[],
    fake_impl=apply_w8a8_block_fp8_linear_fake,
)
In file vllm-0.9.1/vllm/model_executor/layers/fused_moe/fused_moe.py:

direct_register_custom_op(
    op_name="outplace_fused_experts",
    op_func=outplace_fused_experts,
    mutates_args=[],
    fake_impl=outplace_fused_experts_fake,
    tags=(torch.Tag.needs_fixed_stride_order, ),
)

direct_register_custom_op(
    op_name="inplace_fused_experts",
    op_func=inplace_fused_experts,
    mutates_args=["hidden_states"],
    fake_impl=inplace_fused_experts_fake,
    tags=(torch.Tag.needs_fixed_stride_order, ),
)
The workaround: replace the parameter annotation block_size: list[int] with block_size: List[int] (and block_shape: Optional[list[int]] with block_shape: Optional[List[int]]) in the apply_w8a8_block_fp8_linear, outplace_fused_experts, and inplace_fused_experts functions, and add from typing import List at the top of those files.
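Concretely, that edit amounts to something like the following sketch against the vLLM 0.9.1 sources (only the import and the annotation change; the function bodies stay untouched):

# vllm/model_executor/layers/quantization/utils/fp8_utils.py
from typing import List, Optional  # add at the top of the file if not already present
                                   # (Optional is needed for the block_shape variant)
import torch


# Before: block_size: list[int]  (rejected by infer_schema on the affected torch builds)
# After:  block_size: List[int]  (accepted)
def apply_w8a8_block_fp8_linear(
    A: torch.Tensor,
    B: torch.Tensor,
    As: torch.Tensor,
    Bs: torch.Tensor,
    block_size: List[int],
    output_dtype: torch.dtype = torch.float16,
) -> torch.Tensor:
    ...  # existing implementation unchanged


# The same substitution (list[int] -> List[int], Optional[list[int]] ->
# Optional[List[int]]) goes into the block_shape parameters of
# outplace_fused_experts and inplace_fused_experts in fused_moe.py.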
If you're looking for a smooth way to deploy vLLM with 4-bit quantization, here's a solid base setup using Docker, NVIDIA's optimized image and bitsandbytes.
We're starting from this base image:
nvcr.io/nvidia/tritonserver:25.05-vllm-python-py3
To get bitsandbytes working inside it, we just add a few system dependencies and install the library via pip.
Dockerfile
FROM nvcr.io/nvidia/tritonserver:25.05-vllm-python-py3

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && \
    apt-get install -y \
        build-essential \
        python3-dev \
        libopenmpi-dev \
    && rm -rf /var/lib/apt/lists/*

RUN pip3 install --no-cache-dir bitsandbytes --break-system-packages

RUN python3 -c "import bitsandbytes as bnb; print('BitsAndBytes version:', bnb.__version__)"
This setup makes sure everything is in place to run 4-bit quantized models.
Run the container
Once the image is built, you can run inference on a model like Phi-4 using:
docker run --gpus all \
    -v /path/local/models:/models \
    -p 8000:8000 \
    --shm-size=1g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --rm \
    your_image_name \
    vllm serve \
        /models/phi4 \
        --port 8000 \
        --dtype auto \
        --max-model-len 12000 \
        --block_size 16 \
        --served-model-name phi4:14b
This will expose the OpenAI-compatible API at http://IP:8000/v1/chat/completions.
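A quick way to sanity-check the endpoint is a plain OpenAI-style chat completion request, for example (localhost stands in for the host IP, and the model name matches --served-model-name above):

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "phi4:14b",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])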
Note
Comparing:
phi4-14B-Q4_K_M via Ollama
vs. unsloth/phi-4-unsloth-bnb-4bit in vLLM
I found the Ollama version runs slightly faster, which is a bit annoying. Still, vLLM gives us more flexibility and is much easier to scale.
Happy to trade tips!
I had this issue because of my PyTorch version.
I'm using 2 AMD Radeon 7900 XTX video cards and I get this error when starting any model:
vllm serve /app/model/Qwen2.5-Coder-14B-Instruct --port 8002 --tensor-parallel-size 2 --gpu-memory-utilization 0.9 --max-model-len 64000
INFO 07-14 11:46:37 [__init__.py:253] Automatically detected platform rocm.
Traceback (most recent call last):
File "/usr/local/bin/vllm", line 5, in <module>
from vllm.entrypoints.cli.main import main
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/__init__.py", line 4, in <module>
from vllm.entrypoints.cli.benchmark.serve import BenchmarkServingSubcommand
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/benchmark/serve.py", line 5, in <module>
from vllm.benchmarks.serve import add_cli_args, main
File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/serve.py", line 35, in <module>
from vllm.benchmarks.datasets import (SampleRequest, add_dataset_parser,
File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/datasets.py", line 31, in <module>
from vllm.lora.utils import get_adapter_absolute_path
File "/usr/local/lib/python3.12/dist-packages/vllm/lora/utils.py", line 37, in <module>
from vllm.model_executor.models.utils import WeightsMapper
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 17, in <module>
from vllm.model_executor.model_loader.weight_utils import default_weight_loader
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 10, in <module>
from vllm.model_executor.model_loader.bitsandbytes_loader import (
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/bitsandbytes_loader.py", line 23, in <module>
from vllm.model_executor.layers.fused_moe import FusedMoE
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/__init__.py", line 8, in <module>
from vllm.model_executor.layers.fused_moe.layer import (
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 26, in <module>
from vllm.model_executor.layers.fused_moe.modular_kernel import (
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 13, in <module>
from vllm.model_executor.layers.fused_moe.utils import ( # yapf: disable
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/utils.py", line 9, in <module>
from vllm.model_executor.layers.quantization.utils.fp8_utils import (
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 78, in <module>
direct_register_custom_op(
File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 2492, in direct_register_custom_op
schema_str = torch.library.infer_schema(op_func,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_library/infer_schema.py", line 106, in infer_schema
error_fn(
File "/usr/local/lib/python3.12/dist-packages/torch/_library/infer_schema.py", line 58, in error_fn
raise ValueError(
ValueError: infer_schema(func): Parameter block_size has unsupported type list[int]. The valid types are: dict_keys([<class 'torch.Tensor'>, typing.Optional[torch.Tensor], typing.Sequence[torch.Tensor], typing.List[torch.Tensor], typing.Sequence[typing.Optional[torch.Tensor]], typing.List[typing.Optional[torch.Tensor]], <class 'int'>, typing.Optional[int], typing.Sequence[int], typing.List[int], typing.Optional[typing.Sequence[int]], typing.Optional[typing.List[int]], <class 'float'>, typing.Optional[float], typing.Sequence[float], typing.List[float], typing.Optional[typing.Sequence[float]], typing.Optional[typing.List[float]], <class 'bool'>, typing.Optional[bool], typing.Sequence[bool], typing.List[bool], typing.Optional[typing.Sequence[bool]], typing.Optional[typing.List[bool]], <class 'str'>, typing.Optional[str], typing.Union[int, float, bool], typing.Union[int, float, bool, NoneType], typing.Sequence[typing.Union[int, float, bool]], typing.List[typing.Union[int, float, bool]], <class 'torch.dtype'>, typing.Optional[torch.dtype], <class 'torch.device'>, typing.Optional[torch.device]]). Got func with signature (A: torch.Tensor, B: torch.Tensor, As: torch.Tensor, Bs: torch.Tensor, block_size: list[int], output_dtype: torch.dtype = torch.float16) -> torch.Tensor)
Same issue here.
Fix for me:
DOCKER_BUILDKIT=1 docker build --build-arg BASE_IMAGE="rocm/vllm-dev:rocm6.4.1_navi_ubuntu24.04_py3.12_pytorch_2.7_vllm_0.8.5" -f docker/Dockerfile.rocm -t vllm-rocm:10.0rc .
and execute inside the Docker container:
pip install numpy==1.26.4
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!