[Bug]: I am trying to run unsloth/phi-4-bnb-4bit but I always get the same error: ValidationError: 1 validation error for ModelConfig: infer_schema(func): Parameter block_size has unsupported type list[int]
Your current environment
vllm serve "model_path" --quantization bitsandbytes
--load-format bitsandbytes
--dtype half
--block-size 32 \
--max-model-len 10k
Describe the bug
I am trying to run unsloth/phi-4-bnb-4bit but I always get the same error: ValidationError: 1 validation error for ModelConfig: infer_schema(func): Parameter block_size has unsupported type list[int]. The valid types are: dict_keys([<class 'torch.Tensor'>, typing.Optional[torch.Tensor], ... I have used --block_size 32 and I have even changed block_size=32 in vllm/vllm/config.py.
They used unsloth/tinyllama-bnb-4bit in the docs, but I can't see the difference between that model and unsloth/phi-4-bnb-4bit.
Could someone help me run a quantized model in vLLM?
Thanks.
Before submitting a new issue...
- [x] #19629
Same issue here.
Just tested a bit more and this only seems to happen on ROCm. It worked fine on my NVIDIA 4090 machine but failed on the 7900 XTX machine with the same error.
I am using an RTX 5080 and a 5090, and both show the same error.
Same here for Qwen AWQ models using awq_marlin.
Same issue with FP8.
PyTorch 2.5 environment built from source; got the same issue while serving medgemma-27b-text-it-unsloth-bnb-4bit.
Debugging in vllm-0.9.1/vllm/utils.py, the function direct_register_custom_op fails at this call; the reason seems to be that torch.library.infer_schema cannot recognize list[int]:
if hasattr(torch.library, "infer_schema"):
    schema_str = torch.library.infer_schema(op_func,
                                            mutates_args=mutates_args)
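To see why this call fails, the behaviour can be reproduced outside vLLM with a minimal sketch like the one below (assuming a PyTorch build whose infer_schema allow-list does not yet include the builtin list[int] annotation; the op functions here are made up for illustration):

from typing import List

import torch
from torch.library import infer_schema


# Hypothetical op annotated with the builtin generic, as in fp8_utils.py.
def my_op_builtin(x: torch.Tensor, block_size: list[int]) -> torch.Tensor:
    return x


# Same op annotated with typing.List, which is on infer_schema's allow-list.
def my_op_typing(x: torch.Tensor, block_size: List[int]) -> torch.Tensor:
    return x


# On the affected torch builds this raises the same ValueError as in this thread:
#   infer_schema(func): Parameter block_size has unsupported type list[int]
try:
    infer_schema(my_op_builtin, mutates_args=[])
except ValueError as exc:
    print("builtin list[int] rejected:", exc)

# This succeeds and prints the inferred schema string.
print(infer_schema(my_op_typing, mutates_args=[]))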
In file vllm-0.9.1/vllm/model_executor/layers/quantization/utils/fp8_utils.py:

direct_register_custom_op(
    op_name="apply_w8a8_block_fp8_linear",
    op_func=apply_w8a8_block_fp8_linear,
    mutates_args=[],
    fake_impl=apply_w8a8_block_fp8_linear_fake,
)
In file vllm-0.9.1/vllm/model_executor/layers/fused_moe/fused_moe.py:

direct_register_custom_op(
    op_name="outplace_fused_experts",
    op_func=outplace_fused_experts,
    mutates_args=[],
    fake_impl=outplace_fused_experts_fake,
    tags=(torch.Tag.needs_fixed_stride_order, ),
)

direct_register_custom_op(
    op_name="inplace_fused_experts",
    op_func=inplace_fused_experts,
    mutates_args=["hidden_states"],
    fake_impl=inplace_fused_experts_fake,
    tags=(torch.Tag.needs_fixed_stride_order, ),
)
The workaround: replace the parameter annotation block_size: list[int] with block_size: List[int] (and block_shape: Optional[list[int]] with block_shape: Optional[List[int]]) in the apply_w8a8_block_fp8_linear, outplace_fused_experts, and inplace_fused_experts functions, and add from typing import List at the top of those files.
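Concretely, that edit amounts to something like the following sketch against the vLLM 0.9.1 sources (only the import and the annotation change; the function bodies stay untouched):

# vllm/model_executor/layers/quantization/utils/fp8_utils.py
from typing import List, Optional  # add at the top of the file if not already present
                                   # (Optional is needed for the block_shape variant)
import torch


# Before: block_size: list[int]  (rejected by infer_schema on the affected torch builds)
# After:  block_size: List[int]  (accepted)
def apply_w8a8_block_fp8_linear(
    A: torch.Tensor,
    B: torch.Tensor,
    As: torch.Tensor,
    Bs: torch.Tensor,
    block_size: List[int],
    output_dtype: torch.dtype = torch.float16,
) -> torch.Tensor:
    ...  # existing implementation unchanged


# The same substitution (list[int] -> List[int], Optional[list[int]] ->
# Optional[List[int]]) goes into the block_shape parameters of
# outplace_fused_experts and inplace_fused_experts in fused_moe.py.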
If you're looking for a smooth way to deploy vLLM with 4-bit quantization, here's a solid base setup using Docker, NVIDIA's optimized image and bitsandbytes.
We're starting from this base image:
nvcr.io/nvidia/tritonserver:25.05-vllm-python-py3
To get bitsandbytes working inside it, we just add a few system dependencies and install the library via pip.
Dockerfile
FROM nvcr.io/nvidia/tritonserver:25.05-vllm-python-py3

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && \
    apt-get install -y \
        build-essential \
        python3-dev \
        libopenmpi-dev \
    && rm -rf /var/lib/apt/lists/*

RUN pip3 install --no-cache-dir bitsandbytes --break-system-packages

RUN python3 -c "import bitsandbytes as bnb; print('BitsAndBytes version:', bnb.__version__)"
This setup makes sure everything is in place to run 4-bit quantized models.
Run the container
Once the image is built, you can run inference on a model like Phi-4 using:
docker run --gpus all \
    -v /path/local/models:/models \
    -p 8000:8000 \
    --shm-size=1g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --rm \
    your_image_name \
    vllm serve \
        /models/phi4 \
        --port 8000 \
        --dtype auto \
        --max-model-len 12000 \
        --block_size 16 \
        --served-model-name phi4:14b
This will expose the OpenAI-compatible API at http://IP:8000/v1/chat/completions.
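A quick way to sanity-check the endpoint is a plain OpenAI-style chat completion request, for example (localhost stands in for the host IP, and the model name matches --served-model-name above):

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "phi4:14b",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])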
Note
Comparing:
phi4-14B-Q4_K_M via Ollama
vs. unsloth/phi-4-unsloth-bnb-4bit in vLLM
I found the Ollama version runs slightly faster, which is a bit annoying. Still, vLLM gives us more flexibility and is much easier to scale.
Happy to trade tips!
I had this issue because of my PyTorch version.
I'm using 2 AMD Radeon 7900 XTX video cards and I get this error when starting any model:
vllm serve /app/model/Qwen2.5-Coder-14B-Instruct --port 8002 --tensor-parallel-size 2 --gpu-memory-utilization 0.9 --max-model-len 64000
INFO 07-14 11:46:37 [__init__.py:253] Automatically detected platform rocm.
Traceback (most recent call last):
File "/usr/local/bin/vllm", line 5, in <module>
from vllm.entrypoints.cli.main import main
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/__init__.py", line 4, in <module>
from vllm.entrypoints.cli.benchmark.serve import BenchmarkServingSubcommand
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/benchmark/serve.py", line 5, in <module>
from vllm.benchmarks.serve import add_cli_args, main
File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/serve.py", line 35, in <module>
from vllm.benchmarks.datasets import (SampleRequest, add_dataset_parser,
File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/datasets.py", line 31, in <module>
from vllm.lora.utils import get_adapter_absolute_path
File "/usr/local/lib/python3.12/dist-packages/vllm/lora/utils.py", line 37, in <module>
from vllm.model_executor.models.utils import WeightsMapper
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 17, in <module>
from vllm.model_executor.model_loader.weight_utils import default_weight_loader
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 10, in <module>
from vllm.model_executor.model_loader.bitsandbytes_loader import (
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/bitsandbytes_loader.py", line 23, in <module>
from vllm.model_executor.layers.fused_moe import FusedMoE
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/__init__.py", line 8, in <module>
from vllm.model_executor.layers.fused_moe.layer import (
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 26, in <module>
from vllm.model_executor.layers.fused_moe.modular_kernel import (
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 13, in <module>
from vllm.model_executor.layers.fused_moe.utils import ( # yapf: disable
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/utils.py", line 9, in <module>
from vllm.model_executor.layers.quantization.utils.fp8_utils import (
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 78, in <module>
direct_register_custom_op(
File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 2492, in direct_register_custom_op
schema_str = torch.library.infer_schema(op_func,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_library/infer_schema.py", line 106, in infer_schema
error_fn(
File "/usr/local/lib/python3.12/dist-packages/torch/_library/infer_schema.py", line 58, in error_fn
raise ValueError(
ValueError: infer_schema(func): Parameter block_size has unsupported type list[int]. The valid types are: dict_keys([<class 'torch.Tensor'>, typing.Optional[torch.Tensor], typing.Sequence[torch.Tensor], typing.List[torch.Tensor], typing.Sequence[typing.Optional[torch.Tensor]], typing.List[typing.Optional[torch.Tensor]], <class 'int'>, typing.Optional[int], typing.Sequence[int], typing.List[int], typing.Optional[typing.Sequence[int]], typing.Optional[typing.List[int]], <class 'float'>, typing.Optional[float], typing.Sequence[float], typing.List[float], typing.Optional[typing.Sequence[float]], typing.Optional[typing.List[float]], <class 'bool'>, typing.Optional[bool], typing.Sequence[bool], typing.List[bool], typing.Optional[typing.Sequence[bool]], typing.Optional[typing.List[bool]], <class 'str'>, typing.Optional[str], typing.Union[int, float, bool], typing.Union[int, float, bool, NoneType], typing.Sequence[typing.Union[int, float, bool]], typing.List[typing.Union[int, float, bool]], <class 'torch.dtype'>, typing.Optional[torch.dtype], <class 'torch.device'>, typing.Optional[torch.device]]). Got func with signature (A: torch.Tensor, B: torch.Tensor, As: torch.Tensor, Bs: torch.Tensor, block_size: list[int], output_dtype: torch.dtype = torch.float16) -> torch.Tensor)
Same issue here.
Fix for me:
DOCKER_BUILDKIT=1 docker build --build-arg BASE_IMAGE="rocm/vllm-dev:rocm6.4.1_navi_ubuntu24.04_py3.12_pytorch_2.7_vllm_0.8.5" -f docker/Dockerfile.rocm -t vllm-rocm:10.0rc .
and execute inside the Docker container:
pip install numpy==1.26.4
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!