[Bug]: Qwen2.5-VL-72B-Instruct-AWQ error with TP=2 and low throughput (~2 tokens/s) on VLLM_USE_V1=1
Your current environment
My hardware is 2xA100 (80GB).
The AWQ model works with TP=1 on a single A100, but throughput is very low (~2 tokens/s) when using V1.
I have uploaded the Qwen2.5-VL-72B-Instruct-AWQ model here: https://huggingface.co/PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ
Would really appreciate any help on TP or the slow performance!!
```
Feb 09 22:48:15.741  (VllmWorkerProcess pid=60) ERROR 02-09 14:48:15 multiproc_worker_utils.py:242] ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
Feb 09 22:48:15.792  ERROR 02-09 14:48:15 engine.py:389] ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
Feb 09 22:48:15.898  Process SpawnProcess-1:
Feb 09 22:48:15.898  ERROR 02-09 14:48:15 multiproc_worker_utils.py:124] Worker VllmWorkerProcess pid 60 died, exit code: -15
Feb 09 22:48:15.903  Traceback (most recent call last):
  File "/usr/local/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/vllm/vllm/engine/multiprocessing/engine.py", line 391, in run_mp_engine
    raise e
  File "/vllm/vllm/engine/multiprocessing/engine.py", line 380, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/vllm/vllm/engine/multiprocessing/engine.py", line 123, in from_engine_args
    return cls(ipc_path=ipc_path,
  File "/vllm/vllm/engine/multiprocessing/engine.py", line 75, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/vllm/vllm/engine/llm_engine.py", line 273, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
  File "/vllm/vllm/executor/executor_base.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/vllm/vllm/executor/executor_base.py", line 51, in __init__
    self._init_executor()
  File "/vllm/vllm/executor/mp_distributed_executor.py", line 125, in _init_executor
    self._run_workers("load_model",
  File "/vllm/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
    driver_worker_output = run_method(self.driver_worker, sent_method,
  File "/vllm/vllm/utils.py", line 2220, in run_method
    return func(*args, **kwargs)
  File "/vllm/vllm/worker/worker.py", line 183, in load_model
    self.model_runner.load_model()
  File "/vllm/vllm/worker/model_runner.py", line 1111, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
  File "/vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
    return loader.load_model(vllm_config=vllm_config)
  File "/vllm/vllm/model_executor/model_loader/loader.py", line 383, in load_model
    model = _initialize_model(vllm_config=vllm_config)
  File "/vllm/vllm/model_executor/model_loader/loader.py", line 125, in _initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
  File "/vllm/vllm/model_executor/models/qwen2_5_vl.py", line 810, in __init__
    self.language_model = init_vllm_registered_model(
  File "/vllm/vllm/model_executor/models/utils.py", line 260, in init_vllm_registered_model
    return _initialize_model(vllm_config=vllm_config, prefix=prefix)
  File "/vllm/vllm/model_executor/model_loader/loader.py", line 125, in _initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
  File "/vllm/vllm/model_executor/models/qwen2.py", line 453, in __init__
    self.model = Qwen2Model(vllm_config=vllm_config,
  File "/vllm/vllm/compilation/decorators.py", line 151, in __init__
    old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
  File "/vllm/vllm/model_executor/models/qwen2.py", line 307, in __init__
    self.start_layer, self.end_layer, self.layers = make_layers(
  File "/vllm/vllm/model_executor/models/utils.py", line 557, in make_layers
    [PPMissingLayer() for _ in range(start_layer)] + [
  File "/vllm/vllm/model_executor/models/utils.py", line 558, in
```
🐛 Describe the bug
"vllm", "serve",
f"{MODELS_DIR}/{MODEL_NAME}",
"--host", "127.0.0.1",
"--port", "8000",
"--max-model-len", "32767",
"--max-num-batched-tokens", "32767",
"--limit-mm-per-prompt", "image=4",
"--tensor-parallel-size", "2",
"--gpu-memory-utilization", "0.90",
"--trust-remote-code",
"--dtype", "float16",
"--quantization", "awq",
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Thanks for putting that model on huggingface!
I am not sure if I want to create a new issue for it, but we got a different bug with your checkpoint. When trying to run on V0 with --tensor-parallel-size 2, we get:
ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
in either awq-marlin or awq quantisations. I wonder if this is somehow related to the bug in question?
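For what it's worth, here is a rough sketch of why TP>1 can trip that check. It assumes the stock Qwen2.5-72B config (intermediate_size = 29568) and an AWQ group_size of 128; verify both against the checkpoint's config.json and quantization_config before trusting the numbers.

```python
# Sketch only: assumes intermediate_size=29568 (Qwen2.5-72B) and AWQ group_size=128.
# Row-parallel layers (e.g. down_proj) shard their *input* dimension across TP ranks,
# and vLLM requires each rank's shard to contain whole quantization groups.
intermediate_size = 29568
group_size = 128

for tp in (1, 2, 4, 8):
    per_rank = intermediate_size / tp
    aligned = per_rank % group_size == 0
    print(f"tp={tp}: per-rank input size {per_rank:g}, aligned={aligned}")

# tp=1 -> 29568 (aligned); tp=2 -> 14784, and 14784 / 128 = 115.5, so the shard would not
# hold a whole number of groups -- which matches the ValueError above. A checkpoint
# re-exported with a padded intermediate size would divide evenly for TP=2/4/8.
```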
Is there a Qwen2.5-VL-72B-Instruct-AWQ model that is supported? Please add support for it if possible; I would be extremely grateful.
same problem
i got this problem too
Also hitting this issue. A tensor-parallel fix would be nice for fast inference, but note that you can run this with --pipeline-parallel-size instead.
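For anyone wanting to try that, a minimal sketch of a launch with pipeline parallelism instead of tensor parallelism. The model id and the "2" are assumptions (official AWQ repo, 2xA100 setup from the report); adjust to your environment.

```python
import subprocess

# Sketch: pipeline parallelism splits whole layers across GPUs instead of sharding each
# quantized weight, so the AWQ group-alignment check should not be hit.
cmd = [
    "vllm", "serve",
    "Qwen/Qwen2.5-VL-72B-Instruct-AWQ",   # placeholder model id
    "--pipeline-parallel-size", "2",       # assumed 2-GPU setup
    "--max-model-len", "32767",
    "--quantization", "awq",
]
subprocess.run(cmd, check=True)
```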
me too.....
@moshilangzi @qianchen94 @ZakharovNerd @bbss @EvanSong77
Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ which supports --tensor-parallel on 2, 4 and 8 GPUs.
> @moshilangzi @qianchen94 @ZakharovNerd @bbss @EvanSong77
> Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ which supports --tensor-parallel on 2, 4 and 8 GPUs.

Thanks
> @moshilangzi @qianchen94 @ZakharovNerd @bbss @EvanSong77
> Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ which supports --tensor-parallel on 2, 4 and 8 GPUs.

How can I get this model? The link is invalid.
> @moshilangzi @qianchen94 @ZakharovNerd @bbss @EvanSong77 Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ which supports --tensor-parallel on 2, 4 and 8 GPUs.
> How can I get this model? The link is invalid.

You can use https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct-AWQ
Increase the number of batched tokens to 81920; that way you allocate more memory to the KV cache, and it will then process at least 3 requests concurrently. Also consider raising GPU memory utilization to 0.95.
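As a sketch, those suggestions applied to the original launch arguments (81920 and 0.95 are the values from this comment; whether that many batched tokens actually fit depends on how much memory is left for the KV cache after loading the 72B AWQ weights):

```python
import subprocess

# Sketch: same serve command as in the bug report, with the suggested tuning applied.
# The model id is a placeholder; swap in the checkpoint you are actually serving.
cmd = [
    "vllm", "serve",
    "Qwen/Qwen2.5-VL-72B-Instruct-AWQ",
    "--max-model-len", "32767",
    "--max-num-batched-tokens", "81920",   # suggested above
    "--limit-mm-per-prompt", "image=4",
    "--gpu-memory-utilization", "0.95",    # suggested above
    "--quantization", "awq",
]
subprocess.run(cmd, check=True)
```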