
[Bug]: Qwen2.5-VL-72B-Instruct-AWQ error with TP=2 and low throughput (~2 tokens/s) on VLLM_USE_V1=1

Open jlia0 opened this issue 10 months ago • 5 comments

Your current environment

My hardware is 2xA100 (80GB).

The AWQ model works with TP=1 on a single A100, but throughput is very low (~2 tokens/s) when using V1.

I have uploaded the Qwen2.5-VL-72B-Instruct-AWQ model here: https://huggingface.co/PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ

I would really appreciate any help with the TP error or the slow performance!

Feb 09  22:48:15.741 (VllmWorkerProcess pid=60) ERROR 02-09 14:48:15 multiproc_worker_utils.py:242] ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
Feb 09  22:48:15.792 ERROR 02-09 14:48:15 engine.py:389] ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
Feb 09  22:48:15.898 Process SpawnProcess-1:
Feb 09  22:48:15.898 ERROR 02-09 14:48:15 multiproc_worker_utils.py:124] Worker VllmWorkerProcess pid 60 died, exit code: -15
Feb 09  22:48:15.903 Traceback (most recent call last):
  File "/usr/local/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/vllm/vllm/engine/multiprocessing/engine.py", line 391, in run_mp_engine
    raise e
  File "/vllm/vllm/engine/multiprocessing/engine.py", line 380, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/vllm/vllm/engine/multiprocessing/engine.py", line 123, in from_engine_args
    return cls(ipc_path=ipc_path,
  File "/vllm/vllm/engine/multiprocessing/engine.py", line 75, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/vllm/vllm/engine/llm_engine.py", line 273, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
  File "/vllm/vllm/executor/executor_base.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/vllm/vllm/executor/executor_base.py", line 51, in __init__
    self._init_executor()
  File "/vllm/vllm/executor/mp_distributed_executor.py", line 125, in _init_executor
    self._run_workers("load_model",
  File "/vllm/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
    driver_worker_output = run_method(self.driver_worker, sent_method,
  File "/vllm/vllm/utils.py", line 2220, in run_method
    return func(*args, **kwargs)
  File "/vllm/vllm/worker/worker.py", line 183, in load_model
    self.model_runner.load_model()
  File "/vllm/vllm/worker/model_runner.py", line 1111, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
  File "/vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
    return loader.load_model(vllm_config=vllm_config)
  File "/vllm/vllm/model_executor/model_loader/loader.py", line 383, in load_model
    model = _initialize_model(vllm_config=vllm_config)
  File "/vllm/vllm/model_executor/model_loader/loader.py", line 125, in _initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
  File "/vllm/vllm/model_executor/models/qwen2_5_vl.py", line 810, in __init__
    self.language_model = init_vllm_registered_model(
  File "/vllm/vllm/model_executor/models/utils.py", line 260, in init_vllm_registered_model
    return _initialize_model(vllm_config=vllm_config, prefix=prefix)
  File "/vllm/vllm/model_executor/model_loader/loader.py", line 125, in _initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
  File "/vllm/vllm/model_executor/models/qwen2.py", line 453, in __init__
    self.model = Qwen2Model(vllm_config=vllm_config,
  File "/vllm/vllm/compilation/decorators.py", line 151, in __init__
    old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
  File "/vllm/vllm/model_executor/models/qwen2.py", line 307, in __init__
    self.start_layer, self.end_layer, self.layers = make_layers(
  File "/vllm/vllm/model_executor/models/utils.py", line 557, in make_layers
    [PPMissingLayer() for _ in range(start_layer)] + [
  File "/vllm/vllm/model_executor/models/utils.py", line 558, in <listcomp>
    maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
  File "/vllm/vllm/model_executor/models/qwen2.py", line 309, in <lambda>
    lambda prefix: Qwen2DecoderLayer(config=config,
  File "/vllm/vllm/model_executor/models/qwen2.py", line 220, in __init__
    self.mlp = Qwen2MLP(
  File "/vllm/vllm/model_executor/models/qwen2.py", line 82, in __init__
    self.down_proj = RowParallelLinear(
  File "/vllm/vllm/model_executor/layers/linear.py", line 1054, in __init__
    self.quant_method.create_weights(
  File "/vllm/vllm/model_executor/layers/quantization/awq.py", line 103, in create_weights
    raise ValueError(
ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
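For context on where this error comes from: AWQ stores weights in fixed-size quantization groups, and RowParallelLinear splits a layer's input dimension across tensor-parallel ranks, so the check in awq.py rejects any shard whose size is not a multiple of the group size. A minimal sketch of that arithmetic, assuming the usual group size of 128 and Qwen2.5-72B's published intermediate_size of 29568, shows why TP=1 loads but TP=2 does not:

    # Sketch of the alignment check that raises the ValueError above
    # (mirrors the check in vllm/model_executor/layers/quantization/awq.py;
    # group_size=128 and intermediate_size=29568 are assumed values).
    GROUP_SIZE = 128
    INTERMEDIATE_SIZE = 29568  # down_proj input size in Qwen2.5-72B's MLP

    for tp in (1, 2, 4, 8):
        shard = INTERMEDIATE_SIZE // tp      # input size per TP rank
        aligned = shard % GROUP_SIZE == 0    # the condition that fails
        print(f"TP={tp}: shard={shard}, aligned={aligned}")
    # TP=1 -> shard=29568, aligned=True
    # TP=2 -> shard=14784, aligned=False  <- the ValueError above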

🐛 Describe the bug

    "vllm", "serve",
    f"{MODELS_DIR}/{MODEL_NAME}",
    "--host", "127.0.0.1",
    "--port", "8000",
    "--max-model-len", "32767",
    "--max-num-batched-tokens", "32767",
    "--limit-mm-per-prompt", "image=4",
    "--tensor-parallel-size", "2",
    "--gpu-memory-utilization", "0.90",
    "--trust-remote-code",
    "--dtype", "float16",
    "--quantization", "awq",

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

jlia0 avatar Feb 09 '25 14:02 jlia0

Thanks for putting that model on huggingface!

I am not sure if I want to create a new issue for it, but we hit a different bug with your checkpoint. When trying to run on V0 with --tensor-parallel-size 2, we get:

ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.

with either the awq-marlin or awq quantization. I wonder if this is somehow related to the bug in question?

nFunctor avatar Feb 10 '25 14:02 nFunctor

> Thanks for putting that model on huggingface!
>
> I am not sure if I want to create a new issue for it, but we hit a different bug with your checkpoint. When trying to run on V0 with --tensor-parallel-size 2, we get:
>
> ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
>
> with either the awq-marlin or awq quantization. I wonder if this is somehow related to the bug in question?

Is there a Qwen2.5-VL-72B-Instruct-AWQ model that can be supported? Please add support for it if possible; I would be extremely grateful.

moshilangzi avatar Feb 13 '25 14:02 moshilangzi

same problem

qianchen94 avatar Feb 14 '25 02:02 qianchen94

I got this problem too.

ZakharovNerd avatar Feb 16 '25 10:02 ZakharovNerd

Also hitting this issue. A tensor-parallel fix would be nice for fast inference, but note that you can run this with --pipeline-parallel-size instead; see the sketch below.
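
A sketch of that workaround, assuming 2 GPUs (the model ID is an example; substitute your local path). Pipeline parallelism keeps each layer's weight matrices whole, so the AWQ group-size check never trips:

    # Shard across pipeline stages instead of tensor-parallel ranks.
    import subprocess

    subprocess.run([
        "vllm", "serve", "Qwen/Qwen2.5-VL-72B-Instruct-AWQ",  # example ID
        "--pipeline-parallel-size", "2",  # instead of --tensor-parallel-size 2
        "--dtype", "float16",
        "--quantization", "awq",
    ], check=True)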

bbss avatar Feb 16 '25 16:02 bbss

me too.....

EvanSong77 avatar Feb 20 '25 01:02 EvanSong77

@moshilangzi @qianchen94 @ZakharovNerd @bbss @EvanSong77

Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ, which supports --tensor-parallel-size 2, 4, and 8.
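
If that checkpoint loads under TP=2/4/8, the likely reason is that its intermediate_size was padded so every TP shard stays group-aligned. The sketch below uses 29696 (29568 rounded up to a multiple of 128 × 8) as an assumed padded value; check the checkpoint's config.json for the real one:

    # Hypothetical illustration of why padding restores TP alignment.
    # PADDED_SIZE = 29696 is an assumption, not read from the checkpoint.
    GROUP_SIZE = 128
    PADDED_SIZE = 29696

    for tp in (2, 4, 8):
        print(f"TP={tp}:", (PADDED_SIZE // tp) % GROUP_SIZE == 0)  # True for all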

jlia0 avatar Feb 21 '25 15:02 jlia0

> @moshilangzi @qianchen94 @ZakharovNerd @bbss @EvanSong77
>
> Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ, which supports --tensor-parallel-size 2, 4, and 8.

thanks

EvanSong77 avatar Feb 23 '25 06:02 EvanSong77

> Qwen2.5-VL-72B-Instruct-Pointer-AWQ

This model is no longer available.

linchen111 avatar Mar 14 '25 03:03 linchen111

> @moshilangzi @qianchen94 @ZakharovNerd @bbss @EvanSong77
>
> Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ, which supports --tensor-parallel-size 2, 4, and 8.

How can I get this model? The link is no longer valid.

philipwan avatar Mar 14 '25 06:03 philipwan

> > @moshilangzi @qianchen94 @ZakharovNerd @bbss @EvanSong77 Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ, which supports --tensor-parallel-size 2, 4, and 8.
>
> How can I get this model? The link is no longer valid.

You can use the official release instead: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct-AWQ

EvanSong77 avatar Mar 14 '25 06:03 EvanSong77

Increase --max-num-batched-tokens to 81920. That way more memory goes to the KV cache, and it will then process a minimum of 3 requests concurrently. Also consider raising --gpu-memory-utilization to 0.95.
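
Applied to the original launch command, that suggestion would look roughly like this; the values are the commenter's, and whether an 81920-token budget fits depends on your KV-cache headroom:

    # Sketch of the suggested tuning on top of the original args:
    # larger batched-token budget and higher GPU memory fraction.
    import subprocess

    subprocess.run([
        "vllm", "serve", "Qwen/Qwen2.5-VL-72B-Instruct-AWQ",  # example ID
        "--tensor-parallel-size", "2",
        "--max-model-len", "32767",
        "--max-num-batched-tokens", "81920",   # was 32767
        "--gpu-memory-utilization", "0.95",    # was 0.90
        "--dtype", "float16",
        "--quantization", "awq",
    ], check=True)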