[Bug]: Qwen2.5-VL-72B-Instruct-AWQ error with TP=2 and low throughput (~2 tokens/s) on VLLM_USE_V1=1
Your current environment
My hardware is 2xA100 (80GB).
The AWQ model works with TP=1 on a single A100, but throughput is very low (~2 tokens/s) when using V1.
I have uploaded the Qwen2.5-VL-72B-Instruct-AWQ model here: https://huggingface.co/PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ
Would really appreciate any help on TP or the slow performance!!
```
Feb 09 22:48:15.741  (VllmWorkerProcess pid=60) ERROR 02-09 14:48:15 multiproc_worker_utils.py:242] ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
Feb 09 22:48:15.792  ERROR 02-09 14:48:15 engine.py:389] ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
Feb 09 22:48:15.898  Process SpawnProcess-1:
Feb 09 22:48:15.898  ERROR 02-09 14:48:15 multiproc_worker_utils.py:124] Worker VllmWorkerProcess pid 60 died, exit code: -15
Feb 09 22:48:15.903  Traceback (most recent call last):
  File "/usr/local/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/vllm/vllm/engine/multiprocessing/engine.py", line 391, in run_mp_engine
    raise e
  File "/vllm/vllm/engine/multiprocessing/engine.py", line 380, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/vllm/vllm/engine/multiprocessing/engine.py", line 123, in from_engine_args
    return cls(ipc_path=ipc_path,
  File "/vllm/vllm/engine/multiprocessing/engine.py", line 75, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/vllm/vllm/engine/llm_engine.py", line 273, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
  File "/vllm/vllm/executor/executor_base.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/vllm/vllm/executor/executor_base.py", line 51, in __init__
    self._init_executor()
  File "/vllm/vllm/executor/mp_distributed_executor.py", line 125, in _init_executor
    self._run_workers("load_model",
  File "/vllm/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
    driver_worker_output = run_method(self.driver_worker, sent_method,
  File "/vllm/vllm/utils.py", line 2220, in run_method
    return func(*args, **kwargs)
  File "/vllm/vllm/worker/worker.py", line 183, in load_model
    self.model_runner.load_model()
  File "/vllm/vllm/worker/model_runner.py", line 1111, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
  File "/vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
    return loader.load_model(vllm_config=vllm_config)
  File "/vllm/vllm/model_executor/model_loader/loader.py", line 383, in load_model
    model = _initialize_model(vllm_config=vllm_config)
  File "/vllm/vllm/model_executor/model_loader/loader.py", line 125, in _initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
  File "/vllm/vllm/model_executor/models/qwen2_5_vl.py", line 810, in __init__
    self.language_model = init_vllm_registered_model(
  File "/vllm/vllm/model_executor/models/utils.py", line 260, in init_vllm_registered_model
    return _initialize_model(vllm_config=vllm_config, prefix=prefix)
  File "/vllm/vllm/model_executor/model_loader/loader.py", line 125, in _initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
  File "/vllm/vllm/model_executor/models/qwen2.py", line 453, in __init__
    self.model = Qwen2Model(vllm_config=vllm_config,
  File "/vllm/vllm/compilation/decorators.py", line 151, in __init__
    old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
  File "/vllm/vllm/model_executor/models/qwen2.py", line 307, in __init__
    self.start_layer, self.end_layer, self.layers = make_layers(
  File "/vllm/vllm/model_executor/models/utils.py", line 557, in make_layers
    [PPMissingLayer() for _ in range(start_layer)] + [
  File "/vllm/vllm/model_executor/models/utils.py", line 558, in
```
🐛 Describe the bug
"vllm", "serve",
f"{MODELS_DIR}/{MODEL_NAME}",
"--host", "127.0.0.1",
"--port", "8000",
"--max-model-len", "32767",
"--max-num-batched-tokens", "32767",
"--limit-mm-per-prompt", "image=4",
"--tensor-parallel-size", "2",
"--gpu-memory-utilization", "0.90",
"--trust-remote-code",
"--dtype", "float16",
"--quantization", "awq",
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Thanks for putting that model on huggingface!
I am not sure if I want to create a new issue for it, but we got a different bug with your checkpoint. When trying to run on V0 with --tensor-parallel-size 2, we get:
ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
in either awq-marlin or awq quantisations. I wonder if this is somehow related to the bug in question?
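For what it's worth, here is a rough sketch of why TP>1 can trip that check. It assumes the stock Qwen2.5-72B config (intermediate_size = 29568) and an AWQ group_size of 128; verify both against the checkpoint's config.json and quantization_config before trusting the numbers.

```python
# Sketch only: assumes intermediate_size=29568 (Qwen2.5-72B) and AWQ group_size=128.
# Row-parallel layers (e.g. down_proj) shard their *input* dimension across TP ranks,
# and vLLM requires each rank's shard to contain whole quantization groups.
intermediate_size = 29568
group_size = 128

for tp in (1, 2, 4, 8):
    per_rank = intermediate_size / tp
    aligned = per_rank % group_size == 0
    print(f"tp={tp}: per-rank input size {per_rank:g}, aligned={aligned}")

# tp=1 -> 29568 (aligned); tp=2 -> 14784, and 14784 / 128 = 115.5, so the shard would not
# hold a whole number of groups -- which matches the ValueError above. A checkpoint
# re-exported with a padded intermediate size would divide evenly for TP=2/4/8.
```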
Is there a Qwen2.5-VL-72B-Instruct-AWQ model that is supported? Please add support for it if possible; I would be extremely grateful.
same problem
i got this problem too
Also hitting this issue. A tensor-parallel fix would be nice for fast inference, but note that you can run this with --pipeline-parallel-size instead.
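For anyone wanting to try that, a minimal sketch of a launch with pipeline parallelism instead of tensor parallelism. The model id and the "2" are assumptions (official AWQ repo, 2xA100 setup from the report); adjust to your environment.

```python
import subprocess

# Sketch: pipeline parallelism splits whole layers across GPUs instead of sharding each
# quantized weight, so the AWQ group-alignment check should not be hit.
cmd = [
    "vllm", "serve",
    "Qwen/Qwen2.5-VL-72B-Instruct-AWQ",   # placeholder model id
    "--pipeline-parallel-size", "2",       # assumed 2-GPU setup
    "--max-model-len", "32767",
    "--quantization", "awq",
]
subprocess.run(cmd, check=True)
```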
me too.....
@moshilangzi @qianchen94 @ZakharovNerd @bbss @EvanSong77
Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ which supports --tensor-parallel on 2, 4 and 8 GPUs.
> @moshilangzi @qianchen94 @ZakharovNerd @bbss @EvanSong77
> Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ which supports --tensor-parallel on 2, 4 and 8 GPUs.

Thanks
> @moshilangzi @qianchen94 @ZakharovNerd @bbss @EvanSong77
> Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ which supports --tensor-parallel on 2, 4 and 8 GPUs.

How can I get this model? The link is invalid.
> @moshilangzi @qianchen94 @ZakharovNerd @bbss @EvanSong77 Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ which supports --tensor-parallel on 2, 4 and 8 GPUs.
> How can I get this model? The link is invalid.

You can use https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct-AWQ
Increase the number of batched tokens to 81920; that way you allocate more memory to the KV cache, and it will then process at least 3 requests concurrently. Also consider raising GPU memory utilization to 0.95.
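As a sketch, those suggestions applied to the original launch arguments (81920 and 0.95 are the values from this comment; whether that many batched tokens actually fit depends on how much memory is left for the KV cache after loading the 72B AWQ weights):

```python
import subprocess

# Sketch: same serve command as in the bug report, with the suggested tuning applied.
# The model id is a placeholder; swap in the checkpoint you are actually serving.
cmd = [
    "vllm", "serve",
    "Qwen/Qwen2.5-VL-72B-Instruct-AWQ",
    "--max-model-len", "32767",
    "--max-num-batched-tokens", "81920",   # suggested above
    "--limit-mm-per-prompt", "image=4",
    "--gpu-memory-utilization", "0.95",    # suggested above
    "--quantization", "awq",
]
subprocess.run(cmd, check=True)
```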