[Bug]: The size of tensor a (49472) must match the size of tensor b (49664) at non-singleton dimension 1
Your current environment
I'm trying to use a LoRA adapter alongside its base model on my CPU-only machine, from a Docker container.
I first build a local CPU-based image, as below:
git checkout v0.8.5
docker build -f docker/Dockerfile.cpu -t vllm-0.8.5-cpu-env --shm-size=4g .
From there, I'm able to serve the model and the associated LoRA adapter successfully:
docker run -it --mount type=bind,source=/home/agallice/dev/hugging-face-models/,target=/hf-models --rm --network=host vllm-0.8.5-cpu-env --model /hf-models/granite-3.2-8b-instruct --max-model-len 16384 --enable-lora --lora-modules uncertainty-lora=/hf-models/granite-uncertainty-3.2-8b-lora
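For reference, the adapter itself can then be requested by the name registered with --lora-modules. A minimal sketch using the openai Python client, assuming the server is reachable on the default port 8000:

from openai import OpenAI

# Point the OpenAI-compatible client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="uncertainty-lora",  # name registered via --lora-modules
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)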
However, this all breaks on the first request, even when simply requesting the base model, as below:
[main_upstream @ uncertainty-quarkus-experiments]$ curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "/hf-models/granite-3.2-8b-instruct",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'
The served engine crashes with the stack trace reproduced in the bug description below.
I've experimented with different setups, command-line arguments, and so on, to no avail at this stage.
The issue seems to lie in lora_ops.py; I suspect this is a bug in vLLM.
Any idea or guidance about this? I wonder whether this is a CPU-only kind of issue.
🐛 Describe the bug
The server stack trace on crash is below:
ERROR 04-29 14:55:14 [engine.py:160] RuntimeError('The size of tensor a (49472) must match the size of tensor b (49664) at non-singleton dimension 1')
ERROR 04-29 14:55:14 [engine.py:160] Traceback (most recent call last):
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 158, in start
ERROR 04-29 14:55:14 [engine.py:160] self.run_engine_loop()
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 221, in run_engine_loop
ERROR 04-29 14:55:14 [engine.py:160] request_outputs = self.engine_step()
ERROR 04-29 14:55:14 [engine.py:160] ^^^^^^^^^^^^^^^^^^
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 247, in engine_step
ERROR 04-29 14:55:14 [engine.py:160] raise e
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 230, in engine_step
ERROR 04-29 14:55:14 [engine.py:160] return self.engine.step()
ERROR 04-29 14:55:14 [engine.py:160] ^^^^^^^^^^^^^^^^^^
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 1431, in step
ERROR 04-29 14:55:14 [engine.py:160] outputs = self.model_executor.execute_model(
ERROR 04-29 14:55:14 [engine.py:160] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 299, in execute_model
ERROR 04-29 14:55:14 [engine.py:160] driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 04-29 14:55:14 [engine.py:160] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 144, in _driver_execute_model
ERROR 04-29 14:55:14 [engine.py:160] return self.driver_worker.execute_model(execute_model_req)
ERROR 04-29 14:55:14 [engine.py:160] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 420, in execute_model
ERROR 04-29 14:55:14 [engine.py:160] output = self.model_runner.execute_model(
ERROR 04-29 14:55:14 [engine.py:160] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-29 14:55:14 [engine.py:160] return func(*args, **kwargs)
ERROR 04-29 14:55:14 [engine.py:160] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/vllm/worker/cpu_model_runner.py", line 664, in execute_model
ERROR 04-29 14:55:14 [engine.py:160] logits = self.model.compute_logits(hidden_states,
ERROR 04-29 14:55:14 [engine.py:160] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/models/granite.py", line 463, in compute_logits
ERROR 04-29 14:55:14 [engine.py:160] logits = self.logits_processor(self.lm_head, hidden_states,
ERROR 04-29 14:55:14 [engine.py:160] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-29 14:55:14 [engine.py:160] return self._call_impl(*args, **kwargs)
ERROR 04-29 14:55:14 [engine.py:160] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-29 14:55:14 [engine.py:160] return forward_call(*args, **kwargs)
ERROR 04-29 14:55:14 [engine.py:160] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/vllm/lora/layers.py", line 1164, in forward
ERROR 04-29 14:55:14 [engine.py:160] return type(self.base_layer).forward(self, *args, **kwargs)
ERROR 04-29 14:55:14 [engine.py:160] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/logits_processor.py", line 70, in forward
ERROR 04-29 14:55:14 [engine.py:160] logits = self._get_logits(hidden_states, lm_head, embedding_bias)
ERROR 04-29 14:55:14 [engine.py:160] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/vllm/lora/layers.py", line 1155, in _get_logits
ERROR 04-29 14:55:14 [engine.py:160] self.punica_wrapper.add_lora_logits(logits, hidden_states,
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/vllm/lora/punica_wrapper/punica_cpu.py", line 343, in add_lora_logits
ERROR 04-29 14:55:14 [engine.py:160] bgmv_expand(buffer,
ERROR 04-29 14:55:14 [engine.py:160] File "/opt/venv/lib/python3.12/site-packages/vllm/lora/ops/torch_ops/lora_ops.py", line 40, in bgmv_expand
ERROR 04-29 14:55:14 [engine.py:160] output_tensor[:, :outputs.shape[1]] += outputs[:limit, :]
ERROR 04-29 14:55:14 [engine.py:160] RuntimeError: The size of tensor a (49472) must match the size of tensor b (49664) at non-singleton dimension 1
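The failing statement can be reproduced in isolation with the shapes taken from the error message. A minimal sketch, assuming 49472 is the padded logits width of the base model and 49664 the width produced by the LoRA lm_head expansion (those correspondences are my guess):

import torch

num_tokens = 1
output_tensor = torch.zeros(num_tokens, 49472)  # logits buffer (tensor a)
outputs = torch.zeros(num_tokens, 49664)        # LoRA expand result (tensor b)
limit = output_tensor.shape[0]

# Same statement as lora_ops.py line 40: the slice on the left is capped at
# 49472 columns, so adding a 49664-wide tensor raises the RuntimeError above.
output_tensor[:, :outputs.shape[1]] += outputs[:limit, :]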
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Looks like a LoRA problem
Are you using this LoRA model?
@jeejeelee Yes, this is the LoRA model I'm trying to use along with this base model.
I've downloaded both locally.
Hello @jeejeelee, I wonder whether the issue can be reproduced on a machine other than mine?
I can reproduce this issue and am working on a fix.
Many thanks for stepping in @frreiss 👍
On my side, I was able to merge the LoRA adapter and the base model on another platform. From there, I'm now able to run the merged model on my CPU-only machine with vLLM in a Docker image.
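For anyone hitting the same error, a merge of that kind can be done offline, for example with the peft library (peft is an assumption on my side, the report does not state which tool was used; the output path is a placeholder):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, apply the adapter, then bake the LoRA weights in.
base = AutoModelForCausalLM.from_pretrained("/hf-models/granite-3.2-8b-instruct")
model = PeftModel.from_pretrained(base, "/hf-models/granite-uncertainty-3.2-8b-lora")
merged = model.merge_and_unload()

# Save the merged model (placeholder output directory) together with the tokenizer.
merged.save_pretrained("/hf-models/granite-uncertainty-3.2-8b-merged")
AutoTokenizer.from_pretrained("/hf-models/granite-3.2-8b-instruct").save_pretrained(
    "/hf-models/granite-uncertainty-3.2-8b-merged")

vLLM can then serve the merged directory directly, without --enable-lora or --lora-modules.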
Please let me know in case you need a beta tester for the investigation or fix.
@aldettinger Can you test whether #18773 fixes your issue?
@jeejeelee Sure, looks ok.