[Bug]: Unable to infer QLoRA adapter using vLLM Docker

Open SMAntony opened this issue 1 year ago • 3 comments

Your current environment

  • vLLM Docker image version: 0.6.3
  • Base model id: TheBloke/WizardLM-13B-V1.2-GPTQ (revision: gptq-8bit-128g-actorder_False)
  • LoRA adapter: a custom adapter trained with QLoRA
  • OS version: Ubuntu 22.04

Model Input Dumps

No response

🐛 Describe the bug

docker run --name ocr_llm --gpus all --shm-size 1g -p 8010:8000 \
    -v $volume:/data vllm/vllm-openai:latest \
    --model $model --enable-lora --lora-modules $LORA_ADAPTERS \
    --quantization gptq --gpu-memory-utilization 0.95

curl http://localhost:8010/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "adapter_id",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }' | jq
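
vLLM's --lora-modules flag takes name=path pairs, so $LORA_ADAPTERS presumably expands to something like the following; the adapter name is taken from the curl request and the path from the traceback below, but the exact expansion is an assumption:

# Hypothetical expansion of the variables above.
model=TheBloke/WizardLM-13B-V1.2-GPTQ
LORA_ADAPTERS=adapter_id=/data/adapters/container/checkpoint-2000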

When I run the curl command, the vLLM Docker container crashes with the following error:

2024-10-16T03:44:47.177951962Z INFO:     172.17.0.1:50230 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
2024-10-16T03:44:47.183535781Z INFO:     Shutting down
2024-10-16T03:44:47.190118292Z ERROR 10-15 20:44:47 engine.py:160] RuntimeError('Error in model execution: Loading lora /data/adapters/container/checkpoint-2000 failed')
2024-10-16T03:44:47.190147783Z ERROR 10-15 20:44:47 engine.py:160] Traceback (most recent call last):
2024-10-16T03:44:47.190153223Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/lora/worker_manager.py", line 94, in _load_adapter
2024-10-16T03:44:47.190158163Z ERROR 10-15 20:44:47 engine.py:160]     lora = self._lora_model_cls.from_local_checkpoint(
2024-10-16T03:44:47.190162793Z ERROR 10-15 20:44:47 engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-16T03:44:47.190167313Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/lora/models.py", line 218, in from_local_checkpoint
2024-10-16T03:44:47.190172103Z ERROR 10-15 20:44:47 engine.py:160]     module_name, _ = parse_fine_tuned_lora_name(lora_module)
2024-10-16T03:44:47.190176683Z ERROR 10-15 20:44:47 engine.py:160]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-16T03:44:47.190181333Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/lora/utils.py", line 114, in parse_fine_tuned_lora_name
2024-10-16T03:44:47.190186093Z ERROR 10-15 20:44:47 engine.py:160]     raise ValueError(f"{name} is unsupported LoRA weight")
2024-10-16T03:44:47.190202703Z ERROR 10-15 20:44:47 engine.py:160] ValueError: base_model.model.lm_head.base_layer.weight is unsupported LoRA weight
2024-10-16T03:44:47.190208063Z ERROR 10-15 20:44:47 engine.py:160] 
2024-10-16T03:44:47.190213573Z ERROR 10-15 20:44:47 engine.py:160] The above exception was the direct cause of the following exception:
2024-10-16T03:44:47.190218814Z ERROR 10-15 20:44:47 engine.py:160] 
2024-10-16T03:44:47.190223154Z ERROR 10-15 20:44:47 engine.py:160] Traceback (most recent call last):
2024-10-16T03:44:47.190228604Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
2024-10-16T03:44:47.190234504Z ERROR 10-15 20:44:47 engine.py:160]     return func(*args, **kwargs)
2024-10-16T03:44:47.190239324Z ERROR 10-15 20:44:47 engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^
2024-10-16T03:44:47.190243794Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1626, in execute_model
2024-10-16T03:44:47.190248484Z ERROR 10-15 20:44:47 engine.py:160]     self.set_active_loras(model_input.lora_requests,
2024-10-16T03:44:47.190254034Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1322, in set_active_loras
2024-10-16T03:44:47.190258684Z ERROR 10-15 20:44:47 engine.py:160]     self.lora_manager.set_active_adapters(lora_requests, lora_mapping)
2024-10-16T03:44:47.190263164Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/lora/worker_manager.py", line 136, in set_active_adapters
2024-10-16T03:44:47.190267794Z ERROR 10-15 20:44:47 engine.py:160]     set_active_adapters_worker(requests, mapping, self._apply_adapters,
2024-10-16T03:44:47.190272244Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/adapter_commons/utils.py", line 52, in set_active_adapters_worker
2024-10-16T03:44:47.190276894Z ERROR 10-15 20:44:47 engine.py:160]     apply_adapters_func(requests)
2024-10-16T03:44:47.190281274Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/lora/worker_manager.py", line 195, in _apply_adapters
2024-10-16T03:44:47.190285844Z ERROR 10-15 20:44:47 engine.py:160]     self.add_adapter(lora)
2024-10-16T03:44:47.190290204Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/lora/worker_manager.py", line 204, in add_adapter
2024-10-16T03:44:47.190294794Z ERROR 10-15 20:44:47 engine.py:160]     lora = self._load_adapter(lora_request)
2024-10-16T03:44:47.190299235Z ERROR 10-15 20:44:47 engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-16T03:44:47.190303695Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/lora/worker_manager.py", line 107, in _load_adapter
2024-10-16T03:44:47.190308345Z ERROR 10-15 20:44:47 engine.py:160]     raise RuntimeError(f"Loading lora {lora_path} failed") from e
2024-10-16T03:44:47.190318195Z ERROR 10-15 20:44:47 engine.py:160] RuntimeError: Loading lora /data/adapters/container/checkpoint-2000 failed
2024-10-16T03:44:47.190322725Z ERROR 10-15 20:44:47 engine.py:160] 
2024-10-16T03:44:47.190327075Z ERROR 10-15 20:44:47 engine.py:160] The above exception was the direct cause of the following exception:
2024-10-16T03:44:47.190331575Z ERROR 10-15 20:44:47 engine.py:160] 
2024-10-16T03:44:47.190335925Z ERROR 10-15 20:44:47 engine.py:160] Traceback (most recent call last):
2024-10-16T03:44:47.190340285Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 158, in start
2024-10-16T03:44:47.190344875Z ERROR 10-15 20:44:47 engine.py:160]     self.run_engine_loop()
2024-10-16T03:44:47.190349575Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 221, in run_engine_loop
2024-10-16T03:44:47.190354275Z ERROR 10-15 20:44:47 engine.py:160]     request_outputs = self.engine_step()
2024-10-16T03:44:47.190358625Z ERROR 10-15 20:44:47 engine.py:160]                       ^^^^^^^^^^^^^^^^^^
2024-10-16T03:44:47.190362975Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 239, in engine_step
2024-10-16T03:44:47.190367575Z ERROR 10-15 20:44:47 engine.py:160]     raise e
2024-10-16T03:44:47.190371895Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 230, in engine_step
2024-10-16T03:44:47.190376525Z ERROR 10-15 20:44:47 engine.py:160]     return self.engine.step()
2024-10-16T03:44:47.190381126Z ERROR 10-15 20:44:47 engine.py:160]            ^^^^^^^^^^^^^^^^^^
2024-10-16T03:44:47.190385636Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1386, in step
2024-10-16T03:44:47.190390296Z ERROR 10-15 20:44:47 engine.py:160]     outputs = self.model_executor.execute_model(
2024-10-16T03:44:47.190394756Z ERROR 10-15 20:44:47 engine.py:160]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-16T03:44:47.190399186Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 134, in execute_model
2024-10-16T03:44:47.190403776Z ERROR 10-15 20:44:47 engine.py:160]     output = self.driver_worker.execute_model(execute_model_req)
2024-10-16T03:44:47.190408186Z ERROR 10-15 20:44:47 engine.py:160]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-16T03:44:47.190412556Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
2024-10-16T03:44:47.190417166Z ERROR 10-15 20:44:47 engine.py:160]     output = self.model_runner.execute_model(
2024-10-16T03:44:47.190421656Z ERROR 10-15 20:44:47 engine.py:160]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-16T03:44:47.190431216Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2024-10-16T03:44:47.190435876Z ERROR 10-15 20:44:47 engine.py:160]     return func(*args, **kwargs)
2024-10-16T03:44:47.190440256Z ERROR 10-15 20:44:47 engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^
2024-10-16T03:44:47.190444976Z ERROR 10-15 20:44:47 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 146, in _wrapper
2024-10-16T03:44:47.190449596Z ERROR 10-15 20:44:47 engine.py:160]     raise type(err)(f"Error in model execution: "
2024-10-16T03:44:47.190454156Z ERROR 10-15 20:44:47 engine.py:160] RuntimeError: Error in model execution: Loading lora /data/adapters/container/checkpoint-2000 failed
2024-10-16T03:44:47.284020691Z INFO:     Waiting for application shutdown.
2024-10-16T03:44:47.284088552Z INFO:     Application shutdown complete.
2024-10-16T03:44:47.284718440Z INFO:     Finished server process [1]

Before submitting a new issue...

  • [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

SMAntony avatar Oct 16 '24 04:10 SMAntony

It looks like there's a problem with your LoRA weights, or you've loaded incorrect weights:

2024-10-16T03:44:47.190186093Z ERROR 10-15 20:44:47 engine.py:160]     raise ValueError(f"{name} is unsupported LoRA weight")
2024-10-16T03:44:47.190202703Z ERROR 10-15 20:44:47 engine.py:160] ValueError: base_model.model.lm_head.base_layer.weight is unsupported LoRA weight

jeejeelee avatar Oct 16 '24 04:10 jeejeelee
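
One way to confirm this is to list the tensor names stored in the adapter checkpoint. A minimal sketch, assuming the adapter was saved as adapter_model.safetensors (the directory comes from the traceback above):

from safetensors import safe_open

# Path assumed from the traceback; adjust to your checkpoint layout.
ADAPTER_FILE = "/data/adapters/container/checkpoint-2000/adapter_model.safetensors"

with safe_open(ADAPTER_FILE, framework="pt") as f:
    for name in f.keys():
        # vLLM's parser only accepts lora_A / lora_B (and lora_embedding_*)
        # tensors; extra keys such as *.base_layer.weight, written when the
        # adapter config uses modules_to_save, raise the ValueError above.
        if not any(tag in name for tag in ("lora_A", "lora_B", "lora_embedding")):
            print("unsupported by vLLM:", name)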

It looks like there's a problem with your LoRA weights, or you've loaded incorrect weights:

2024-10-16T03:44:47.190186093Z ERROR 10-15 20:44:47 engine.py:160]     raise ValueError(f"{name} is unsupported LoRA weight")
2024-10-16T03:44:47.190202703Z ERROR 10-15 20:44:47 engine.py:160] ValueError: base_model.model.lm_head.base_layer.weight is unsupported LoRA weight

I am able to load the adapter and run inference with Transformers (via PEFT). Is it wrong to expect that it should work with vLLM as well?

SMAntony avatar Oct 18 '24 16:10 SMAntony
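
For reference, the Transformers/PEFT path described above looks roughly like this. This is a sketch, not the reporter's actual script: the model id, revision, and adapter path come from the issue, while device placement and generation settings are assumptions (loading a GPTQ model this way also requires the optimum/auto-gptq stack):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "TheBloke/WizardLM-13B-V1.2-GPTQ"

base = AutoModelForCausalLM.from_pretrained(
    BASE, revision="gptq-8bit-128g-actorder_False", device_map="auto"
)
# PEFT understands modules_to_save entries, which is why this path works
# even though vLLM rejects the same checkpoint.
model = PeftModel.from_pretrained(base, "/data/adapters/container/checkpoint-2000")

tok = AutoTokenizer.from_pretrained(BASE)
inputs = tok("San Francisco is a", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=7)[0], skip_special_tokens=True))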

Does vLLM not support QLoRA for GPTQ models yet?

SMAntony avatar Oct 18 '24 16:10 SMAntony

Does vLLM not support QLoRA for GPTQ models yet?

vLLM does support QLoRA for GPTQ models.

jeejeelee avatar Oct 19 '24 02:10 jeejeelee

Does vLLM not support QLoRA for GPTQ models yet?

vLLM does support QLoRA for GPTQ models.

If loading the adapter works with Transformers, then it should also work with vLLM, right?

SMAntony avatar Oct 19 '24 02:10 SMAntony

I think mine is a duplicate of #4186. Looks like #8082 will fix it.

SMAntony avatar Oct 19 '24 02:10 SMAntony

You mean your LoRA config includes modules_to_save, right? If that's the case, current vLLM doesn't support this.

jeejeelee avatar Oct 21 '24 03:10 jeejeelee
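
For illustration, a PEFT config along these lines (a hypothetical reconstruction; the target modules and hyperparameters are assumptions, not the reporter's actual config) writes full copies of the listed modules into the adapter checkpoint, producing extra non-LoRA keys like the base_layer one in the traceback:

from peft import LoraConfig

# Hypothetical QLoRA config; r, lora_alpha, and target_modules are assumptions.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Full copies of these modules are saved alongside the LoRA matrices;
    # this is the part current vLLM cannot load.
    modules_to_save=["lm_head", "embed_tokens"],
    task_type="CAUSAL_LM",
)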

If loading the adapter works with Transformers, then it should also work with vLLM, right?

That's not the case. For example, vLLM currently doesn't support things like DoRA.

jeejeelee avatar Oct 21 '24 03:10 jeejeelee

You mean your LoRA config includes modules_to_save, right? If that's the case, current vLLM doesn't support this.

Yes, Sir. Am I right in saying #8082 will fix it? Thanks for the help, I will close the issue.

SMAntony avatar Oct 21 '24 07:10 SMAntony

see: https://github.com/vllm-project/vllm/issues/4186#issuecomment-2241307064

jeejeelee avatar Oct 21 '24 08:10 jeejeelee
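
The linked comment suggests merging the adapter into the base model and serving the merged weights without --enable-lora. A sketch of that workaround under stated assumptions: PEFT generally cannot merge into a GPTQ-quantized base, so this assumes merging into the unquantized base model (the model id and output path are assumptions):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "WizardLM/WizardLM-13B-V1.2"  # assumed unquantized base model
ADAPTER = "/data/adapters/container/checkpoint-2000"
OUT = "/data/merged-model"

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")
merged = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload()

# merge_and_unload folds lora_A/lora_B (and any modules_to_save copies)
# into plain model weights, so vLLM can serve OUT as a regular --model.
merged.save_pretrained(OUT)
AutoTokenizer.from_pretrained(BASE).save_pretrained(OUT)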

see: #4186 (comment)

But if I merge the LoRA adapter, I cannot use multiple LoRAs, right?

SMAntony avatar Oct 21 '24 12:10 SMAntony

see: #4186 (comment)

But if I merge the LoRA adapter, I cannot use multiple LoRAs, right?

For your case, yes.

jeejeelee avatar Oct 21 '24 15:10 jeejeelee