[Bug]: Not able to do lora inference with phi-3
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
The following error appeared when trying to do LoRA inference with Phi-3 using the newest vLLM version:
Exception while reading stream response: Loading lora data/loras/jt_snc_dpo failed
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 150, in _load_lora
lora = self._lora_model_cls.from_local_checkpoint(
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/models.py", line 225, in from_local_checkpoint
raise ValueError(
ValueError: While loading data/loras/jt_snc_dpo, expected target modules in ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj', 'embed_tokens', 'lm_head'] but received ['gate_up_proj', 'qkv_proj']. Please verify that the loaded LoRA module is correct
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/app/model_wrapper.py", line 269, in write_response_to_queue
async for chunk in generator:
File "/app/model/model.py", line 50, in generator
async for output in vllm_generator:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 666, in generate
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 660, in generate
async for request_output in stream:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 77, in __anext__
raise result
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 475, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 221, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 148, in execute_model_async
output = await make_async(self.driver_worker.execute_model
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 249, in execute_model
output = self.model_runner.execute_model(seq_group_metadata_list,
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 790, in execute_model
self.set_active_loras(lora_requests, lora_mapping)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 901, in set_active_loras
self.lora_manager.set_active_loras(lora_requests, lora_mapping)
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 113, in set_active_loras
self._apply_loras(lora_requests)
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 235, in _apply_loras
self.add_lora(lora)
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 243, in add_lora
lora = self._load_lora(lora_request)
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 162, in _load_lora
raise RuntimeError(
RuntimeError: Loading lora data/loras/jt_snc_dpo failed
Below is the config file of the adapter:
{
"alpha_pattern": {},
"auto_mapping": null,
"base_model_name_or_path": "microsoft/Phi-3-mini-128k-instruct",
"bias": "none",
"fan_in_fan_out": false,
"inference_mode": true,
"init_lora_weights": true,
"layer_replication": null,
"layers_pattern": null,
"layers_to_transform": null,
"loftq_config": {},
"lora_alpha": 64,
"lora_dropout": 0.1,
"megatron_config": null,
"megatron_core": "megatron.core",
"modules_to_save": null,
"peft_type": "LORA",
"r": 32,
"rank_pattern": {},
"revision": null,
"target_modules": [
"o_proj",
"gate_up_proj",
"down_proj",
"qkv_proj"
],
"task_type": "CAUSAL_LM",
"use_dora": false,
"use_rslora": false
}
The reason is that vLLM treats Phi-3 as a Llama architecture, i.e., it expects the merged qkv_proj to be split into separate q_proj, k_proj, and v_proj modules (and likewise gate_up_proj into gate_proj and up_proj), so the LoRA adapter must target those split names.
A simple workaround is to convert the tensor weights of your adapter/LoRA checkpoint to match that layout.
Here is a tested script in the gist. Feel free to use it.
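For reference, below is a minimal sketch of that conversion (not the linked gist, just the idea). It assumes the adapter is stored as adapter_model.safetensors, the base model is Phi-3-mini (hidden size 3072, 32 query and 32 key/value heads of dim 96, intermediate size 8192), and the directory paths are placeholders; adjust the split sizes for other variants. lora_A acts on the input and is shared by the split modules, so it is copied; lora_B is split row-wise, and target_modules in adapter_config.json is rewritten to match.

# Sketch: split merged Phi-3 LoRA modules (qkv_proj, gate_up_proj) into the
# per-projection modules vLLM expects (q_proj/k_proj/v_proj, gate_proj/up_proj).
# Assumes adapter_model.safetensors and Phi-3-mini sizes; paths are placeholders.
import json
import os

import torch
from safetensors.torch import load_file, save_file

SRC = "data/loras/jt_snc_dpo"        # original adapter directory (placeholder)
DST = "data/loras/jt_snc_dpo_split"  # converted adapter directory (placeholder)

QKV_SPLIT = [3072, 3072, 3072]       # row counts of q_proj, k_proj, v_proj
GATE_UP_SPLIT = [8192, 8192]         # row counts of gate_proj, up_proj

os.makedirs(DST, exist_ok=True)
state = load_file(os.path.join(SRC, "adapter_model.safetensors"))
new_state = {}

for name, tensor in state.items():
    if ".qkv_proj." in name:
        merged, targets, sizes = "qkv_proj", ["q_proj", "k_proj", "v_proj"], QKV_SPLIT
    elif ".gate_up_proj." in name:
        merged, targets, sizes = "gate_up_proj", ["gate_proj", "up_proj"], GATE_UP_SPLIT
    else:
        new_state[name] = tensor
        continue

    if ".lora_A." in name:
        # lora_A maps the input to rank r and is identical for every split module.
        for target in targets:
            new_state[name.replace(merged, target)] = tensor.clone()
    elif ".lora_B." in name:
        # lora_B maps rank r to the merged output; split it row-wise per projection.
        for target, chunk in zip(targets, torch.split(tensor, sizes, dim=0)):
            new_state[name.replace(merged, target)] = chunk.clone()

save_file(new_state, os.path.join(DST, "adapter_model.safetensors"))

# Point target_modules at the split projections so vLLM accepts the adapter.
with open(os.path.join(SRC, "adapter_config.json")) as f:
    config = json.load(f)
modules = set(config["target_modules"])
if "qkv_proj" in modules:
    modules = (modules - {"qkv_proj"}) | {"q_proj", "k_proj", "v_proj"}
if "gate_up_proj" in modules:
    modules = (modules - {"gate_up_proj"}) | {"gate_proj", "up_proj"}
config["target_modules"] = sorted(modules)
with open(os.path.join(DST, "adapter_config.json"), "w") as f:
    json.dump(config, f, indent=2)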
@Raibows thanks for your helpful Python script! May I ask another question? I want to use Ollama with a fine-tuned Phi-3 model (trained with QLoRA). I have successfully converted the LoRA weights into a GGML file (using llama.cpp), but I think I need to merge the qkv_proj layer weights back so that I can use it with Ollama, because right now I just get the error "Error: llama runner process has terminated: signal: abort trap error: failed to apply lora adapter". I would be grateful for any suggestions!
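For completeness, the merge direction is the same idea run backwards. A hedged sketch only: it assumes a PEFT adapter saved as adapter_model.safetensors whose q/k/v and gate/up modules share an identical lora_A (true for adapters produced by a split like the one above, but not for adapters trained with independent per-projection LoRAs, which cannot be merged this way without changing the rank), and the HF Phi-3 row order of [q, k, v] and [gate, up]. It does not cover the GGML/llama.cpp conversion itself, and the paths are placeholders.

# Sketch: merge split LoRA modules back into qkv_proj / gate_up_proj.
import os

import torch
from safetensors.torch import load_file, save_file

SRC = "data/loras/split_adapter"    # placeholder
DST = "data/loras/merged_adapter"   # placeholder

MERGES = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

def find_owner(name):
    # Return (merged_name, parts, index) if `name` belongs to a split module.
    for merged, parts in MERGES.items():
        for i, part in enumerate(parts):
            if f".{part}." in name:
                return merged, parts, i
    return None

os.makedirs(DST, exist_ok=True)
state = load_file(os.path.join(SRC, "adapter_model.safetensors"))
new_state = {}

for name, tensor in state.items():
    owner = find_owner(name)
    if owner is None:
        new_state[name] = tensor                     # o_proj, down_proj, ...
        continue
    merged, parts, i = owner
    if i != 0:
        continue                                     # handled together via parts[0]
    if ".lora_A." in name:
        # lora_A is shared, so a single copy serves the merged module.
        new_state[name.replace(parts[0], merged)] = tensor.clone()
    elif ".lora_B." in name:
        # Stack the per-projection lora_B blocks row-wise in merged order.
        rows = [state[name.replace(parts[0], p)] for p in parts]
        new_state[name.replace(parts[0], merged)] = torch.cat(rows, dim=0)

save_file(new_state, os.path.join(DST, "adapter_model.safetensors"))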
@Raibows thanks for the script! It worked like a charm!!!
ERROR 05-20 08:02:25 async_llm_engine.py:43] ValueError: While loading /data/llm_resume_profiles_phi3_v1_split, expected target modules in ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj', 'embed_tokens', 'lm_head'] but received ['gate_up_proj']. Please verify that the loaded LoRA module is correct
Can we also fix gate_up_proj in a similar way? I am using the Phi-3 128k version.
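For reference, the same row-wise split of lora_B also covers gate_up_proj; lora_A is copied unchanged to both gate_proj and up_proj. A minimal fragment, assuming Phi-3-mini's intermediate size of 8192 (adjust for other variants); the function name is illustrative:

import torch

def split_gate_up_lora_b(lora_b: torch.Tensor, intermediate_size: int = 8192):
    # lora_b has shape (2 * intermediate_size, r); the first half is gate_proj,
    # the second half is up_proj, matching the HF Phi-3 gate_up_proj layout.
    gate_b, up_b = torch.split(lora_b, [intermediate_size, intermediate_size], dim=0)
    return gate_b.clone(), up_b.clone()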