[Bug]: Not able to do lora inference with phi-3
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
The following error appeared when trying to do LoRA inference with Phi-3 using the newest vLLM version:
Exception while reading stream response: Loading lora data/loras/jt_snc_dpo failed
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 150, in _load_lora
lora = self._lora_model_cls.from_local_checkpoint(
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/models.py", line 225, in from_local_checkpoint
raise ValueError(
ValueError: While loading data/loras/jt_snc_dpo, expected target modules in ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj', 'embed_tokens', 'lm_head'] but received ['gate_up_proj', 'qkv_proj']. Please verify that the loaded LoRA module is correct
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/app/model_wrapper.py", line 269, in write_response_to_queue
async for chunk in generator:
File "/app/model/model.py", line 50, in generator
async for output in vllm_generator:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 666, in generate
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 660, in generate
async for request_output in stream:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 77, in __anext__
raise result
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 475, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 221, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 148, in execute_model_async
output = await make_async(self.driver_worker.execute_model
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 249, in execute_model
output = self.model_runner.execute_model(seq_group_metadata_list,
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 790, in execute_model
self.set_active_loras(lora_requests, lora_mapping)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 901, in set_active_loras
self.lora_manager.set_active_loras(lora_requests, lora_mapping)
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 113, in set_active_loras
self._apply_loras(lora_requests)
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 235, in _apply_loras
self.add_lora(lora)
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 243, in add_lora
lora = self._load_lora(lora_request)
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 162, in _load_lora
raise RuntimeError(
RuntimeError: Loading lora data/loras/jt_snc_dpo failed
Below is the config file of the adapter:
{
"alpha_pattern": {},
"auto_mapping": null,
"base_model_name_or_path": "microsoft/Phi-3-mini-128k-instruct",
"bias": "none",
"fan_in_fan_out": false,
"inference_mode": true,
"init_lora_weights": true,
"layer_replication": null,
"layers_pattern": null,
"layers_to_transform": null,
"loftq_config": {},
"lora_alpha": 64,
"lora_dropout": 0.1,
"megatron_config": null,
"megatron_core": "megatron.core",
"modules_to_save": null,
"peft_type": "LORA",
"r": 32,
"rank_pattern": {},
"revision": null,
"target_modules": [
"o_proj",
"gate_up_proj",
"down_proj",
"qkv_proj"
],
"task_type": "CAUSAL_LM",
"use_dora": false,
"use_rslora": false
}
The reason is that vLLM treats Phi-3 as a Llama architecture, i.e., it expects the merged qkv_proj to be split into separate q_proj, k_proj, and v_proj modules (and likewise gate_up_proj into gate_proj and up_proj), so the LoRA adapter must target those split names.
A simple workaround is to convert the tensor weights of your adapter/LoRA checkpoint to match that layout.
Here is a tested script in the gist. Feel free to use it.
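For reference, below is a minimal sketch of that conversion (not the linked gist, just the idea). It assumes the adapter is stored as adapter_model.safetensors, the base model is Phi-3-mini (hidden size 3072, 32 query and 32 key/value heads of dim 96, intermediate size 8192), and the directory paths are placeholders; adjust the split sizes for other variants. lora_A acts on the input and is shared by the split modules, so it is copied; lora_B is split row-wise, and target_modules in adapter_config.json is rewritten to match.

# Sketch: split merged Phi-3 LoRA modules (qkv_proj, gate_up_proj) into the
# per-projection modules vLLM expects (q_proj/k_proj/v_proj, gate_proj/up_proj).
# Assumes adapter_model.safetensors and Phi-3-mini sizes; paths are placeholders.
import json
import os

import torch
from safetensors.torch import load_file, save_file

SRC = "data/loras/jt_snc_dpo"        # original adapter directory (placeholder)
DST = "data/loras/jt_snc_dpo_split"  # converted adapter directory (placeholder)

QKV_SPLIT = [3072, 3072, 3072]       # row counts of q_proj, k_proj, v_proj
GATE_UP_SPLIT = [8192, 8192]         # row counts of gate_proj, up_proj

os.makedirs(DST, exist_ok=True)
state = load_file(os.path.join(SRC, "adapter_model.safetensors"))
new_state = {}

for name, tensor in state.items():
    if ".qkv_proj." in name:
        merged, targets, sizes = "qkv_proj", ["q_proj", "k_proj", "v_proj"], QKV_SPLIT
    elif ".gate_up_proj." in name:
        merged, targets, sizes = "gate_up_proj", ["gate_proj", "up_proj"], GATE_UP_SPLIT
    else:
        new_state[name] = tensor
        continue

    if ".lora_A." in name:
        # lora_A maps the input to rank r and is identical for every split module.
        for target in targets:
            new_state[name.replace(merged, target)] = tensor.clone()
    elif ".lora_B." in name:
        # lora_B maps rank r to the merged output; split it row-wise per projection.
        for target, chunk in zip(targets, torch.split(tensor, sizes, dim=0)):
            new_state[name.replace(merged, target)] = chunk.clone()

save_file(new_state, os.path.join(DST, "adapter_model.safetensors"))

# Point target_modules at the split projections so vLLM accepts the adapter.
with open(os.path.join(SRC, "adapter_config.json")) as f:
    config = json.load(f)
modules = set(config["target_modules"])
if "qkv_proj" in modules:
    modules = (modules - {"qkv_proj"}) | {"q_proj", "k_proj", "v_proj"}
if "gate_up_proj" in modules:
    modules = (modules - {"gate_up_proj"}) | {"gate_proj", "up_proj"}
config["target_modules"] = sorted(modules)
with open(os.path.join(DST, "adapter_config.json"), "w") as f:
    json.dump(config, f, indent=2)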
@Raibows thanks for your helpful Python script! May I ask another question? I want to use Ollama with a fine-tuned Phi-3 model (trained with QLoRA). I have successfully converted the LoRA weights into a GGML file (using llama.cpp), but I think I need to merge the qkv_proj layer weights back so that I can use it with Ollama, because right now I just get the error "Error: llama runner process has terminated: signal: abort trap error: failed to apply lora adapter". I would be grateful for any suggestions!
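For completeness, the merge direction is the same idea run backwards. A hedged sketch only: it assumes a PEFT adapter saved as adapter_model.safetensors whose q/k/v and gate/up modules share an identical lora_A (true for adapters produced by a split like the one above, but not for adapters trained with independent per-projection LoRAs, which cannot be merged this way without changing the rank), and the HF Phi-3 row order of [q, k, v] and [gate, up]. It does not cover the GGML/llama.cpp conversion itself, and the paths are placeholders.

# Sketch: merge split LoRA modules back into qkv_proj / gate_up_proj.
import os

import torch
from safetensors.torch import load_file, save_file

SRC = "data/loras/split_adapter"    # placeholder
DST = "data/loras/merged_adapter"   # placeholder

MERGES = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

def find_owner(name):
    # Return (merged_name, parts, index) if `name` belongs to a split module.
    for merged, parts in MERGES.items():
        for i, part in enumerate(parts):
            if f".{part}." in name:
                return merged, parts, i
    return None

os.makedirs(DST, exist_ok=True)
state = load_file(os.path.join(SRC, "adapter_model.safetensors"))
new_state = {}

for name, tensor in state.items():
    owner = find_owner(name)
    if owner is None:
        new_state[name] = tensor                     # o_proj, down_proj, ...
        continue
    merged, parts, i = owner
    if i != 0:
        continue                                     # handled together via parts[0]
    if ".lora_A." in name:
        # lora_A is shared, so a single copy serves the merged module.
        new_state[name.replace(parts[0], merged)] = tensor.clone()
    elif ".lora_B." in name:
        # Stack the per-projection lora_B blocks row-wise in merged order.
        rows = [state[name.replace(parts[0], p)] for p in parts]
        new_state[name.replace(parts[0], merged)] = torch.cat(rows, dim=0)

save_file(new_state, os.path.join(DST, "adapter_model.safetensors"))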
@Raibows thanks for the script! It worked like a charm!!!
ERROR 05-20 08:02:25 async_llm_engine.py:43] ValueError: While loading /data/llm_resume_profiles_phi3_v1_split, expected target modules in ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj', 'embed_tokens', 'lm_head'] but received ['gate_up_proj']. Please verify that the loaded LoRA module is correct
Can we also fix gate_up_proj in a similar way? I am using the Phi-3 128k version.
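For reference, the same row-wise split of lora_B also covers gate_up_proj; lora_A is copied unchanged to both gate_proj and up_proj. A minimal fragment, assuming Phi-3-mini's intermediate size of 8192 (adjust for other variants); the function name is illustrative:

import torch

def split_gate_up_lora_b(lora_b: torch.Tensor, intermediate_size: int = 8192):
    # lora_b has shape (2 * intermediate_size, r); the first half is gate_proj,
    # the second half is up_proj, matching the HF Phi-3 gate_up_proj layout.
    gate_b, up_b = torch.split(lora_b, [intermediate_size, intermediate_size], dim=0)
    return gate_b.clone(), up_b.clone()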