Unable to load LoRA fine-tuned LLM from HF (AssertionError)

Open oscar-martin opened this issue 11 months ago • 5 comments

Following the docs about Using LoRA Adapters, I am running into an assertion error. My code:

from huggingface_hub import snapshot_download
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

lora_path = snapshot_download(repo_id="<my repo id>")

llm = LLM(
        model="mistralai/Mistral-7B-v0.1",
        tokenizer="<my tokenizer>",
        enable_lora=True)

sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
    stop=["<|endcontext|>"]
)

prompts = [
    "<|begincontext|><|user|>I'm hungry. Find places to eat please.<|system|>Sure thing. Which city would you like to eat in?<|user|>Let's go with Foster City please.<|system|>Sure. What kind of food are you hungry for?<|user|>Spicy Indian sound really good.<|system|>One moment. I found a great restaurant called Pastries N Chaat in Foster City.<|user|>Give me other suggestions as well<|system|>How about, Tabla Indian Restaurant in Foster City?<|user|>Can you find out if they are average priced?<|system|>sure. The price range would be inexpensive.<|user|>Perfect. That works<|system|>Should I reserve for you?<|beginlastuserutterance|>Yes, go ahead and do that.<|endlastuserutterance|><|endcontext|>"
]

outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("lora_adapter", 1, lora_path)
)
print(outputs)

The error:

...
INFO 03-14 11:33:38 model_runner.py:756] Graph capturing finished in 7 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/vllm/.venv/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 139, in _load_lora
    lora = self._lora_model_cls.from_local_checkpoint(
  File "/home/ubuntu/vllm/.venv/lib/python3.10/site-packages/vllm/lora/models.py", line 227, in from_local_checkpoint
    return cls.from_lora_tensors(
  File "/home/ubuntu/vllm/.venv/lib/python3.10/site-packages/vllm/lora/models.py", line 148, in from_lora_tensors
    module_name, is_lora_a = parse_fine_tuned_lora_name(tensor_name)
  File "/home/ubuntu/vllm/.venv/lib/python3.10/site-packages/vllm/lora/utils.py", line 33, in parse_fine_tuned_lora_name
    assert parts[-2] == "lora_A" or parts[-2] == "lora_B"
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/vllm/vllm-lora-check.py", line 22, in <module>
    outputs = llm.generate(
  File "/home/ubuntu/vllm/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 182, in generate
    return self._run_engine(use_tqdm)
  File "/home/ubuntu/vllm/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 208, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/home/ubuntu/vllm/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 838, in step
    all_outputs = self._run_workers(
  File "/home/ubuntu/vllm/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1041, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/home/ubuntu/vllm/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/vllm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 223, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/home/ubuntu/vllm/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/vllm/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 574, in execute_model
    self.set_active_loras(lora_requests, lora_mapping)
  File "/home/ubuntu/vllm/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 660, in set_active_loras
    self.lora_manager.set_active_loras(lora_requests, lora_mapping)
  File "/home/ubuntu/vllm/.venv/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 112, in set_active_loras
    self._apply_loras(lora_requests)
  File "/home/ubuntu/vllm/.venv/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 224, in _apply_loras
    self.add_lora(lora)
  File "/home/ubuntu/vllm/.venv/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 231, in add_lora
    lora = self._load_lora(lora_request)
  File "/home/ubuntu/vllm/.venv/lib/python3.10/site-packages/vllm/lora/worker_manager.py", line 150, in _load_lora
    raise RuntimeError(
RuntimeError: Loading lora /home/ubuntu/.cache/huggingface/hub/models--bla-bla/snapshots/316a6e3610eedf49d5cb04b4670942f425401ee9 failed

By comparing my adapter with the model used in the aforementioned documentation, I realized mine exports a couple of tensors (found in the adapter_model.safetensors file) that the vLLM code does not expect to be there, namely:

  • base_model.model.lm_head.base_layer.weight, and
  • base_model.model.model.embed_tokens.base_layer.weight.

That parsing code (parse_fine_tuned_lora_name in vllm/lora/utils.py, shown in the traceback) crashes on any weight tensor whose name does not follow the LoRA naming scheme, since it only inspects the tensor name.

In the model used for the documentation, all tensors contain 'lora' in their names.
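
For anyone who wants to check their own adapter, here is a minimal inspection sketch (the file path is a placeholder) that lists the tensor names and applies roughly the same name check as the assert in the traceback:

import safetensors.torch

# placeholder: point this at your adapter_model.safetensors
adapter_file = "adapter_model.safetensors"

tensors = safetensors.torch.load_file(adapter_file)
for name in tensors:
    parts = name.split(".")
    # rough approximation of the check in vllm/lora/utils.py:
    # the second-to-last part must be "lora_A" or "lora_B"
    ok = len(parts) >= 2 and parts[-2] in ("lora_A", "lora_B")
    print(("OK   " if ok else "FAIL ") + name)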

I am pretty new to this and followed this fine-tuning guide.

The question is: how can I "fix" this issue? Is the problem related to the fine-tuning guide, perhaps because the LoRAConfig is not correct or because of the way the model is persisted? Or is it instead related to vLLM?

Thanks!

oscar-martin avatar Mar 14 '24 12:03 oscar-martin

I also encountered this problem. The tensor name looks like this: [screenshot attached]

geknow avatar Mar 18 '24 08:03 geknow

I also encountered the same error. This happens because (https://github.com/vllm-project/vllm/issues/2816) the peft library saves the base embedding layers as well when save() is called - https://github.com/huggingface/peft/blob/8dd45b75d7eabe7ee94ecb6a19d552f2aa5e98c6/src/peft/utils/save_and_load.py#L175. vLLM apparently does not support this. If you are not training with new special tokens and your base embeddings have not been updated, you can simply remove the base layer weights. I used the following code:

import safetensors.torch

# path to the adapter_model.safetensors file inside your adapter directory
lora_path = 'YOUR_ADAPTER_PATH'
tensors = safetensors.torch.load_file(lora_path)

# collect every tensor whose name does not contain "lora"
nonlora_keys = []
for k in list(tensors.keys()):
    if "lora" not in k:
        nonlora_keys.append(k)

print(nonlora_keys)  # just take a look at what they are

# drop the non-LoRA tensors and write the cleaned adapter file
for k in nonlora_keys:
    del tensors[k]

safetensors.torch.save_file(tensors, 'NEW_ADAPTER_PATH')
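
If it helps, once the cleaned file is written back as adapter_model.safetensors into a directory that still contains the original adapter_config.json, that directory can be passed to LoRARequest just like in the issue description (the directory name below is a placeholder):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# "my-adapter-clean/" is assumed to hold adapter_config.json plus the cleaned adapter_model.safetensors
llm = LLM(model="mistralai/Mistral-7B-v0.1", enable_lora=True)
outputs = llm.generate(
    ["<your prompt>"],
    SamplingParams(temperature=0, max_tokens=256),
    lora_request=LoRARequest("lora_adapter", 1, "my-adapter-clean"),
)
print(outputs)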

sagar-deepscribe avatar Mar 19 '24 15:03 sagar-deepscribe

Thanks @sagar-deepscribe! In my case, I need new special tokens but this is good stuff for me to learn.
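
For the new-special-tokens case, one workaround that is sometimes suggested (not taken from this thread, so treat it as an untested sketch) is to merge the adapter into the base model with PEFT and serve the merged checkpoint without enable_lora, so the updated embeddings travel with the base weights:

# untested sketch: merge the LoRA adapter into the base model so the resized
# embeddings become ordinary base weights that vLLM can load directly
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("<my tokenizer>")  # tokenizer with the added special tokens

# grow the embedding table to match the tokenizer before loading the adapter
base.resize_token_embeddings(len(tokenizer))

model = PeftModel.from_pretrained(base, "<my repo id>")  # same adapter as in the issue description
merged = model.merge_and_unload()  # fold the LoRA deltas into the base weights

merged.save_pretrained("mistral-7b-merged")
tokenizer.save_pretrained("mistral-7b-merged")
# then serve with LLM(model="mistral-7b-merged"), no enable_lora needed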

oscar-martin avatar Mar 20 '24 06:03 oscar-martin

> (quoting @sagar-deepscribe's workaround and code above)

The adapter path is usually the file named "adapter_model.safetensors" inside the adapter directory.

tsvisab avatar Mar 31 '24 19:03 tsvisab

Thanks, @sagar-deepscribe - that was helpful.

Here is your code slightly edited and w/ copy-n-paste instructions to run:

cat << EOT > vllm-lora-convert.py
import sys
import safetensors.torch

src, dst = sys.argv[-2:]

tensors = safetensors.torch.load_file(f"{src}/adapter_model.safetensors")

non_lora_keys = [k for k in tensors.keys() if "lora" not in k]

print("splitting non-lora keys into a separate file")
print("non-lora keys: ", non_lora_keys)

# move the non-lora tensors out of the adapter dict before saving
non_lora_tensors = {k: tensors.pop(k) for k in non_lora_keys}

print("lora keys: ", list(tensors.keys()))

safetensors.torch.save_file(tensors, f"{dst}/adapter_model.safetensors")
safetensors.torch.save_file(non_lora_tensors, f"{dst}/rest.safetensors")
EOT
dir=unwrapped_model # edit to the dir with lora weights and config files
cp -r $dir $dir-vllm
python vllm-lora-convert.py $dir $dir-vllm
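
As a possible follow-up (not part of the original comment), a quick sanity check on the converted directory could re-open the file and confirm only LoRA tensors remain:

# hypothetical check: confirm the converted adapter contains only LoRA tensors
import safetensors.torch

tensors = safetensors.torch.load_file("unwrapped_model-vllm/adapter_model.safetensors")
leftover = [k for k in tensors if "lora" not in k]
assert not leftover, f"non-lora tensors still present: {leftover}"
print(f"OK: {len(tensors)} LoRA tensors")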

stas00 avatar May 03 '24 05:05 stas00