Support LoRA adapter
Hi guys,
We found that inference with vLLM can greatly improve performance! But we need to use LoRA (peft) at inference time.
We also found that the community has a strong demand for LoRA: https://github.com/vllm-project/vllm/issues/182
After reading vLLM's model implementations, we found they differ in some ways from Hugging Face transformers, so we cannot directly use peft to add LoRA on top of vLLM.
So we added an extra module that loads LoRA weights into the qkv projections. Here is a usage example:
from vllm import LLM, SamplingParams
from vllm.model_executor.adapters import lora
# Create an LLM.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.05)
# Add LoRA adapter
lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, "edbeeching/opt-125m-imdb-lora")
prompts = [
"Hello, my name is",
"The capital of France is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0, top_k=-1)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Currently this only supports LoRA models whose target modules are ["q_proj", "v_proj"], such as OPT and LLaMA. A compatible adapter can be produced with peft, as in the sketch below.
Using LoRA together with tensor parallelism is not yet supported.
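For reference, a sketch (not part of this patch; hyperparameters are placeholders) of a peft setup that produces an adapter this patch can load, since only q_proj/v_proj may be targeted:
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Target only q_proj/v_proj so the saved adapter matches what this patch supports.
base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
config = LoraConfig(
    r=8,              # placeholder rank
    lora_alpha=16,    # placeholder scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type=TaskType.CAUSAL_LM,
)
peft_model = get_peft_model(base, config)
# ... fine-tune peft_model ...
peft_model.save_pretrained("opt-125m-lora")  # writes adapter weights + adapter_config.json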
There is no module named 'vllm.model_executor.adapters'
@FarziBuilder The code is not part of this repo; it is in a different fork.
@Saiteja-Tallam-Infrrd So I need to git clone and pip install from that fork. Which fork is this code in?
@Saiteja-Tallam-Infrrd what fork are you referring to? I pip installed the troph-team:support_peft fork on the support_peft branch and got the same error as @FarziBuilder when trying to run from vllm.model_executor.adapters import lora : There is no module named 'vllm.model_executor.adapters'
@efraisse I installed from the mentioned fork and I was able to use it.
@Saiteja-Tallam-Infrrd I think I made a mistake while cloning the repo. I was able to get it to work as well.
Hey, I see that this only works for q/v LoRAs. However, most QLoRA fine-tunes target all of the k, q, v, o, up and down projection layers for the LLaMA architecture. Is there a way to get all of them to work?
Do you have to pull down the Llama 2 commits and merge them in the meantime to work with Llama 2 models?
It seems that this works only for single-GPU inference and does not support tensor parallelism. Could it be supported in the future, or is there a quick way to make it work with Ray?
Thank you very much for your excellent work! It really helps.
There is an error on my side:
File "/projectnb/pnn/test_2/IntuitLLMProject/lib/data_manager.py", line 135, in d_eval_g_data_loader
lora.LoRAModel.from_pretrained(pipe.llm_engine.workers[0].model, g_saver_dir + '/adapter')
File "/projectnb/pnn/test_2/vllm/vllm/model_executor/adapters/lora.py", line 62, in from_pretrained
cls.load_adapter(layers, config)
File "/projectnb/pnn/test_2/vllm/vllm/model_executor/adapters/lora.py", line 70, in load_adapter
new_model = VllmLoRA(
File "/projectnb/pnn/test_2/vllm/vllm/model_executor/adapters/lora.py", line 34, in __init__
self.active_adapter = adapter_name
File "/usr4/ec523/brucejia/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1754, in __setattr__
super().__setattr__(name, value)
AttributeError: can't set attribute
My adapter_config.json file is as follows:
{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "meta-llama/Llama-2-7b-hf",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layers_pattern": null,
  "layers_to_transform": null,
  "lora_alpha": 16,
  "lora_dropout": 0.1,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 64,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "v_proj",
    "q_proj"
  ],
  "task_type": "CAUSAL_LM"
}
Thank you very much in advance!
Please note that this problem can be solved by commenting out this line:
# self.active_adapter = adapter_name
And thank you very much again for your excellent work! @mymusise
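For context, here is a minimal standalone sketch (a hypothetical class, not vLLM or peft code) of why the assignment fails: newer peft releases appear to expose active_adapter as a read-only property, so plain attribute assignment raises exactly this error.
class AdapterLayer:
    def __init__(self, adapter_name):
        self._active_adapter = adapter_name

    @property
    def active_adapter(self):
        # Read-only: no setter is defined.
        return self._active_adapter

layer = AdapterLayer("default")
try:
    layer.active_adapter = "adapter"  # same failure mode as the traceback above
except AttributeError as err:
    print(err)  # "can't set attribute" on Python 3.10 and earlier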
To install the fork mentioned above:
git clone --branch support_peft https://github.com/troph-team/vllm.git
Note that for anyone else watching this issue who missed the news, there's an active PR into vLLM to add most of the tricks from the S-LoRA paper, which is a very elegant way of serving up to thousands of LoRAs simultaneously! https://github.com/vllm-project/vllm/pull/1804
Wow, in what cases do we have to serve thousands of LoRAs?
@corbt That sounds great! Thank you so much for the update!
Great work! I'm waiting for this feature. When will this PR be merged?
mark
mark. Is there any merged PR for LoRA that supports target modules including linear layers (o_proj, lm_head, etc.)?
Traceback (most recent call last):
  File "../lora_inference.py", line 62, in <module>
    lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, "....")
  File ".../python3.10/site-packages/vllm/model_executor/adapters/lora.py", line 62, in from_pretrained
    cls.load_adapter(layers, config)
  File ".../python3.10/site-packages/vllm/model_executor/adapters/lora.py", line 70, in load_adapter
    new_model = VllmLoRA(
  File "../python3.10/site-packages/vllm/model_executor/adapters/lora.py", line 27, in __init__
    ColumnParallelLinear.__init__(self, input_size, output_size, *args, **kwargs)
  File ".../python3.10/site-packages/vllm/model_executor/parallel_utils/tensor_parallel/layers.py", line 272, in __init__
    self.weight = Parameter(torch.empty(
  File "../python3.10/site-packages/torch/nn/modules/module.py", line 1712, in __setattr__
    self.register_parameter(name, value)
  File "../python3.10/site-packages/torch/nn/modules/module.py", line 577, in register_parameter
    elif hasattr(self, name) and name not in self._parameters:
  File "../python3.10/site-packages/peft/tuners/tuners_utils.py", line 355, in weight
    weight = base_layer.weight
  File "../python3.10/site-packages/peft/tuners/tuners_utils.py", line 355, in weight
    weight = base_layer.weight
  File "../python3.10/site-packages/peft/tuners/tuners_utils.py", line 355, in weight
    weight = base_layer.weight
  [Previous line repeated 984 more times]
  File "../python3.10/site-packages/peft/tuners/tuners_utils.py", line 349, in weight
    base_layer = self.get_base_layer()
  File "../python3.10/site-packages/peft/tuners/tuners_utils.py", line 338, in get_base_layer
    while hasattr(base_layer, "base_layer"):
  File "../python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
RecursionError: maximum recursion depth exceeded while calling a Python object
Does anyone have the same error as me?
Please check my solution: https://github.com/SuperBruceJia/vllm
git clone --branch support_peft https://github.com/SuperBruceJia/vllm.git
cd vllm
pip install -e . --user
Special Notice:
- Only supports target_modules=["q_proj", "k_proj", "v_proj"]
- Only supports single-GPU inference
Please let me know if you have any questions!
Best regards,
Shuyue Dec. 30th, 2023
@mymusise Thank you for your code. I can now load the LoRA generated from fine-tuning the Llama-2-7b-chat-hf bin model. I've noticed that it performs quite consistently on generative tasks, but on detailed inference tasks its outputs are unstable and the error rate is relatively high. If I don't load it through vLLM, I can consistently infer the correct content. Have you encountered this situation before?
@SuperBruceJia, does your solution accommodate the ChatGLM2 model? If I intend to use ChatGLM2, which code should I modify? I presume I need to add a MODEL_LAYER_MAPPING entry in mapping.py, yet the layer names differ from those of Llama, and it seems that the code does not adapt to that structure. I greatly appreciate your assistance.
I think you could, but you need a LoRA adapter for the ChatGLM2 model.
First, add a LoRA adapter to your base ChatGLM2 model:
from peft import LoraConfig, TaskType
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
)

# ChatGLM2 ships custom modeling code, so trust_remote_code is required.
model = AutoModelForCausalLM.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

# LoRA hyperparameters (placeholders; set your own values).
lora_r = 16
lora_alpha = 32
lora_dropout = 0.1

lora_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    bias="none",
    # Note: ChatGLM2 names its attention layers differently from LLaMA (it uses a
    # fused qkv projection), so adjust target_modules to match your model's modules.
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
    ],
    task_type=TaskType.CAUSAL_LM,
)
model.add_adapter(lora_config, adapter_name="adapter")
model.enable_adapters()
After attaching the adapter (and perhaps going through several rounds of training), save it to a folder in your local directory:
trainer.train() # Train the adapter
trainer.model.save_pretrained(save_path) # Only the adapter will be saved.
Afterwards, you can load the base model + adapter using vLLM:
from vllm import LLM, SamplingParams
from vllm.model_executor.adapters import lora
# Create an LLM.
llm = LLM(model="THUDM/chatglm2-6b", gpu_memory_utilization=0.85)
# Add LoRA adapter
lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, "save_path")  # replace "save_path" with the directory the adapter was saved to
prompts = [
"Hello, my name is",
"The capital of France is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0, top_k=-1)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
If you have any further questions, please let me know.
Best regards,
Shuyue Jan. 16th, 2024
Please take a look at the fine-tuning code for the LLaMA 2 (7B) model:
Main execution file: https://github.com/SuperBruceJia/MetaMath-Fine-Tune-with-LoRA/blob/main/main.py
Model loader: https://github.com/SuperBruceJia/MetaMath-Fine-Tune-with-LoRA/blob/main/lib/model_loader.py#L98-L108
Evaluation: https://github.com/SuperBruceJia/MetaMath-Fine-Tune-with-LoRA/blob/main/lib/evaluation.py#L129-L137
Best regards,
Shuyue Jan. 16th, 2024
@SuperBruceJia Hello, I can now load the LoRA generated from fine-tuning the Llama-2-7b-chat-hf bin model. I've noticed that it performs quite consistently on generative tasks, but on detailed inference tasks its outputs are unstable and the error rate is relatively high. If I don't load it through vLLM, I can consistently infer the correct content. Have you encountered this situation before?
During inference with the fine-tuned pre-trained model, the model's generations were worse. However, performance was pretty good under the setting of a fixed pre-trained model plus a trained LoRA adapter.
Like this:
llama_path = "YOUR_LLAMA_MODEL_PATH" # The original pre-trained model is not fine-tuned
adapter_path = "YOUR_SAVED_ADAPTER_PATH" # Only the LoRA adapter is fine-tuned
llm = LLM(model=llama_path, tensor_parallel_size=1, gpu_memory_utilization=0.85)
lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, adapter_path)
Please inform me if you have found a solution.
Best regards,
Shuyue Jan. 22nd, 2024
@SuperBruceJia Thank you for your code. This way makes the responses better, but they are still worse than when running without vLLM.
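One workaround worth trying (a sketch, not something verified in this thread; the model name and paths are placeholders, and it assumes the adapter was trained with peft) is to merge the LoRA weights into the base model offline and serve the merged checkpoint with stock vLLM, bypassing the custom adapter loading path:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
merged = PeftModel.from_pretrained(base, "YOUR_SAVED_ADAPTER_PATH").merge_and_unload()

# Save the merged weights and tokenizer, then point vLLM at this directory.
merged.save_pretrained("llama-2-7b-chat-merged")
AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf").save_pretrained("llama-2-7b-chat-merged")

# llm = LLM(model="llama-2-7b-chat-merged")
If the merged model matches the non-vLLM outputs, the discrepancy is likely in the adapter injection code rather than in vLLM itself.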
Solved by #1804
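For anyone landing here later: the multi-LoRA support merged from #1804 is exposed through vLLM's own API. A rough sketch (argument names may differ across versions, so check the docs for your release; the adapter path is a placeholder):
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
sampling_params = SamplingParams(temperature=0, max_tokens=64)

# The integer ID must be unique per adapter within this engine.
outputs = llm.generate(
    ["The capital of France is"],
    sampling_params,
    lora_request=LoRARequest("my_adapter", 1, "YOUR_SAVED_ADAPTER_PATH"),
)
print(outputs[0].outputs[0].text)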
Hi @SuperBruceJia, thank you for providing the LoRA support code. I tried to install from source but received an error related to pyproject. Do you have any idea how to fix this?
Sorry, I haven't run into this issue yet.
It seems that the issue is related to the version of CUDA being used, as described here.