Support LoRA adapter
Hi guys,
We found that inference with vLLM can greatly improve performance! But we need to use LoRA (peft) at inference time.
We also found that the community has a strong demand for LoRA: https://github.com/vllm-project/vllm/issues/182
After reading vLLM's model implementations, we found they differ in some ways from Hugging Face transformers, so we cannot directly use peft to add LoRA on top of vLLM.
So we added an extra module that loads LoRA weights into the qkv projections. Here is a usage example:
from vllm import LLM, SamplingParams
from vllm.model_executor.adapters import lora
# Create an LLM.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.05)
# Add LoRA adapter
lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, "edbeeching/opt-125m-imdb-lora")
prompts = [
"Hello, my name is",
"The capital of France is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0, top_k=-1)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Currently this only supports LoRA models whose target modules are ["q_proj", "v_proj"], such as OPT and LLaMA. A compatible adapter can be produced with peft, as in the sketch below.
Using LoRA together with tensor parallelism is not yet supported.
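For reference, a sketch (not part of this patch; hyperparameters are placeholders) of a peft setup that produces an adapter this patch can load, since only q_proj/v_proj may be targeted:
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Target only q_proj/v_proj so the saved adapter matches what this patch supports.
base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
config = LoraConfig(
    r=8,              # placeholder rank
    lora_alpha=16,    # placeholder scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type=TaskType.CAUSAL_LM,
)
peft_model = get_peft_model(base, config)
# ... fine-tune peft_model ...
peft_model.save_pretrained("opt-125m-lora")  # writes adapter weights + adapter_config.json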
There is no module named 'vllm.model_executor.adapters'
@FarziBuilder The code is not part of this repo; it is in a different fork.
@Saiteja-Tallam-Infrrd So I need to git clone and pip install from that fork. Which fork is this code in?
@Saiteja-Tallam-Infrrd what fork are you referring to? I pip installed the troph-team:support_peft fork on the support_peft branch and got the same error as @FarziBuilder when trying to run from vllm.model_executor.adapters import lora : There is no module named 'vllm.model_executor.adapters'
@efraisse I installed from the mentioned fork and I was able to use it.
@Saiteja-Tallam-Infrrd I think I made a mistake while cloning the repo. I was able to get it to work as well.
Hey, I see that this only works for q/v LoRAs. However, most QLoRA fine-tunes target all of the k, q, v, o, up and down projection layers for the LLaMA architecture. Is there a way to get all of them to work?
Do you have to pull down the Llama 2 commits and merge them in the meantime to work with Llama 2 models?
It seems that this works only for single-GPU inference and does not support tensor parallelism. Could it be supported in the future, or is there a quick way to make it work with Ray?
Thank you very much for your excellent work! It really helps.
There is an error on my side:
File "/projectnb/pnn/test_2/IntuitLLMProject/lib/data_manager.py", line 135, in d_eval_g_data_loader
lora.LoRAModel.from_pretrained(pipe.llm_engine.workers[0].model, g_saver_dir + '/adapter')
File "/projectnb/pnn/test_2/vllm/vllm/model_executor/adapters/lora.py", line 62, in from_pretrained
cls.load_adapter(layers, config)
File "/projectnb/pnn/test_2/vllm/vllm/model_executor/adapters/lora.py", line 70, in load_adapter
new_model = VllmLoRA(
File "/projectnb/pnn/test_2/vllm/vllm/model_executor/adapters/lora.py", line 34, in __init__
self.active_adapter = adapter_name
File "/usr4/ec523/brucejia/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1754, in __setattr__
super().__setattr__(name, value)
AttributeError: can't set attribute
My adapter_config.json file is as follows:
{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "meta-llama/Llama-2-7b-hf",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layers_pattern": null,
  "layers_to_transform": null,
  "lora_alpha": 16,
  "lora_dropout": 0.1,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 64,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "v_proj",
    "q_proj"
  ],
  "task_type": "CAUSAL_LM"
}
Thank you very much in advance!
Please note that this problem can be solved by commenting out this line:
# self.active_adapter = adapter_name
And thank you very much again for your excellent work! @mymusise
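For context, here is a minimal standalone sketch (a hypothetical class, not vLLM or peft code) of why the assignment fails: newer peft releases appear to expose active_adapter as a read-only property, so plain attribute assignment raises exactly this error.
class AdapterLayer:
    def __init__(self, adapter_name):
        self._active_adapter = adapter_name

    @property
    def active_adapter(self):
        # Read-only: no setter is defined.
        return self._active_adapter

layer = AdapterLayer("default")
try:
    layer.active_adapter = "adapter"  # same failure mode as the traceback above
except AttributeError as err:
    print(err)  # "can't set attribute" on Python 3.10 and earlier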
To install the fork mentioned above:
git clone --branch support_peft https://github.com/troph-team/vllm.git
Note that for anyone else watching this issue who missed the news, there's an active PR into vLLM to add most of the tricks from the S-LoRA paper, which is a very elegant way of serving up to thousands of LoRAs simultaneously! https://github.com/vllm-project/vllm/pull/1804
Wow, in what cases do we have to serve thousands of LoRAs?
@corbt That sounds great! Thank you so much for the update!
Great work! I'm waiting for this feature. When will this PR be merged?
mark
mark. Is there any merged PR for LoRA that supports target modules including linear layers (o_proj, lm_head, etc.)?
Traceback (most recent call last):
  File "../lora_inference.py", line 62, in <module>
    lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, "....")
  File ".../python3.10/site-packages/vllm/model_executor/adapters/lora.py", line 62, in from_pretrained
    cls.load_adapter(layers, config)
  File ".../python3.10/site-packages/vllm/model_executor/adapters/lora.py", line 70, in load_adapter
    new_model = VllmLoRA(
  File "../python3.10/site-packages/vllm/model_executor/adapters/lora.py", line 27, in __init__
    ColumnParallelLinear.__init__(self, input_size, output_size, *args, **kwargs)
  File ".../python3.10/site-packages/vllm/model_executor/parallel_utils/tensor_parallel/layers.py", line 272, in __init__
    self.weight = Parameter(torch.empty(
  File "../python3.10/site-packages/torch/nn/modules/module.py", line 1712, in __setattr__
    self.register_parameter(name, value)
  File "../python3.10/site-packages/torch/nn/modules/module.py", line 577, in register_parameter
    elif hasattr(self, name) and name not in self._parameters:
  File "../python3.10/site-packages/peft/tuners/tuners_utils.py", line 355, in weight
    weight = base_layer.weight
  File "../python3.10/site-packages/peft/tuners/tuners_utils.py", line 355, in weight
    weight = base_layer.weight
  File "../python3.10/site-packages/peft/tuners/tuners_utils.py", line 355, in weight
    weight = base_layer.weight
  [Previous line repeated 984 more times]
  File "../python3.10/site-packages/peft/tuners/tuners_utils.py", line 349, in weight
    base_layer = self.get_base_layer()
  File "../python3.10/site-packages/peft/tuners/tuners_utils.py", line 338, in get_base_layer
    while hasattr(base_layer, "base_layer"):
  File "../python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
RecursionError: maximum recursion depth exceeded while calling a Python object
Does anyone have the same error as me?
Please check my solution: https://github.com/SuperBruceJia/vllm
git clone --branch support_peft https://github.com/SuperBruceJia/vllm.git
cd vllm
pip install -e . --user
Special Notice:
- Only supports target_modules=["q_proj", "k_proj", "v_proj"]
- Only supports single-GPU inference
Please let me know if you have any questions!
Best regards,
Shuyue Dec. 30th, 2023
@mymusise Thank you for your code. I can now load the LoRA generated from fine-tuning the Llama-2-7b-chat-hf bin model. I've noticed that it performs quite consistently on generative tasks, but on detailed inference tasks its outputs are unstable and the error rate is relatively high. If I don't load it through vLLM, I can consistently infer the correct content. Have you encountered this situation before?
@SuperBruceJia, does your solution accommodate the ChatGLM2 model? If I intend to use ChatGLM2, which code should I modify? I presume I need to add a MODEL_LAYER_MAPPING entry in mapping.py, yet the layer names differ from those of Llama, and it seems that the code does not adapt to that structure. I greatly appreciate your assistance.
I think you could, but you need a LoRA adapter for the ChatGLM2 model.
First, add a LoRA adapter to your base ChatGLM2 model:
from peft import LoraConfig, TaskType
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
)

# ChatGLM2 ships custom modeling code, so trust_remote_code is required.
model = AutoModelForCausalLM.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

# LoRA hyperparameters (placeholders; set your own values).
lora_r = 16
lora_alpha = 32
lora_dropout = 0.1

lora_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    bias="none",
    # Note: ChatGLM2 names its attention layers differently from LLaMA (it uses a
    # fused qkv projection), so adjust target_modules to match your model's modules.
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
    ],
    task_type=TaskType.CAUSAL_LM,
)
model.add_adapter(lora_config, adapter_name="adapter")
model.enable_adapters()
After attaching the adapter (and perhaps going through several rounds of training), save it to a folder in your local directory:
trainer.train() # Train the adapter
trainer.model.save_pretrained(save_path) # Only the adapter will be saved.
Afterwards, you can load the base model + adapter using vLLM:
from vllm import LLM, SamplingParams
from vllm.model_executor.adapters import lora
# Create an LLM.
llm = LLM(model="THUDM/chatglm2-6b", gpu_memory_utilization=0.85)
# Add LoRA adapter
lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, "save_path")  # replace "save_path" with the directory the adapter was saved to
prompts = [
"Hello, my name is",
"The capital of France is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0, top_k=-1)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
If you have any further questions, please let me know.
Best regards,
Shuyue Jan. 16th, 2024
Please take a look at the fine-tuning code for the LLaMA 2 (7B) model:
Main execution file: https://github.com/SuperBruceJia/MetaMath-Fine-Tune-with-LoRA/blob/main/main.py
Model loader: https://github.com/SuperBruceJia/MetaMath-Fine-Tune-with-LoRA/blob/main/lib/model_loader.py#L98-L108
Evaluation: https://github.com/SuperBruceJia/MetaMath-Fine-Tune-with-LoRA/blob/main/lib/evaluation.py#L129-L137
Best regards,
Shuyue Jan. 16th, 2024
@SuperBruceJia Hello, I can now load the LoRA generated from fine-tuning the Llama-2-7b-chat-hf bin model. I've noticed that it performs quite consistently on generative tasks, but on detailed inference tasks its outputs are unstable and the error rate is relatively high. If I don't load it through vLLM, I can consistently infer the correct content. Have you encountered this situation before?
During inference with the fine-tuned pre-trained model, the model's generations were worse. However, performance was pretty good under the setting of a fixed pre-trained model plus a trained LoRA adapter.
Like this:
llama_path = "YOUR_LLAMA_MODEL_PATH" # The original pre-trained model is not fine-tuned
adapter_path = "YOUR_SAVED_ADAPTER_PATH" # Only the LoRA adapter is fine-tuned
llm = LLM(model=llama_path, tensor_parallel_size=1, gpu_memory_utilization=0.85)
lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, adapter_path)
Please inform me if you have found a solution.
Best regards,
Shuyue Jan. 22nd, 2024
@SuperBruceJia Thank you for your code. This way makes the responses better, but they are still worse than when running without vLLM.
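One workaround worth trying (a sketch, not something verified in this thread; the model name and paths are placeholders, and it assumes the adapter was trained with peft) is to merge the LoRA weights into the base model offline and serve the merged checkpoint with stock vLLM, bypassing the custom adapter loading path:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
merged = PeftModel.from_pretrained(base, "YOUR_SAVED_ADAPTER_PATH").merge_and_unload()

# Save the merged weights and tokenizer, then point vLLM at this directory.
merged.save_pretrained("llama-2-7b-chat-merged")
AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf").save_pretrained("llama-2-7b-chat-merged")

# llm = LLM(model="llama-2-7b-chat-merged")
If the merged model matches the non-vLLM outputs, the discrepancy is likely in the adapter injection code rather than in vLLM itself.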
Solved by #1804
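For anyone landing here later: the multi-LoRA support merged from #1804 is exposed through vLLM's own API. A rough sketch (argument names may differ across versions, so check the docs for your release; the adapter path is a placeholder):
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
sampling_params = SamplingParams(temperature=0, max_tokens=64)

# The integer ID must be unique per adapter within this engine.
outputs = llm.generate(
    ["The capital of France is"],
    sampling_params,
    lora_request=LoRARequest("my_adapter", 1, "YOUR_SAVED_ADAPTER_PATH"),
)
print(outputs[0].outputs[0].text)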
Hi @SuperBruceJia, thank you for providing the LoRA support code. I tried to install from source but received an error related to pyproject. Do you have any idea how to fix this?
Sorry, I haven't run into this issue yet.
It seems that the issue is related to the version of CUDA being used, as described here.