text-generation-inference
Curious about the plans for supporting PEFT and LoRA.
Feature request
I need to be able to apply a LoRA adapter to a local LLM.
Motivation
LoRA is a good tool for quickly iterating on and checking your current LLM tuning. I think it should be supported in the local model API.
Your contribution
...
plz~
Hey, do you know about https://huggingface.co/docs/peft/main/en/package_reference/tuners#peft.LoraModel.merge_and_unload
Basically, you could
model = model.merge_and_unload()
model.save_pretrained("mynewmergedmodel")
which will "write" the peft weights directly into the model, making it a regular transformer model, which text-generation-inference could support.
@younesbelkada in case I write something wrong.
It would be nice to add full-blown support, but that means reintegrating a lot of PEFT logic into TGI; this seems like the easier route for the time being.
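For reference, here is a minimal sketch of that workflow; the base model and adapter paths below are placeholders, not names from this thread:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the trained LoRA adapter (paths are placeholders).
base = AutoModelForCausalLM.from_pretrained("my-base-model")
model = PeftModel.from_pretrained(base, "my-lora-adapter")

# Fold the LoRA weights into the base weights, producing a regular transformers model,
# and save it so text-generation-inference can serve it directly.
model = model.merge_and_unload()
model.save_pretrained("mynewmergedmodel")

# Save the tokenizer alongside the merged weights.
tokenizer = AutoTokenizer.from_pretrained("my-base-model")
tokenizer.save_pretrained("mynewmergedmodel")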
I second what @Narsil said; you can do that to merge the LoRA weights. We should indeed add more documentation on PEFT.
@Narsil
If I want to use it like this:
# model_name is request body data...
if model_name == "chat":
    with self.model.disable_adapter():
        model_output = self.model.generate(
            input_ids=input_ids.cuda(), generation_config=generation_config
        )[0]
else:
    self.model.set_adapter(model_name)
    model_output = self.model.generate(
        input_ids=input_ids.cuda(), generation_config=generation_config
    )[0]
it is impossible to do it that way with text-generation-inference, isn't it?
I've wondered about this as well. Is there a way to have plug-and-play fine-tuned adapters for specific tasks?
if task == "chat":
    self.model.set_adapter("chat_model")
elif task == "data_extraction":
    self.model.set_adapter("extraction_model")
elif task == "classification":
    self.model.set_adapter("classification_model")
elif task == "FAQ":
    self.model.set_adapter("faq_model")
else:
    self.model.disable_adapter()
@ravilkashyap You mean like that? Yes. I'm using PEFT on the Triton python_backend, both your way and my way, but you have to train each LoRA adapter first and name it accordingly.
But I don't think you can do that in text-generation-inference, because there is no way to receive the task in the request.
If you use the Triton python_backend, you can switch adapters by name in both the gRPC streaming service and the HTTP service.
I have already tested this.
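For illustration, here is a rough sketch of what such a Triton python_backend model.py could look like, assuming per-task adapter names; the tensor names, adapter paths, and generation settings are invented for this example and would need to match your own config.pbtxt and models:

import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

class TritonPythonModel:
    def initialize(self, args):
        # Placeholder paths: one base model plus several named LoRA adapters.
        base = AutoModelForCausalLM.from_pretrained("my-base-model").cuda()
        self.tokenizer = AutoTokenizer.from_pretrained("my-base-model")
        # Load one adapter first, then attach the others under their own names.
        self.model = PeftModel.from_pretrained(base, "lora/extraction_model", adapter_name="extraction_model")
        self.model.load_adapter("lora/classification_model", adapter_name="classification_model")
        # Maps a request task to the adapter trained for it.
        self.adapters = {"data_extraction": "extraction_model", "classification": "classification_model"}

    def execute(self, requests):
        responses = []
        for request in requests:
            # "task" and "prompt" are hypothetical BYTES inputs defined in config.pbtxt.
            task = pb_utils.get_input_tensor_by_name(request, "task").as_numpy().reshape(-1)[0].decode("utf-8")
            prompt = pb_utils.get_input_tensor_by_name(request, "prompt").as_numpy().reshape(-1)[0].decode("utf-8")
            input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.cuda()

            if task == "chat":
                # disable_adapter() is a context manager: generate with the bare base model.
                with self.model.disable_adapter():
                    output = self.model.generate(input_ids=input_ids, max_new_tokens=128)[0]
            else:
                # Switch to the adapter trained for this task before generating.
                self.model.set_adapter(self.adapters[task])
                output = self.model.generate(input_ids=input_ids, max_new_tokens=128)[0]

            text = self.tokenizer.decode(output, skip_special_tokens=True)
            out_tensor = pb_utils.Tensor("generated_text", np.array([text.encode("utf-8")], dtype=object))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses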
I have fine-tuned Falcon-7B and merged my LoRA with the pretrained model, but when I try to load it I get the following errors:
Torch: RuntimeError: weight transformer.word_embeddings.weight does not exist
Safetensors: RuntimeError: weight lm_head.weight does not exist, and indeed there is no lm_head field in the config.
Any clues on what I should do?
@PitchboyDev I don't think that belongs in this issue; it looks like a plain Hugging Face model problem, not a PEFT one, because your error message is not raised by PEFT.
How about moving your question to the Hugging Face forum or somewhere similar?
It's neither PEFT nor transformers. It's an error from this repo's code, because I can load and use the model with a transformers Gradio app. I found an issue about it: #541