
example on converting PEFT+INT8 trained model to ONNX for faster inference

Open · pacman100 opened this pull request 2 years ago · 8 comments

What does this PR do?

  1. Adds an example on converting a PEFT+INT8 trained model to ONNX for faster inference. The example shown is for the Whisper-large-v2 model.
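
For reference, a rough sketch of the overall flow in the example (not the notebook itself; the model IDs and save paths below are placeholders, and merge_and_unload() assumes a PEFT version that provides it):

import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration

# Load the base model and attach the trained adapter
# ("your-peft-adapter" is a placeholder for the trained PEFT+INT8 adapter path)
base = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2", torch_dtype=torch.float32
)
model = PeftModel.from_pretrained(base, "your-peft-adapter")

# Fold the LoRA weights into the base model and save a plain checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("merged-whisper-large-v2")

# Export the merged checkpoint to ONNX with Optimum (requires optimum[onnxruntime])
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

ort_model = ORTModelForSpeechSeq2Seq.from_pretrained(
    "merged-whisper-large-v2", export=True
)
ort_model.save_pretrained("whisper-large-v2-onnx")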

pacman100 avatar Feb 21 '23 19:02 pacman100


@pacman100 good to merge no?

sayakpaul avatar Mar 23 '23 09:03 sayakpaul

Hi!

Thanks for the PR. I was trying your code with a Bloomz-7b1 model, but it seems that model.eval() does not merge the LoRA weights into the base model weights (as the comment says). Indeed, the state dict of the Bloomz-7b1 model with LoRA weights still has the lora keys, and its self-attention weights are equal to those of the original Bloomz-7b1 model.

Moreover, diving into the model.eval() method, I can see it sets the LoRA weights to evaluation mode, but I cannot find where the merge is implemented (which, as I understand from the original paper, should be a matrix summation between the self-attention weights and the LoRA matrices).

Could you explain how to properly merge the LoRA weights and return a model with the default (without LoRA) state dict?

Thanks!

ccasimiro88 avatar Mar 24 '23 12:03 ccasimiro88

Hello, it does merge after PR #117. Could you provide a minimal example we can run if you are still encountering the issue with the latest PEFT v0.2.0?

After loading the model via PeftModel.from_pretrained(), you need to call model.eval(). When you call model.eval(), PyTorch internally calls train(mode=False) on all of the module's children; as such, the LoRA layer's eval() is never invoked and its train(mode=False) runs instead, which is why PR #117 moved the merge there.
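
Conceptually, the merge performed in train(mode=False) is just the weight update from the LoRA paper. A minimal sketch (simplified; the actual PEFT implementation additionally handles per-layer scaling and fan_in_fan_out):

import torch

def merge_lora(weight, lora_A, lora_B, scaling):
    # lora_A: (r, in_features), lora_B: (out_features, r)
    # W' = W + (B @ A) * scaling folds the low-rank update into the base weight
    return weight + (lora_B @ lora_A) * scaling

# Tiny check: with lora_B initialized to zeros (as in LoRA), merging is a no-op
W = torch.randn(16, 32)
A = torch.randn(4, 32)
B = torch.zeros(16, 4)
assert torch.allclose(merge_lora(W, A, B, scaling=0.5), W)

To get back a plain model whose state dict has no lora keys, newer PEFT versions also expose merge_and_unload(), which performs this merge and then strips the LoRA wrappers.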

pacman100 avatar Mar 24 '23 13:03 pacman100

Hi again!

I tried to follow your instructions but when calling the model.eval() method I got the following error:

RuntimeError: Expected 4-dimensional input for 4-dimensional weight [8192, 16, 1, 1], but got 3-dimensional input of size [1, 32, 4096] instead

Here's the snippet to reproduce the error:

import torch
from peft import PeftConfig, PeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer


peft_model_id = "mrm8488/Alpacoom"

peft_config = PeftConfig.from_pretrained(peft_model_id)


tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-7b1")

# Load the base model referenced by the adapter config on CPU in fp16
base_model = AutoModelForCausalLM.from_pretrained(
    peft_config.base_model_name_or_path,
    load_in_8bit=False,
    torch_dtype=torch.float16,
    device_map={"": "cpu"},
)

# Attach the LoRA adapter weights
lora_model = PeftModelForCausalLM.from_pretrained(
    base_model,
    peft_model_id,
)

lora_model.eval()

Would appreciate your answer!

ccasimiro88 avatar Mar 24 '23 16:03 ccasimiro88

Hello @ccasimiro88, I'm unable to reproduce the error using the latest main branch of PEFT; could you please try it and let us know?

Code:

import torch
from peft import PeftConfig, PeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig


peft_model_id = "mrm8488/Alpacoom"

peft_config = PeftConfig.from_pretrained(peft_model_id)


tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-7b1")

base_model = AutoModelForCausalLM.from_pretrained(
    peft_config.base_model_name_or_path,
    load_in_8bit=False,
    torch_dtype=torch.float16,
    device_map={"": "cpu"},
)


lora_model = PeftModelForCausalLM.from_pretrained(
    base_model,
    peft_model_id,
)


lora_model.eval()  # triggers the LoRA merge (post PR #117)
lora_model.half()

# Based on the inference code by `tloen/alpaca-lora`
def generate_prompt(instruction, input=None):
    if input:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:"""

def generate(
        instruction,
        input=None,
        temperature=0.1,
        top_p=0.75,
        top_k=40,
        num_beams=4,
        **kwargs,
):
    prompt = generate_prompt(instruction, input)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to(lora_model.device)  # model was loaded on CPU via device_map above
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
        **kwargs,
    )
    with torch.no_grad():
        generation_output = lora_model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=256,
        )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s)
    return output.split("### Response:")[1].strip().split("Below")[0]

instruction = "Tell me about alpacas"

print("Instruction:", instruction)
print("Response:", generate(instruction))

Output:

Instruction: Tell me about alpacas
Response: Alpacas are a type of llama-like animal native to the Andes Mountains of South America. They are known for their long, fluffy coats, which are used for clothing, bedding, and other items. Alpacas are also used for their wool, which is soft and warm.
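
As a further sanity check on the merge itself (addressing the earlier state-dict observation), one could snapshot a wrapped weight before calling eval() and compare it afterwards. This is a hypothetical check, with module paths assumed for the Bloom architecture and LoRA targeting query_key_value:

import copy

# Run this in place of the plain lora_model.eval() above
layer = lora_model.base_model.model.transformer.h[0].self_attention.query_key_value
before = copy.deepcopy(layer.weight.data)
lora_model.eval()  # train(mode=False) folds the LoRA deltas into the base weight
print(torch.equal(before, layer.weight.data))  # expect False once the merge ran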

pacman100 avatar Mar 24 '23 19:03 pacman100

Hi @pacman100, I installed the repo from source (main branch) and now it works. Many thanks!

ccasimiro88 avatar Mar 27 '23 10:03 ccasimiro88

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] avatar Apr 20 '23 15:04 github-actions[bot]