alpaca-lora
anyone tried batch inference?
when I set pad token 0 and padding=True,
the generated text for the padded prompt always shows
padding_side="left" does the trick
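For reference, a minimal sketch of that setup for batched generation with a decoder-only model (the checkpoint name is a placeholder, and pad token 0 is taken from the comment above, not a confirmed recommendation):

from transformers import LlamaTokenizer

BASE_MODEL = "decapoda-research/llama-7b-hf"  # placeholder checkpoint

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
# Decoder-only models continue from the rightmost tokens, so pad on the left
# to keep every prompt flush against the generation side.
tokenizer.padding_side = "left"
# LLaMA ships without a pad token; id 0 is what the comment above used.
tokenizer.pad_token_id = 0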
I am getting the following error when trying batched inference. Did you need any trick?
../aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [4,0,0], thread: [44,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
@benob You could do something like below
def evaluate(instructions, input=None):
    prompts = [generate_prompt(instruction) for instruction in instructions]
    encodings = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
    # input_ids = inputs["input_ids"].cuda()
    generation_outputs = model.generate(
        **encodings,
        generation_config=generation_config,
        max_new_tokens=256,
    )
    return tokenizer.batch_decode(generation_outputs)
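A hypothetical call site, assuming generate_prompt and generation_config are defined as in the repo's generate.py (the "### Response:" split follows the standard Alpaca prompt template and is my assumption):

instructions = [
    "Tell me about alpacas.",
    "Write a haiku about GPUs.",
]
outputs = evaluate(instructions)
# Each decoded string still contains the prompt; everything after
# "### Response:" is the model's answer.
responses = [o.split("### Response:")[-1].strip() for o in outputs]
for r in responses:
    print(r)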
I created a Gradio app for this:
https://github.com/deep-diver/Alpaca-LoRA-Serve
Thanks, the problem came from elsewhere. Note that I had to use
tokenizer = LlamaTokenizer.from_pretrained(config.backbone, padding_side='left')
Hello, I have tried batch decoding, and I set
tokenizer.padding_side = "left"
tokenizer.pad_token_id = tokenizer.bos_token_id
But the inference results are quite different from batch_size=1: when batch_size > 1, the end of the output shows a sequence of "????", e.g. Hello. ?? ?? ??
Do you have any idea about this problem?
the end of the output shows a sequence of "????", e.g. Hello. ?? ?? ??
I had the same problem when using batch decoding and beam search with multiple beams.
I ended up filtering out those (fake) question marks by calling .replace("\u2047", "").strip() on the outputs.
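Those characters are U+2047 (⁇), which the SentencePiece tokenizer emits when it decodes unknown or pad ids. A small post-processing sketch combining the replace() above with skip_special_tokens (the latter is my addition, not a confirmed fix):

def clean_batch_outputs(generation_outputs, tokenizer):
    # skip_special_tokens drops pad/bos/eos; replace() handles the U+2047
    # "double question mark" left over from unknown ids.
    texts = tokenizer.batch_decode(generation_outputs, skip_special_tokens=True)
    return [t.replace("\u2047", "").strip() for t in texts]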
Hello @deep-diver, I tried batch decoding with your settings, and it helps performance a lot. But I found a strange phenomenon: given four pieces of content, the results you get by generating them one at a time are different from the results you get by decoding them all in one batch. I asked a detailed question about this in the Hugging Face discussion area and will copy it here: https://discuss.huggingface.co/t/results-of-model-generate-are-different-for-different-batch-sizes-of-the-decode-only-model/34878 You can try this in a notebook:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import torch
from peft import PeftModel
import transformers
import gradio as gr
assert (
    "LlamaTokenizer" in transformers._import_structure["models.llama"]
), "LLaMA is now in HuggingFace's main branch.\nPlease reinstall it: pip uninstall transformers && pip install git+https://github.com/huggingface/transformers.git"
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf", cache_dir="./cache/")
BASE_MODEL = "decapoda-research/llama-7b-hf"
LORA_WEIGHTS = "tloen/alpaca-lora-7b"
if torch.cuda.is_available():
    device = "cuda"

model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL,
    # FINETURED_MODEL,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
    cache_dir="./cache/",
)
model = PeftModel.from_pretrained(model, LORA_WEIGHTS, torch_dtype=torch.float16)
model.eval()
if torch.__version__ >= "2":
    model = torch.compile(model)
inputs = tokenizer("prompt", return_tensors="pt")
input_ids = inputs["input_ids"].to(device)
input_ids,inputs
(tensor([[ 1, 9508]], device='cuda:0'), {'input_ids': tensor([[ 1, 9508]]), 'attention_mask': tensor([[1, 1]])})
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))
inputs_b = tokenizer(["prompt","prompt","prompt"], return_tensors="pt", padding=True).to(device)
input_idsb=inputs_b["input_ids"].to(device)
input_idsb,inputs_b
(tensor([[ 1, 9508], [ 1, 9508], [ 1, 9508]], device='cuda:0'), {'input_ids': tensor([[ 1, 9508], [ 1, 9508], [ 1, 9508]], device='cuda:0'), 'attention_mask': tensor([[1, 1], [1, 1], [1, 1]], device='cuda:0')})
generation_config = GenerationConfig(
    temperature=1,
    top_p=1,
    top_k=50,
    num_beams=1,
    max_new_tokens=128,
)
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
    )
generation_output
GreedySearchDecoderOnlyOutput(sequences=tensor([[ 1, 9508, 368, 322, 29497, 29889, 13, 1576, 6938, 4091, 451, 367, 619, 519, 304, 278, 21886, 363, 738, 6410, 470, 18658, 17654, 491, 278, 21886, 408, 263, 1121, 310, 738, 9055, 297, 278, 28289, 310, 278, 7197, 29879, 313, ... (output truncated)
s = generation_output.sequences[0]
output = tokenizer.decode(s)
output
" promptly and efficiently.\nThe Company shall not be liable to the Customer for any loss or damage suffered by the Customer as a result of any delay in the delivery of the Goods (even if caused by the Company's negligence) unless the Customer has given written notice to the Company of the delay within 7 days of the date when the Goods were due to be delivered.\nThe Company shall not be liable to the Customer for any loss or damage suffered by the Customer as a result of any delay in the delivery of the Goods (even if caused by the Company's negligence) unless the"
generation_config = GenerationConfig(
    temperature=1,
    top_p=1,
    top_k=50,
    num_beams=1,
    max_new_tokens=128,
)
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_idsb,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
    )
generation_output
GreedySearchDecoderOnlyOutput(sequences=tensor([[ 1, 9508, 368, 322, 29497, 29889, 13, 1576, 6938, 338, 19355, 304, 5662, 3864, 393, 727, 338, 694, 5400, 8370, 1201, 470, 5199, 1020, 600, 860, 292, 297, 967, 11421, 521, 2708, 470, 297, 738, 760, 310, 967, 5381, 29889, 450, 6938, 5936, 4637, 393, 372, 756, 263, 23134, 304, ... (output truncated)
s = generation_output.sequences
output = tokenizer.batch_decode(s, skip_special_tokens=True)
output
[' promptly and efficiently.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business. The Company recognises that it has a responsibility to be proactive in ensuring that modern slavery is not taking place within its business or in its supply chains.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in', ' promptly and efficiently.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business. The Company recognises that it has a responsibility to be proactive in ensuring that modern slavery is not taking place within its business or in its supply chains.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in', ' promptly and efficiently.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business. The Company recognises that it has a responsibility to be proactive in ensuring that modern slavery is not taking place within its business or in its supply chains.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in']
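One thing worth checking in the notebook above (an assumption on my part, not a confirmed explanation for the mismatch): the batched call passes only input_ids, so generate() has to guess the attention mask and pad token. Passing both explicitly, straight from the tokenizer output, removes one common source of divergence between batch sizes:

with torch.no_grad():
    generation_output = model.generate(
        input_ids=inputs_b["input_ids"],
        attention_mask=inputs_b["attention_mask"],  # mask out any pad positions
        pad_token_id=tokenizer.pad_token_id,        # avoid the pad-token guess
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
    )

Even with that, small numeric differences between batch sizes are possible with fp16/8-bit kernels, and greedy decoding can flip to a different continuation once the top two logits are close.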
I was using that a few days ago and it was working fine, but now when generating with batch_size > 1 I get this error:
File "/llama-training/lora_finetuning/lora_finetuning/inference.py", line 156, in __call__
generation_output = self.model.generate(
File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 627, in generate
outputs = self.base_model.generate(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1524, in generate
return self.beam_search(
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2810, in beam_search
outputs = self(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 289, in forward
hidden_states = self.input_layernorm(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 84, in forward
variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
RuntimeError: CUDA error: device-side assert triggered
Has anyone had the same error and knows how to fix it? (I suspect a version update in the peft or transformers library.)
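Not a fix, but a debugging sketch that usually narrows down device-side asserts like this: run with synchronous CUDA launches so the failing op is reported at the real call site, and sanity-check that every token id (especially a pad token added after loading the model) is smaller than the embedding size. Here encodings stands for whatever batch you pass to generate(); the names are placeholders from the snippets above.

import os
# Must be set before CUDA is initialized, i.e. restart the process first.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Hypothetical check: token ids outside the embedding matrix are a common
# cause of Indexing.cu asserts during generate().
embedding_rows = model.get_input_embeddings().weight.shape[0]
max_id = int(encodings["input_ids"].max())
assert max_id < embedding_rows, (
    f"max token id {max_id} >= embedding size {embedding_rows}; "
    "did the tokenizer gain a pad token without model.resize_token_embeddings()?"
)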
Marking this; I'm hitting the same issue on my side...