alpaca-lora
anyone tried batch inference?
when I set pad token 0 and padding=True,
the generated text for the padded prompt always shows
padding_side="left" does the trick
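For reference, a minimal sketch of that setup for batched generation with a decoder-only model (the checkpoint name is a placeholder, and pad token 0 is taken from the comment above, not a confirmed recommendation):

from transformers import LlamaTokenizer

BASE_MODEL = "decapoda-research/llama-7b-hf"  # placeholder checkpoint

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
# Decoder-only models continue from the rightmost tokens, so pad on the left
# to keep every prompt flush against the generation side.
tokenizer.padding_side = "left"
# LLaMA ships without a pad token; id 0 is what the comment above used.
tokenizer.pad_token_id = 0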
I am getting the following error when trying batched inference. Did you need any trick?
../aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [4,0,0], thread: [44,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
@benob You could do something like below
def evaluate(instructions, input=None):
    prompts = [generate_prompt(instruction) for instruction in instructions]
    encodings = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
    # input_ids = inputs["input_ids"].cuda()
    generation_outputs = model.generate(
        **encodings,
        generation_config=generation_config,
        max_new_tokens=256,
    )
    return tokenizer.batch_decode(generation_outputs)
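A hypothetical call site, assuming generate_prompt and generation_config are defined as in the repo's generate.py (the "### Response:" split follows the standard Alpaca prompt template and is my assumption):

instructions = [
    "Tell me about alpacas.",
    "Write a haiku about GPUs.",
]
outputs = evaluate(instructions)
# Each decoded string still contains the prompt; everything after
# "### Response:" is the model's answer.
responses = [o.split("### Response:")[-1].strip() for o in outputs]
for r in responses:
    print(r)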
I created a Gradio app for this:
https://github.com/deep-diver/Alpaca-LoRA-Serve
Thanks, the problem came from elsewhere. Note that I had to use
tokenizer = LlamaTokenizer.from_pretrained(config.backbone, padding_side='left')
Hello, I have tried batch decoding, and I set
tokenizer.padding_side = "left"
tokenizer.pad_token_id = tokenizer.bos_token_id
But the inference results are quite different from batch_size=1: when batch_size > 1, the end of the output shows a sequence of "????", e.g. Hello. ?? ?? ??
Do you have any idea about this problem?
the end of the output shows a sequence of "????", e.g. Hello. ?? ?? ??
I had the same problem when using batch decoding and beam search with multiple beams.
I ended up filtering out those (fake) question marks by calling .replace("\u2047", "").strip() on the outputs.
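Those characters are U+2047 (⁇), which the SentencePiece tokenizer emits when it decodes unknown or pad ids. A small post-processing sketch combining the replace() above with skip_special_tokens (the latter is my addition, not a confirmed fix):

def clean_batch_outputs(generation_outputs, tokenizer):
    # skip_special_tokens drops pad/bos/eos; replace() handles the U+2047
    # "double question mark" left over from unknown ids.
    texts = tokenizer.batch_decode(generation_outputs, skip_special_tokens=True)
    return [t.replace("\u2047", "").strip() for t in texts]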
Hello @deep-diver, I tried batch decoding with your settings, and it helps performance a lot. But I found a strange phenomenon: given four pieces of content, the results you get by generating them one at a time are different from the results you get by decoding them all in one batch. I asked a detailed question about this in the Hugging Face discussion area and will copy it here: https://discuss.huggingface.co/t/results-of-model-generate-are-different-for-different-batch-sizes-of-the-decode-only-model/34878 You can try this in a notebook:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import torch
from peft import PeftModel
import transformers
import gradio as gr
assert (
    "LlamaTokenizer" in transformers._import_structure["models.llama"]
), "LLaMA is now in HuggingFace's main branch.\nPlease reinstall it: pip uninstall transformers && pip install git+https://github.com/huggingface/transformers.git"
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf", cache_dir="./cache/")
BASE_MODEL = "decapoda-research/llama-7b-hf"
LORA_WEIGHTS = "tloen/alpaca-lora-7b"
if torch.cuda.is_available():
    device = "cuda"

model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL,
    # FINETURED_MODEL,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
    cache_dir="./cache/",
)
model = PeftModel.from_pretrained(model, LORA_WEIGHTS, torch_dtype=torch.float16)
model.eval()
if torch.__version__ >= "2":
    model = torch.compile(model)
inputs = tokenizer("prompt", return_tensors="pt")
input_ids = inputs["input_ids"].to(device)
input_ids,inputs
(tensor([[ 1, 9508]], device='cuda:0'), {'input_ids': tensor([[ 1, 9508]]), 'attention_mask': tensor([[1, 1]])})
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))
inputs_b = tokenizer(["prompt","prompt","prompt"], return_tensors="pt", padding=True).to(device)
input_idsb=inputs_b["input_ids"].to(device)
input_idsb,inputs_b
(tensor([[ 1, 9508], [ 1, 9508], [ 1, 9508]], device='cuda:0'), {'input_ids': tensor([[ 1, 9508], [ 1, 9508], [ 1, 9508]], device='cuda:0'), 'attention_mask': tensor([[1, 1], [1, 1], [1, 1]], device='cuda:0')})
generation_config = GenerationConfig(
    temperature=1,
    top_p=1,
    top_k=50,
    num_beams=1,
    max_new_tokens=128,
)
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
    )
generation_output
GreedySearchDecoderOnlyOutput(sequences=tensor([[ 1, 9508, 368, 322, 29497, 29889, 13, 1576, 6938, 4091, 451, 367, 619, 519, 304, 278, 21886, 363, 738, 6410, 470, 18658, 17654, 491, 278, 21886, 408, 263, 1121, 310, 738, 9055, 297, 278, 28289, 310, 278, 7197, 29879, 313, ... (output truncated)
s = generation_output.sequences[0]
output = tokenizer.decode(s)
output
" promptly and efficiently.\nThe Company shall not be liable to the Customer for any loss or damage suffered by the Customer as a result of any delay in the delivery of the Goods (even if caused by the Company's negligence) unless the Customer has given written notice to the Company of the delay within 7 days of the date when the Goods were due to be delivered.\nThe Company shall not be liable to the Customer for any loss or damage suffered by the Customer as a result of any delay in the delivery of the Goods (even if caused by the Company's negligence) unless the"
generation_config = GenerationConfig(
    temperature=1,
    top_p=1,
    top_k=50,
    num_beams=1,
    max_new_tokens=128,
)
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_idsb,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
    )
generation_output
GreedySearchDecoderOnlyOutput(sequences=tensor([[ 1, 9508, 368, 322, 29497, 29889, 13, 1576, 6938, 338, 19355, 304, 5662, 3864, 393, 727, 338, 694, 5400, 8370, 1201, 470, 5199, 1020, 600, 860, 292, 297, 967, 11421, 521, 2708, 470, 297, 738, 760, 310, 967, 5381, 29889, 450, 6938, 5936, 4637, 393, 372, 756, 263, 23134, 304, ... (output truncated)
s = generation_output.sequences
output = tokenizer.batch_decode(s, skip_special_tokens=True)
output
[' promptly and efficiently.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business. The Company recognises that it has a responsibility to be proactive in ensuring that modern slavery is not taking place within its business or in its supply chains.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in', ' promptly and efficiently.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business. The Company recognises that it has a responsibility to be proactive in ensuring that modern slavery is not taking place within its business or in its supply chains.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in', ' promptly and efficiently.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business. The Company recognises that it has a responsibility to be proactive in ensuring that modern slavery is not taking place within its business or in its supply chains.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in']
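One thing worth checking in the notebook above (an assumption on my part, not a confirmed explanation for the mismatch): the batched call passes only input_ids, so generate() has to guess the attention mask and pad token. Passing both explicitly, straight from the tokenizer output, removes one common source of divergence between batch sizes:

with torch.no_grad():
    generation_output = model.generate(
        input_ids=inputs_b["input_ids"],
        attention_mask=inputs_b["attention_mask"],  # mask out any pad positions
        pad_token_id=tokenizer.pad_token_id,        # avoid the pad-token guess
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
    )

Even with that, small numeric differences between batch sizes are possible with fp16/8-bit kernels, and greedy decoding can flip to a different continuation once the top two logits are close.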
I was using that a few days ago and it was working fine, but now when generating with batch_size > 1 I get this error:
File "/llama-training/lora_finetuning/lora_finetuning/inference.py", line 156, in __call__
generation_output = self.model.generate(
File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 627, in generate
outputs = self.base_model.generate(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1524, in generate
return self.beam_search(
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2810, in beam_search
outputs = self(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 289, in forward
hidden_states = self.input_layernorm(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 84, in forward
variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
RuntimeError: CUDA error: device-side assert triggered
Has anyone had the same error and knows how to fix it? (I suspect a version update in the peft or transformers library.)
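Not a fix, but a debugging sketch that usually narrows down device-side asserts like this: run with synchronous CUDA launches so the failing op is reported at the real call site, and sanity-check that every token id (especially a pad token added after loading the model) is smaller than the embedding size. Here encodings stands for whatever batch you pass to generate(); the names are placeholders from the snippets above.

import os
# Must be set before CUDA is initialized, i.e. restart the process first.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Hypothetical check: token ids outside the embedding matrix are a common
# cause of Indexing.cu asserts during generate().
embedding_rows = model.get_input_embeddings().weight.shape[0]
max_id = int(encodings["input_ids"].max())
assert max_id < embedding_rows, (
    f"max token id {max_id} >= embedding size {embedding_rows}; "
    "did the tokenizer gain a pad token without model.resize_token_embeddings()?"
)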
Marking this; I'm hitting the same issue on my side...