Is there a way to prevent llama2 from truncating the answer in the middle of a sentence when it reaches the maximum token length?
Hello all,
I'm using the llama2 7b chat huggingface model and I want to restrict the output token size to a specific value such as 512. While initializing the model I set the max_new_tokens parameter to 512, as below:
llama_llm = transformers.pipeline(
model=model, tokenizer=tokenizer,
return_full_text=True,
task='text-generation',
# we pass model parameters here too
temperature=0.0,
max_new_tokens=512,
repetition_penalty=1.1
)
But this time it cuts off the answer in the middle of a sentence. I want the model to produce a response that fits within the length I specified. Maybe it can be done with prompt engineering; I tried that too but wasn't successful. Does anyone know how to solve this?
Thank you.
Looking for the same. Any help is appreciated. Thanks.
I think that you can use max_length=512
😉
@ArthurZucker Hello Arthur, thank you for your answer.
Firstly, I added max_length as you said:
llama_llm = transformers.pipeline(
model=model, tokenizer=tokenizer,
return_full_text=True,
task='text-generation',
# we pass model parameters here too
temperature=0.0,
repetition_penalty=1.1,
max_length=512
)
Then I got the following warning:
Input length of input_ids is 2467, but max_length is set to 512. This can lead to unexpected behavior. You should consider increasing max_new_tokens.
Then I changed max_length to 2980, since I want to restrict only the output to 512 tokens (2467 + 512). But the result did not change; it still cuts the response off in the middle of a sentence. Did I apply it the way you meant?
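For illustration, a minimal sketch of deriving that limit from the prompt instead of hard-coding 2980 (prompt here stands for the input string; tokenizer and llama_llm are the ones above; max_length counts prompt tokens plus generated tokens):

prompt_ids = tokenizer(prompt)["input_ids"]  # roughly 2467 tokens for the prompt in this example
result = llama_llm(prompt, max_length=len(prompt_ids) + 512)  # cap generation at 512 new tokens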
@merveeozbayy What about changing return_full_text to False, so it doesn't repeat the prompt in the answer and only the newly generated text is returned?
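For illustration, that change would look roughly like this (the same pipeline setup as above with only return_full_text flipped; it doesn't change the token budget, only what the pipeline returns):

llama_llm = transformers.pipeline(
model=model, tokenizer=tokenizer,
return_full_text=False,  # return only the generated continuation, not the prompt
task='text-generation',
temperature=0.0,
max_new_tokens=512,
repetition_penalty=1.1
)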
Looks like you might want to use a custom stopping_criteria; see here on the transformers GH. I imagine you can set a stopping criterion to stop generation when a sentence delimiter is encountered. Note you'd have to increase max_new_tokens (or max_length) to ensure the model wants to generate past the stopping point (it sounds like the model is stopping mid-sentence because it has exhausted its new-token budget).
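One way to read that suggestion, as a rough sketch: let the model generate past the 512-token target but stop at the first sentence delimiter after that point. The class name, delimiter set, and token thresholds below are illustrative (prompt stands for the input string; model and tokenizer are the ones loaded above):

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class SentenceEndCriteria(StoppingCriteria):
    # Stop at the first sentence delimiter once at least min_new_tokens have been generated.
    def __init__(self, tokenizer, prompt_len, min_new_tokens=512, delimiters=(".", "!", "?")):
        self.tokenizer = tokenizer
        self.prompt_len = prompt_len          # number of prompt tokens to skip when decoding
        self.min_new_tokens = min_new_tokens
        self.delimiters = delimiters
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        new_ids = input_ids[0, self.prompt_len:]
        if new_ids.shape[-1] < self.min_new_tokens:
            return False
        new_text = self.tokenizer.decode(new_ids, skip_special_tokens=True)
        return new_text.rstrip().endswith(self.delimiters)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=600,  # leave headroom beyond 512 so a delimiter can be reached
    stopping_criteria=StoppingCriteriaList([SentenceEndCriteria(tokenizer, inputs["input_ids"].shape[-1])]),
)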
EmanuelaBoros?
Where in the world did you get "return_full_text" from? I've been looking for a list of parameters and what each does, but I found nothing. If you have more info, please share!
The following worked for me to prevent llama2-7b-chat-hf from stopping in the middle of a sentence; it uses the special token sequence [2277, 29937] as a stopping criterion:
import torch
from transformers import StoppingCriteria

class EosListStoppingCriteria(StoppingCriteria):
    # Stop generation once the most recently generated tokens match the stop sequence.
    def __init__(self, eos_sequence=[2277, 29937]):
        self.eos_sequence = eos_sequence
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        last_ids = input_ids[:, -len(self.eos_sequence):].tolist()
        return self.eos_sequence in last_ids

generate_ids = finetuned_llama_model.generate(inputs['input_ids'], max_new_tokens=1000, repetition_penalty=1.1, stopping_criteria=[EosListStoppingCriteria()])
# Decode only the newly generated tokens (skip the prompt).
response = finetuned_llama_tokenizer.decode(generate_ids[0][inputs["input_ids"].shape[-1]:])
# response = finetuned_llama_tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
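A side note (my assumption, not from the post above): the IDs [2277, 29937] are specific to the Llama-2 tokenizer, so a more portable variant could derive the stop sequence from the stop text itself, e.g. if your prompt template uses a "###" marker:

stop_text = "###"  # assumed prompt-format marker; replace with whatever your template actually uses
stop_ids = finetuned_llama_tokenizer.encode(stop_text, add_special_tokens=False)
generate_ids = finetuned_llama_model.generate(inputs['input_ids'], max_new_tokens=1000, repetition_penalty=1.1, stopping_criteria=[EosListStoppingCriteria(eos_sequence=stop_ids)])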
Have there been any advancements on this?