Is there a way to prevent llama2 from truncating the answer in the middle of a sentence when it reaches the maximum token length?
Hello all,
I'm using the llama2 7b chat huggingface model and I want to restrict the output token size to a specific value such as 512. While initializing the model I set the max_new_tokens parameter to 512, as below:
llama_llm = transformers.pipeline(
model=model, tokenizer=tokenizer,
return_full_text=True,
task='text-generation',
# we pass model parameters here too
temperature=0.0,
max_new_tokens=512,
repetition_penalty=1.1
)
But this time it cuts off the answer in the middle of a sentence. I want the model to produce a response that fits within the length I specified. Maybe it can be done with prompt engineering; I tried that too but wasn't successful. Does anyone know how to solve this?
Thank you.
Looking for the same. Any help is appreciated. Thanks.
I think that you can use max_length=512
😉
@ArthurZucker Hello Arthur, thank you for your answer.
Firstly, I added max_length as you said:
llama_llm = transformers.pipeline(
model=model, tokenizer=tokenizer,
return_full_text=True,
task='text-generation',
# we pass model parameters here too
temperature=0.0,
repetition_penalty=1.1,
max_length=512
)
Then I got the following warning:
Input length of input_ids is 2467, but max_length is set to 512. This can lead to unexpected behavior. You should consider increasing max_new_tokens.
Then I changed max_length to 2980, since I want to restrict only the output to 512 tokens (2467 + 512). But the result did not change; it still cuts the response off in the middle of a sentence. Did I apply it the way you meant?
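For illustration, a minimal sketch of deriving that limit from the prompt instead of hard-coding 2980 (prompt here stands for the input string; tokenizer and llama_llm are the ones above; max_length counts prompt tokens plus generated tokens):

prompt_ids = tokenizer(prompt)["input_ids"]  # roughly 2467 tokens for the prompt in this example
result = llama_llm(prompt, max_length=len(prompt_ids) + 512)  # cap generation at 512 new tokens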
@merveeozbayy What about changing return_full_text to False, so it doesn't repeat the prompt in the answer and only the newly generated text is returned?
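For illustration, that change would look roughly like this (the same pipeline setup as above with only return_full_text flipped; it doesn't change the token budget, only what the pipeline returns):

llama_llm = transformers.pipeline(
model=model, tokenizer=tokenizer,
return_full_text=False,  # return only the generated continuation, not the prompt
task='text-generation',
temperature=0.0,
max_new_tokens=512,
repetition_penalty=1.1
)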
Looks like you might want to use a custom stopping_criteria; see here on the transformers GH. I imagine you can set a stopping criterion to stop generation when a sentence delimiter is encountered. Note you'd have to increase max_new_tokens (or max_length) to ensure the model wants to generate past the stopping point (it sounds like the model is stopping mid-sentence because it has exhausted its new-token budget).
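One way to read that suggestion, as a rough sketch: let the model generate past the 512-token target but stop at the first sentence delimiter after that point. The class name, delimiter set, and token thresholds below are illustrative (prompt stands for the input string; model and tokenizer are the ones loaded above):

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class SentenceEndCriteria(StoppingCriteria):
    # Stop at the first sentence delimiter once at least min_new_tokens have been generated.
    def __init__(self, tokenizer, prompt_len, min_new_tokens=512, delimiters=(".", "!", "?")):
        self.tokenizer = tokenizer
        self.prompt_len = prompt_len          # number of prompt tokens to skip when decoding
        self.min_new_tokens = min_new_tokens
        self.delimiters = delimiters
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        new_ids = input_ids[0, self.prompt_len:]
        if new_ids.shape[-1] < self.min_new_tokens:
            return False
        new_text = self.tokenizer.decode(new_ids, skip_special_tokens=True)
        return new_text.rstrip().endswith(self.delimiters)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=600,  # leave headroom beyond 512 so a delimiter can be reached
    stopping_criteria=StoppingCriteriaList([SentenceEndCriteria(tokenizer, inputs["input_ids"].shape[-1])]),
)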
EmanuelaBoros?
Where in the world did you get "return_full_text" from? I've been looking for a list of parameters and what each does, but I found nothing. If you have more info, please share!
The following worked for me to prevent llama2-7b-chat-hf from stopping in the middle of a sentence; it uses the special token sequence [2277, 29937] as a stopping criterion:
import torch
from transformers import StoppingCriteria

class EosListStoppingCriteria(StoppingCriteria):
    # Stop generation once the most recently generated tokens match the stop sequence.
    def __init__(self, eos_sequence=[2277, 29937]):
        self.eos_sequence = eos_sequence
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        last_ids = input_ids[:, -len(self.eos_sequence):].tolist()
        return self.eos_sequence in last_ids

generate_ids = finetuned_llama_model.generate(inputs['input_ids'], max_new_tokens=1000, repetition_penalty=1.1, stopping_criteria=[EosListStoppingCriteria()])
# Decode only the newly generated tokens (skip the prompt).
response = finetuned_llama_tokenizer.decode(generate_ids[0][inputs["input_ids"].shape[-1]:])
# response = finetuned_llama_tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
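A side note (my assumption, not from the post above): the IDs [2277, 29937] are specific to the Llama-2 tokenizer, so a more portable variant could derive the stop sequence from the stop text itself, e.g. if your prompt template uses a "###" marker:

stop_text = "###"  # assumed prompt-format marker; replace with whatever your template actually uses
stop_ids = finetuned_llama_tokenizer.encode(stop_text, add_special_tokens=False)
generate_ids = finetuned_llama_model.generate(inputs['input_ids'], max_new_tokens=1000, repetition_penalty=1.1, stopping_criteria=[EosListStoppingCriteria(eos_sequence=stop_ids)])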
Have there been any advancements on this?