Llama 2 inference - confusing prompts
Hi,
It is not clear whether we need to follow the prompt template for inference with pipeline as mentioned here, or whether we should use the pipeline code without special tokens as defined here.
Let's take this modified example code:
from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model, model_max_length=3500, truncation=True)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

system_prompt = 'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?'
text = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"

sequences = pipeline(
    text,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    return_full_text=False,
    max_new_tokens=300,
    temperature=0.9,
)
Questions:
- If we need to control the length of input sequences, should we initialize the tokenizer with model_max_length=X, truncation=True?
- Shouldn't we then also pass the tokenizer when defining the pipeline, as above?
- If we also need to control the length of output sequences, should we pass max_new_tokens=X to the pipeline?
- So, is model_max_length independent of max_new_tokens? Or is it model_max_length = input_length + max_new_tokens?
- In the code above, do we need to pass system_prompt or text when calling the pipeline?
- Does this change when we are calling dialog models like 7B-chat/13B-chat/70B-chat compared to the 7B/13B/70B models?
- What about finetuning on our dataset? Do we need to provide the input text as prompts with special tokens for the base/chat models?
Related issues here: https://github.com/huggingface/transformers/issues/4501 https://github.com/facebookresearch/llama-recipes/issues/114
Thanks in advance!
cc @pirj @osanseviero
You probably meant someone else, not @pirj
Ah, sorry about that! Your name was being suggested by GitHub!
cc @pcuenca and @philschmid as well here
If we need to control the length of input sequences should we initialize tokenizer with model_max_length=X, truncation=True?
Yes.
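For illustration, a minimal sketch (assuming an arbitrary 3500-token cap) of how model_max_length together with truncation bounds the tokenized input:

from transformers import AutoTokenizer

# Minimal sketch: model_max_length caps the tokenized input once truncation is enabled.
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    model_max_length=3500,
)

# Anything longer than model_max_length is cut off when truncation=True is passed.
very_long_prompt = "some repeated filler text " * 2000
input_ids = tokenizer(very_long_prompt, truncation=True)["input_ids"]
assert len(input_ids) <= 3500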
Shouldn't we then also pass the tokenizer when defining pipeline as above?
pipeline automatically picks the tokenizer of the corresponding model, so specifying the tokenizer explicitly is not needed.
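In other words, a sketch like this (same checkpoint as above) is enough; the matching tokenizer is loaded for you:

import torch
import transformers

# No tokenizer argument: the pipeline loads the checkpoint's own tokenizer automatically.
pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

Passing a customized tokenizer explicitly, as in the snippet above, still works if you want to override its defaults.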
If we need to also control the length of output sequences, should we pass max_new_tokens=X to pipeline?
You can pass generation params as you said (but at inference time, not when loading). I recommend checking the generation docs at https://huggingface.co/docs/transformers/main/main_classes/text_generation to dive into the parameters.
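For example (a sketch reusing the pipeline and text from the snippet above; the values are arbitrary), generation parameters such as max_new_tokens go into the call, not into transformers.pipeline(...):

# Generation parameters are passed per call, not when building the pipeline.
sequences = pipeline(
    text,
    max_new_tokens=300,  # upper bound on the number of newly generated tokens
    do_sample=True,
    temperature=0.9,
    top_k=10,
)
print(sequences[0]["generated_text"])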
In the code above, do we need to pass system_prompt or text when calling pipeline?
Yes, although you can get ok results without it. If you want to pass the system prompt to the chat llamas, you need to configure the prompt format as suggested in the blog post :)
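For reference, a sketch of the single-turn chat format from the blog post, reusing the pipeline above (the tokenizer adds the leading BOS token <s> itself, so it is left out of the string; note that the show question belongs in the user turn, not in the system prompt):

# Illustrative system prompt; the actual instruction text is up to you.
system_prompt = "You are a helpful assistant who recommends TV shows."
user_message = 'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?'

# Llama 2 chat template: system prompt wrapped in <<SYS>> tags, user turn closed with [/INST].
text = (
    "[INST] <<SYS>>\n"
    f"{system_prompt}\n"
    "<</SYS>>\n\n"
    f"{user_message} [/INST]"
)

sequences = pipeline(text, do_sample=True, max_new_tokens=300)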
I suggest posting questions in the forum too so it's easier for others to find! https://discuss.huggingface.co/