Llama 2 inference - confusing prompts
Hi,
It is not clear whether we need to follow the prompt template for inference with pipeline as mentioned here, or whether we should use the pipeline code without special tokens as defined here.
Let's take this modified example code:
from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model, model_max_length=3500, truncation=True)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

system_prompt = 'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?'
text = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"

sequences = pipeline(
    text,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    return_full_text=False,
    max_new_tokens=300,
    temperature=0.9,
)
Questions:
- If we need to control the length of input sequences, should we initialize the tokenizer with model_max_length=X, truncation=True?
- Shouldn't we then also pass the tokenizer when defining the pipeline, as above?
- If we also need to control the length of output sequences, should we pass max_new_tokens=X to the pipeline?
- So, is model_max_length independent of max_new_tokens? Or is it model_max_length = input_length + max_new_tokens?
- In the code above, do we need to pass system_prompt or text when calling the pipeline?
- Does this change when we are calling dialog models like 7B-chat/13B-chat/70B-chat compared to the 7B/13B/70B models?
- What about finetuning on our dataset? Do we need to provide the input text as prompts with special tokens for the base/chat models?
Related issues here: https://github.com/huggingface/transformers/issues/4501 https://github.com/facebookresearch/llama-recipes/issues/114
Thanks in advance!
cc @pirj @osanseviero
You probably meant someone else, not @pirj
Ah, sorry about that! Your name was being suggested by GitHub!
cc @pcuenca and @philschmid as well here
If we need to control the length of input sequences should we initialize tokenizer with model_max_length=X, truncation=True?
Yes.
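For illustration, a minimal sketch (assuming an arbitrary 3500-token cap) of how model_max_length together with truncation bounds the tokenized input:

from transformers import AutoTokenizer

# Minimal sketch: model_max_length caps the tokenized input once truncation is enabled.
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    model_max_length=3500,
)

# Anything longer than model_max_length is cut off when truncation=True is passed.
very_long_prompt = "some repeated filler text " * 2000
input_ids = tokenizer(very_long_prompt, truncation=True)["input_ids"]
assert len(input_ids) <= 3500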
Shouldn't we then also pass the tokenizer when defining pipeline as above?
pipeline automatically picks the tokenizer of the corresponding model, so specifying the tokenizer explicitly is not needed.
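In other words, a sketch like this (same checkpoint as above) is enough; the matching tokenizer is loaded for you:

import torch
import transformers

# No tokenizer argument: the pipeline loads the checkpoint's own tokenizer automatically.
pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

Passing a customized tokenizer explicitly, as in the snippet above, still works if you want to override its defaults.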
If we need to also control the length of output sequences, should we pass max_new_tokens=X to pipeline?
You can pass generation params as you said (but at inference time, not when loading). I recommend checking the generation docs at https://huggingface.co/docs/transformers/main/main_classes/text_generation to dive into the parameters.
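For example (a sketch reusing the pipeline and text from the snippet above; the values are arbitrary), generation parameters such as max_new_tokens go into the call, not into transformers.pipeline(...):

# Generation parameters are passed per call, not when building the pipeline.
sequences = pipeline(
    text,
    max_new_tokens=300,  # upper bound on the number of newly generated tokens
    do_sample=True,
    temperature=0.9,
    top_k=10,
)
print(sequences[0]["generated_text"])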
In the code above, do we need to pass system_prompt or text when calling pipeline?
Yes, although you can get ok results without it. If you want to pass the system prompt to the chat llamas, you need to configure the prompt format as suggested in the blog post :)
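For reference, a sketch of the single-turn chat format from the blog post, reusing the pipeline above (the tokenizer adds the leading BOS token <s> itself, so it is left out of the string; note that the show question belongs in the user turn, not in the system prompt):

# Illustrative system prompt; the actual instruction text is up to you.
system_prompt = "You are a helpful assistant who recommends TV shows."
user_message = 'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?'

# Llama 2 chat template: system prompt wrapped in <<SYS>> tags, user turn closed with [/INST].
text = (
    "[INST] <<SYS>>\n"
    f"{system_prompt}\n"
    "<</SYS>>\n\n"
    f"{user_message} [/INST]"
)

sequences = pipeline(text, do_sample=True, max_new_tokens=300)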
I suggest posting questions in the forum too so it's easier for others to find! https://discuss.huggingface.co/