Conversation_template
Could you tell me how to use this conversation_template in the chatbot? I used a training dataset that follows the Llama-3 conversation_template, but there doesn’t seem to be an argument to set this conversation_template in the chatbot.py. Should I use --prompt_structure to include the Llama-3 template as an argument?
Also, when training on Llama-3, should my dataset always follow its conversation_template?
Thank you so much.
Hi, first of all, thanks for your interest in LMFlow! Regarding your questions:
- conversation_template only works for model training (finetuning) with a conversation dataset (i.e., "type": "conversation" in the .json file), and it is responsible for adding the special tokens so that you don't need to add them yourself for different models. See here for a dataset example, or you could run
cd data
bash download.sh alpaca
and take the json file in train_conversation as a reference.
- For inference, you may try the following code, taken from the Llama HF repo, as a temporary workaround:
import torch
from transformers import pipeline
model_id = "meta-llama/Llama-3.2-1B-Instruct"
pipe = pipeline(
"text-generation",
model=model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
)
messages = [
{"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
{"role": "user", "content": "Who are you?"},
]
outputs = pipe(
messages,
max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
The chatbot.py is outdated and we're planning to upgrade it. As of now, it is not compatible with instruction/chat models. Sorry for the inconvenience.
Thank you for the explanation. However, I'm still a bit confused about the conversation dataset structure. For the training dataset, should I provide the already-templated data as {"type": "text_only", "instances": [...]}? It confuses me how I'm supposed to put the data into {"type": "conversation", "instances": []} since it has already been formatted with the conversation template.
If the data is already templated, you can choose based on the behavior you expect.
The reason we designed the conversation dataset type is that we want to not only do the tokenization and templating but also mask the user inputs, system prompts, and tool information, since the model can see them all at once and there's no need to generate them autoregressively. In other words, you do not need to train_on_prompt. The conversation dataset also supports multi-round conversations, and the mask will look like [1,1,1,1,0,0,0,1,1,1,0,0,0], say, for a conversation that has two rounds.
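Just to illustrate the idea (this is a rough sketch of the general approach in HF-style training, not LMFlow's actual code; the token ids below are made up), prompt masking usually means setting the label of every non-assistant token to -100, the ignore index of the cross-entropy loss, so that only assistant tokens contribute to the loss:
IGNORE_INDEX = -100  # cross-entropy loss skips positions with this label

# (token_ids, is_assistant) pairs for each templated segment of a two-round conversation
segments = [
    ([101, 102, 103, 104], False),  # system prompt + round-1 user input
    ([201, 202, 203], True),        # round-1 assistant reply
    ([301, 302, 303], False),       # round-2 user input
    ([401, 402, 403], True),        # round-2 assistant reply
]

input_ids, labels = [], []
for token_ids, is_assistant in segments:
    input_ids.extend(token_ids)
    # only assistant tokens keep their ids as labels; everything else is masked out
    labels.extend(token_ids if is_assistant else [IGNORE_INDEX] * len(token_ids))

print(input_ids)
print(labels)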
You can use the text_only dataset type if you've already organized each conversation into one string. The json file should then look like:
{
"type": "text_only",
"instances": [
{"text": "<|begin_of_text|>\n\n<|start_header_id|>system<|end_header_id|>\n\nYou are a chatbot developed by LMFlow team.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI am a chatbot developed by LMFlow team.<|eot_id|>"},
{"text": "SOME_OTHER_TEMPLATED_TEXT_2"},
{"text": "SOME_OTHER_TEMPLATED_TEXT_3"},
]
}
However, we cannot mask the prompt in this case, since it is extremely hard to parse out the tokens that should be masked. In other words, you effectively do train_on_prompt.
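If you need to produce such templated strings yourself, one option outside of LMFlow (assuming your data starts as role/content messages and you're fine with depending on the model's tokenizer) is tokenizer.apply_chat_template from transformers, which applies the model's own chat template:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

conversation = [
    {"role": "system", "content": "You are a chatbot developed by LMFlow team."},
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am a chatbot developed by LMFlow team."},
]

# tokenize=False returns the templated string, which can go straight into a text_only instance
templated_text = tokenizer.apply_chat_template(conversation, tokenize=False)
print(templated_text)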
Alternatively, the text2text dataset type will mask all content in input. If it's a single-round conversation, this should be fine (there is no difference between a templated text2text dataset and a conversation dataset once you set conversation_template correctly).
{
"type": "text2text",
"instances": [
{
"input": "<|begin_of_text|>\n\n<|start_header_id|>system<|end_header_id|>\n\nYou are a chatbot developed by LMFlow team.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
"output": "I am a chatbot developed by LMFlow team.<|eot_id|>",
},
{
"input": "SAMPLE_INPUT_2",
"output": "SAMPLE_OUTPUT_2",
},
{
"input": "SAMPLE_INPUT_3",
"output": "SAMPLE_OUTPUT_3",
}
]
}
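As a side note (again plain transformers rather than LMFlow, using the same example model as above), passing add_generation_prompt=True gives you exactly the prompt side of such an input/output split, since the returned string ends with the assistant header:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

prompt_messages = [
    {"role": "system", "content": "You are a chatbot developed by LMFlow team."},
    {"role": "user", "content": "Who are you?"},
]

# add_generation_prompt=True appends the assistant header, so this string is the "input";
# the assistant reply plus <|eot_id|> becomes the "output"
text2text_input = tokenizer.apply_chat_template(
    prompt_messages, tokenize=False, add_generation_prompt=True
)
print(text2text_input)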
Thank you for your explanation. I still have a question about how to build a chatbot. Does it affect the way I build the chatbot whether I use the "type": "conversation" kind of conversation data or the Llama-3-templated text for training? Also, the code below doesn't seem able to handle a multi-round conversation:
model_id = "meta-llama/Llama-3.2-1B-Instruct" pipe = pipeline( "text-generation", model=model_id, torch_dtype=torch.bfloat16, device_map="auto", ) messages = [ {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"}, {"role": "user", "content": "Who are you?"}, ] outputs = pipe( messages, max_new_tokens=256, ) print(outputs[0]["generated_text"][-1])