torchtune
Llama3 ChatFormat?
I've been trying to finetune Llama3 8b with a custom chat dataset, but there seems to be no Llama3 Chat Format class. How can I make a custom one or should I approach this a different way?
Is this what you need? https://github.com/pytorch/torchtune/blob/main/torchtune/datasets/_chat.py
@Broyojo This is a great question. There is no required "chat format" in the same sense as LLaMA2 where you needed to format your prompt with instruct tags, as in the Llama2ChatFormat class:
[INST] <<SYS>>
You are a helpful, respectful and honest assistant.
<</SYS>>"
I am going to Paris, what should I see? [/INST] Paris, the capital of France, is known for its stunning architecture...
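In code, applying that format looks roughly like this (a minimal sketch from memory; treat the exact Message fields and import paths as assumptions rather than the exact API):

# Rough sketch: a Llama2-style chat format rewrites the message *content*
# with [INST] / <<SYS>> tags, and the SentencePiece tokenizer then encodes
# that text as-is. Import paths and Message fields here are assumptions.
from torchtune.data import Llama2ChatFormat, Message

messages = [
    Message(role="system", content="You are a helpful, respectful and honest assistant."),
    Message(role="user", content="I am going to Paris, what should I see?"),
]
formatted = Llama2ChatFormat.format(messages)  # tags injected into the content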
Instead, the tokenizer handles appending all the special tokens. If you look at the official LLaMA3 prompt format, it's quite different.
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>
{{ user_message }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
If you look at our TikTokenTokenizer class, all of these ids are used as special tokens. So as long as you're using this tokenizer via tokenize_messages, like with the chat dataset class @aldialimucaj mentioned above, you don't need to pass in a chat format.
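As a rough sketch of what that looks like (the builder name and paths here are assumptions, so double-check against the repo):

# Sketch: the Llama3 tokenizer inserts <|start_header_id|>, <|eot_id|>, etc.
# itself, so the Messages stay plain and no ChatFormat is needed.
from torchtune.models.llama3 import llama3_tokenizer
from torchtune.data import Message

tokenizer = llama3_tokenizer("/path/to/tokenizer.model")  # placeholder path
messages = [
    Message(role="system", content="You are a helpful assistant."),
    Message(role="user", content="I am going to Paris, what should I see?"),
]
tokens, mask = tokenizer.tokenize_messages(messages)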
I am in the process of adding a tutorial to our documentation page very soon to explain these concepts and help make things clearer.
Okay, I was banging my head against the wall over this issue today.
I used the torchtune.datasets.chat_dataset component, but it required the chat_format positional argument, so I had to pass something. I dug into the repo but wasn't able to find Llama3's format, so I created my own Llama3ChatFormat. It worked. Then, after digging in some more, I realized that I was basically wrapping header tokens with another set of special header tokens :S
Then I realized that the tiktoken tokenizer class works differently than the sentencepiece tokenizer class. (The sentencepiece tokenizer is more or less format agnostic, while the tiktoken tokenizer has one rigid format, yet they both take exactly the same arguments.) To implement a new format for Llama3, we would have to create a new tokenizer that implements Tokenizer.
I was thinking of going back to the HF Trainer, but I had already spent too much time on this, so I kept going.
Then I created a dummy format for now:
class NoopChatFormat(ChatFormat):
    @classmethod
    def format(cls, sample: List[Message]) -> List[Message]:
        return sample
And appended to the file: torchtune/data/_chat_formats.py
And used this yaml config and it worked haha :)
# Dataset
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  conversation_style: openai
  chat_format: NoopChatFormat
  max_seq_len: 8192
  train_on_input: False
  split: train
  data_files: "conversations.jsonl"
seed: null
shuffle: True
(Note that the "openai" conversation_style is something I also implemented myself; I can open a PR for this.)
I can open a PR for this NoopChatFormat, OR we can make the "chat_format" argument optional. Let me know which way you want to proceed. @RdoubleA
I think we should also document how to use a local data file as a dataset, because it took me half an hour to configure properly.
This project is extremely promising. Keep up the great work!
Thank you for the transparent feedback @musabgultekin, I'm sorry you had to struggle to get this to work. Truthfully, we designed the dataset classes a little too much around llama2 which DOES require a chat format as nothing is handled by the SentencePieceTokenizer, but llama3 moves all the formatting to the tokenizer.
The approach you ended up doing is exactly how we did it. On the main branch, chat format should now be optional (thanks to @ebsmothers and @joecummings for anticipating this). If you clone the repo from main you should be able to use it without a chat format. Your NoopChatFormat should also work as is.
Working with a local dataset is something that will be covered on the tutorial that's in progress, hoping to put it up early this week. If you haven't figured it out already, it's very similar to how you would configure it with load_dataset directly.
chat_dataset(source='csv', data_files='my_data.csv', ...)
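Under the hood, source and data_files are passed more or less straight through to Hugging Face datasets, so the mapping is roughly this (a sketch for intuition, not the exact internal call):

# Roughly equivalent to what chat_dataset does internally with source/data_files:
from datasets import load_dataset

ds = load_dataset("csv", data_files="my_data.csv", split="train")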
I can open a PR in the meantime to clarify this in the docstrings.
Appreciate your patience on this and sticking through it! Let us know if there's any other way we can make this easier for you or others in the future.
Edit: and a PR for the OpenAI conversation style would be awesome, happy to take a look at that
How about updating the simple usage instructions in the README or other documents, so that everyone can follow them step by step to fine-tune with a custom dataset?
Thanks for your patience folks, I've just added a full tutorial on template differences between Llama2 and Llama3 and how to finetune Llama3 on a custom chat dataset here: https://pytorch.org/torchtune/main/tutorials/chat.html.
Hope that brings more clarity. Please do let me know if there's something that's not clear.
@jacklanda @musabgultekin @Broyojo
Great work! Thanks @RdoubleA so much!
The new doc page looks really great, thank you! It's much clearer now.
It's also great that the chat_format argument issue is fixed. Thanks @ebsmothers!
If the tokenizer handles all these special tokens itself, I think the generator should stop when an eot_id is generated instead of eos_id here.
That's set for Llama2. We probably need to add a config option, something like stop_tokens (since Llama3 Instruct has two).
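Something like this is what I have in mind (purely a sketch; stop_tokens is a hypothetical config value and model.sample_next stands in for the real decode step):

# Generation loop that stops on any of several stop tokens.
# For Llama3-Instruct, stop_tokens would hold both the <|eot_id|> and
# <|end_of_text|> ids (hypothetical option, not something that exists yet).
def generate(model, prompt_tokens, stop_tokens, max_new_tokens=256):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model.sample_next(tokens)  # placeholder decode step
        tokens.append(next_token)
        if next_token in stop_tokens:
            break
    return tokens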
@HaisongDing @musabgultekin thanks for pointing out the multiple stop tokens. I just opened #871 to address this, will work on cleaning it up today so we have proper support here.
Closing this as all user questions have been addressed.
For anyone getting a "module not found" error for the custom dataset when following the tutorial:
You need to run "tune cp <recipe_name> ./<recipe_name>.py" and use that local recipe file in the "tune run ..." call, so that the custom dataset module resolves relative to the local directory.