
Llama3 ChatFormat?

Open Broyojo opened this issue 10 months ago • 12 comments

I've been trying to fine-tune Llama3 8B with a custom chat dataset, but there seems to be no Llama3 ChatFormat class. How can I make a custom one, or should I approach this a different way?

Broyojo avatar Apr 20 '24 13:04 Broyojo

Is this what you need? https://github.com/pytorch/torchtune/blob/main/torchtune/datasets/_chat.py

aldialimucaj avatar Apr 20 '24 19:04 aldialimucaj

@Broyojo This is a great question. There is no required "chat format" in the same sense as Llama2, where you needed to format your prompt with instruct tags, as in the Llama2ChatFormat class:

[INST] <<SYS>>
You are a helpful, respectful and honest assistant.
<</SYS>>"

I am going to Paris, what should I see? [/INST] Paris, the capital of France, is known for its stunning architecture...

Instead, the tokenizer handles appending all the special tokens. If you look at the official LLaMA3 prompt format, it's quite different.

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_message }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

If you look at our TikTokenTokenizer class, all of these ids are used as special tokens. So as long as you're using this tokenizer via tokenize_messages, like with the chat dataset class @aldialimucaj mentioned above, you don't need to pass in a chat format.
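For illustration, this is roughly how it looks in code (a minimal sketch; the module paths and argument names here are from memory and may differ slightly depending on your torchtune version):

from torchtune.data import Message
from torchtune.models.llama3 import llama3_tokenizer  # assumed import path, check your version

tokenizer = llama3_tokenizer("/path/to/Meta-Llama-3-8B-Instruct/tokenizer.model")

messages = [
    Message(role="system", content="You are a helpful assistant."),
    Message(role="user", content="I am going to Paris, what should I see?"),
    Message(role="assistant", content="Paris, the capital of France, ..."),
]

# The tokenizer inserts <|begin_of_text|>, the header ids, and <|eot_id|> itself,
# so no ChatFormat needs to be applied on top of the messages.
tokens, mask = tokenizer.tokenize_messages(messages)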

I am in the process of adding a tutorial to our documentation page to explain these concepts and make things clearer; it should be up very soon.

RdoubleA avatar Apr 20 '24 21:04 RdoubleA

Okay, I was banging my head against the wall over this issue today.

I used the torchtune.datasets.chat_dataset component, but it required a chat_format argument, so I had to pass something. I dug into the repo but couldn't find Llama3's format, so I created my own Llama3ChatFormat. It worked. Then, after digging in some more, I realized I was basically wrapping the header tokens in another set of special header tokens :S

Then I realized that the tiktoken tokenizer class works differently than the sentencepiece tokenizer class. (The sentencepiece tokenizer is essentially format agnostic, while the tiktoken tokenizer has one rigid format, yet they both take exactly the same arguments.) To implement a new format for Llama3, we would have to create a new tokenizer that implements Tokenizer.

I was thinking of going back to HF Trainer, but I had already spent too much time on this, so I kept going.

Then I created a dummy format for now:

from typing import List
from torchtune.data import ChatFormat, Message

class NoopChatFormat(ChatFormat):
    # Pass-through format: return the messages unchanged.
    @classmethod
    def format(cls, sample: List[Message]) -> List[Message]:
        return sample

And appended it to the file torchtune/data/_chat_formats.py.

Then I used this YAML config, and it worked haha :)

# Dataset
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  conversation_style: openai
  chat_format: NoopChatFormat
  max_seq_len: 8192
  train_on_input: False
  split: train
  data_files: "conversations.jsonl"
seed: null
shuffle: True

(Note that "openai" conversation_style is something that I also implemented myself, I can open a PR for this)

I can open a PR for this NoopChatFormat, OR we can make the "chat_format" argument optional. Let me know which way you want to proceed. @RdoubleA

musabgultekin avatar Apr 21 '24 13:04 musabgultekin

I think we should also document how to use a local data file as a dataset, since it took me half an hour to configure it properly.

This project is extremely promising. Keep up the great work!

musabgultekin avatar Apr 21 '24 13:04 musabgultekin

Thank you for the transparent feedback @musabgultekin, I'm sorry you had to struggle to get this to work. Truthfully, we designed the dataset classes a little too much around Llama2, which DOES require a chat format since nothing is handled by the SentencePieceTokenizer, whereas Llama3 moves all the formatting into the tokenizer.

The approach you ended up with is exactly how we did it. On the main branch, chat_format should now be optional (thanks to @ebsmothers and @joecummings for anticipating this). If you clone the repo from main you should be able to use it without a chat format. Your NoopChatFormat should also work as is.

Working with a local dataset will be covered in the tutorial that's in progress; I'm hoping to put it up early this week. If you haven't figured it out already, it's very similar to how you would configure it with load_dataset directly:

chat_dataset(source='csv', data_files='my_data.csv', ...)

I can open a PR in the meantime to clarify this in the docstrings.

Appreciate your patience on this and sticking through it! Let us know if there's any other way we can make this easier for you or others in the future.

Edit: and a PR for the OpenAI conversation style would be awesome, happy to take a look at that

RdoubleA avatar Apr 21 '24 16:04 RdoubleA

How about updating the simple usage instructions in the README or other docs, so that everyone can follow them step by step to fine-tune on their own custom dataset?

jacklanda avatar Apr 23 '24 08:04 jacklanda

Thanks for your patience folks, I've just added a full tutorial on template differences between Llama2 and Llama3 and how to finetune Llama3 on a custom chat dataset here: https://pytorch.org/torchtune/main/tutorials/chat.html.

Hope that brings more clarity. Please do let me know if there's something that's not clear.

@jacklanda @musabgultekin @Broyojo

RdoubleA avatar Apr 24 '24 00:04 RdoubleA

Great work! Thanks @RdoubleA so much!

jacklanda avatar Apr 24 '24 02:04 jacklanda

The new doc page looks really great, thank you! It's much clearer now. It's also incredible that the chat_format argument issue is fixed. Thanks @ebsmothers!

musabgultekin avatar Apr 24 '24 06:04 musabgultekin

If the tokenizer handles all the special tokens, I think the generator should stop when an eot_id is generated instead of eos_id here.

HaisongDing avatar Apr 24 '24 13:04 HaisongDing

That's set for Llama2. We probably need to add a config option, something like stop_tokens (since Llama3-Instruct has two).
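Roughly, the decode loop would check against a set of stop ids instead of a single eos id. A minimal sketch, assuming the <|eot_id|> and <|end_of_text|> ids are 128009 and 128001 (worth double-checking against the tokenizer), with next_token_fn as a stand-in for whatever produces the next token id:

# Illustrative only -- not torchtune's actual generation code.
LLAMA3_STOP_TOKENS = {128009, 128001}  # assumed ids for <|eot_id|>, <|end_of_text|>

def decode(next_token_fn, max_new_tokens: int, stop_tokens=LLAMA3_STOP_TOKENS):
    tokens = []
    for _ in range(max_new_tokens):
        tok = next_token_fn(tokens)  # hypothetical: returns the next token id
        if tok in stop_tokens:       # stop on any configured stop token
            break
        tokens.append(tok)
    return tokens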

musabgultekin avatar Apr 24 '24 16:04 musabgultekin

@HaisongDing @musabgultekin thanks for pointing out the multiple stop tokens. I just opened #871 to address this, will work on cleaning it up today so we have proper support here.

ebsmothers avatar Apr 25 '24 19:04 ebsmothers

Closing this as all user questions have been addressed.

RdoubleA avatar Apr 29 '24 17:04 RdoubleA

For anyone getting a "module not found" error for the custom dataset when following the tutorial:

You need to "tune cp <recipe_name> ./<recipe_name>.py" and use that local recipe file in the "tune run ..." call, so that it resolves relative to the local directory.

MMM-J avatar May 15 '24 18:05 MMM-J