trl icon indicating copy to clipboard operation
trl copied to clipboard

conversational data for SFTTrainer

Open edixiong opened this issue 11 months ago • 3 comments

For SFTTrainer, if we load the dataset using a conversational form (ChatML format), the function apply_chat_template is used (https://github.com/huggingface/trl/blob/v0.7.11/trl/extras/dataset_formatting.py#L55) with tokenize=False. Later in SFTTrainer, the data is tokenized again with add_special_tokens=True. In tokenizer like LLaMATokenizer, there will be two bos tokens at the very beginning: <s><s> ..., which is not intended. Maybe we should modify dataset_kwargs at this line https://github.com/huggingface/trl/blob/v0.7.11/trl/trainer/sft_trainer.py#L246 so that dataset_kwargs['add_special_tokens']=True?

edixiong avatar Mar 07 '24 05:03 edixiong

Yes that would make sense, would you like to open a PR for the fix? cc @philschmid what do you think?

younesbelkada avatar Mar 11 '24 13:03 younesbelkada

sure I will do that

edixiong avatar Mar 13 '24 08:03 edixiong

Good idea @edixiong, thats what I currently do manually. https://www.philschmid.de/fine-tune-llms-in-2024-with-trl#4-fine-tune-llm-using-trl-and-the-sfttrainer

This probably should only be applied if the "chatml" or template is detected.

philschmid avatar Mar 13 '24 10:03 philschmid

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] avatar Apr 06 '24 15:04 github-actions[bot]