torchtune How to use train and test split with the recipes?

Dear torchtune team,

With the SFT trainer we can pass train_dataset=ds["train"], eval_dataset=ds["validation"] when the data is a Hugging Face dataset with splits.

I wonder how this is achieved in the fine-tuning recipes with an instruction dataset, particularly in a YAML configuration file. With the current example in the tutorial, split: train, I feel that the whole dataset is used for training. Should we prepare JSON/CSV files beforehand, already split into train/test/validation sets? Thanks

Jan 01 '25 13:01 7rabbit

Hey @7rabbit - currently this isn't available through a YAML config. You're more than welcome to hack onto our recipes to add this functionality, but we also have it on our roadmap to support early this year!

For now, if you want to only train on part of the dataset you can either preprocess yourself, do it "online" through a custom transform, or specify a percentage of the dataset to use like so "train[:50%]".
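As a rough illustration of the "preprocess yourself" option, here is a minimal sketch that splits a list of instruction records into train and validation JSON files using only the Python standard library. The function names (`split_records`, `write_split_files`), the output file names, and the record fields are hypothetical and chosen for this example; none of this is part of torchtune.

```python
import json
import random
from pathlib import Path


def split_records(records, val_fraction=0.1, seed=0):
    """Shuffle a copy of `records` and return (train, validation) lists.

    Hypothetical helper for illustration; not a torchtune API.
    """
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_val = max(1, int(len(shuffled) * val_fraction))  # at least one validation row
    return shuffled[n_val:], shuffled[:n_val]


def write_split_files(records, out_dir, val_fraction=0.1, seed=0):
    """Write train.json and validation.json into `out_dir` and return both paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    train, val = split_records(records, val_fraction, seed)
    train_path = out_dir / "train.json"
    val_path = out_dir / "validation.json"
    train_path.write_text(json.dumps(train, indent=2), encoding="utf-8")
    val_path.write_text(json.dumps(val, indent=2), encoding="utf-8")
    return train_path, val_path
```

The resulting train.json could then be referenced from the dataset section of the config as a local JSON file, assuming your dataset builder accepts local files. Since evaluation is not available through the YAML config, validation.json would have to be consumed by a modified recipe or a separate script.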

Jan 06 '25 14:01 joecummings

Hi @joecummings, following up: has this feature been implemented for the YAML config?

Feb 12 '25 18:02 shaunakjoshi12

@shaunakjoshi12 we have not yet implemented this feature. If you're interested in contributing it we would be happy to review a PR!

Feb 16 '25 00:02 ebsmothers

Any news on this feature, please?

Jul 03 '25 21:07 habibregask86