Configuration-based use of HF hub-hosted datasets for training
Per the title, this adds a structured `hf_dataset` YAML configuration parameter for specifying an HF Hub-hosted dataset (via `name`) to use for training. It supports the `datasets` library's local file-system caching, named splits, named configurations (via `configuration`), split-slicing syntax for specifying train, validation, and test datasets, and so on.
The dataset feature names that correspond to the attributes or keys of prompt/completion value pairs can be specified, as can a single pure-text feature for datasets with one text field (via `text_feature` in that case).
Added YAML parameters, for example (train on the first 1,000 examples of the train split and validate with the last 100; no test dataset):
```yaml
hf_dataset:
  name: "billsum"
  train_split: "train[:1000]"
  valid_split: "train[-100:]"
  prompt_feature: "text"
  completion_feature: "summary"
```
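As a rough sketch of how such a configuration maps onto the `datasets` API (the helper below is illustrative, not the PR's actual code), the fields translate directly into `datasets.load_dataset` arguments, with the split-slicing strings passed through unchanged:

```python
# Hypothetical helper: turn an hf_dataset config dict (as parsed from YAML)
# into keyword arguments for datasets.load_dataset. Field names mirror the
# YAML example above; the function itself is a sketch, not part of this PR.

def load_kwargs(cfg: dict, split_key: str) -> dict:
    """Build load_dataset kwargs for one split ("train", "valid", or "test")."""
    kwargs = {"path": cfg["name"]}
    if "configuration" in cfg:
        # Named dataset configuration, e.g. a language subset.
        kwargs["name"] = cfg["configuration"]
    split = cfg.get(f"{split_key}_split")
    if split is not None:
        # Split-slicing syntax like "train[:1000]" passes through verbatim.
        kwargs["split"] = split
    return kwargs

cfg = {
    "name": "billsum",
    "train_split": "train[:1000]",
    "valid_split": "train[-100:]",
    "prompt_feature": "text",
    "completion_feature": "summary",
}

print(load_kwargs(cfg, "train"))   # {'path': 'billsum', 'split': 'train[:1000]'}
print(load_kwargs(cfg, "valid"))   # {'path': 'billsum', 'split': 'train[-100:]'}
# Each example would then be rendered from the named features, e.g.
#   prompt = example[cfg["prompt_feature"]]
#   completion = example[cfg["completion_feature"]]
```

With a `text_feature` config, the same pattern would look up a single field per example instead of a prompt/completion pair.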
Motivated by the need to reproduce #620 with an open dataset.
I like this a lot.
Once everyone finetunes for the first time, they quickly see it's all about the data.
Mixing private and public HF datasets in the same config would be fantastic as well.