Configuration-based use of HF hub-hosted datasets for training

Open chimezie opened this issue 2 years ago • 1 comments

Per the title, allow a structured hf_dataset YAML configuration parameter that specifies an HF hub-hosted dataset (via name) to use for training. This makes the datasets library's features available: local file system caching, named splits, named configurations (via configuration), split slicing syntax for specifying the train, validation, and test datasets, etc.

The names of the dataset features that hold the prompt and completion values can be specified. For datasets with a single pure-text value, the feature name is given via text_feature instead.

Example of the added YAML parameters (train on the first 1K records of the train split and validate with the last 100, with no test dataset):

hf_dataset:
  name: "billsum"
  train_split: "train[:1000]"
  valid_split: "train[-100:]"
  prompt_feature: "text"
  completion_feature: "summary"

See: Splits and Configurations, billsum, & HF Dataset API
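
For illustration only, here is a rough Python sketch of how a configuration like the one above might map onto the Hugging Face datasets library. It is not the code from this proposal. The helper names (build_load_calls, to_record, load_splits) and the test_split key are assumptions made up for this example; only name, configuration, train_split, valid_split, prompt_feature, completion_feature, and text_feature come from the description above. The one library call used is datasets.load_dataset with its path, name, and split arguments.

```python
from typing import Any, Dict, Optional

# Hypothetical split keys; "test_split" is an assumption, the issue only
# shows train_split and valid_split in its example.
SPLIT_KEYS = ("train_split", "valid_split", "test_split")


def build_load_calls(hf_dataset: Dict[str, Any]) -> Dict[str, Optional[Dict[str, Any]]]:
    """Map each configured split to keyword arguments for datasets.load_dataset.

    Splits that are missing from the config map to None (nothing to load).
    """
    calls: Dict[str, Optional[Dict[str, Any]]] = {}
    for key in SPLIT_KEYS:
        split = hf_dataset.get(key)
        if split is None:
            calls[key] = None
            continue
        calls[key] = {
            "path": hf_dataset["name"],                 # hub dataset name, e.g. "billsum"
            "name": hf_dataset.get("configuration"),    # named configuration, if any
            "split": split,                             # supports slicing, e.g. "train[:1000]"
        }
    return calls


def to_record(row: Dict[str, Any], hf_dataset: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the configured features out of one dataset row."""
    text_feature = hf_dataset.get("text_feature")
    if text_feature:
        # Pure-text dataset: a single feature holds the training text.
        return {"text": row[text_feature]}
    return {
        "prompt": row[hf_dataset["prompt_feature"]],
        "completion": row[hf_dataset["completion_feature"]],
    }


def load_splits(hf_dataset: Dict[str, Any]) -> Dict[str, Any]:
    """Download (or read from the local cache) every configured split."""
    from datasets import load_dataset  # imported lazily; requires `pip install datasets`

    loaded = {}
    for key, kwargs in build_load_calls(hf_dataset).items():
        if kwargs is None:
            loaded[key] = None
            continue
        rows = load_dataset(**kwargs)
        loaded[key] = [to_record(row, hf_dataset) for row in rows]
    return loaded
```

With the billsum example, build_load_calls would produce a train call with split "train[:1000]", a validation call with split "train[-100:]", and None for the test split. Keeping the argument building separate from the download makes the mapping easy to check without network access.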

chimezie avatar Apr 20 '24 15:04 chimezie

Motivated by the need to reproduce #620 with an open dataset

chimezie avatar Apr 20 '24 16:04 chimezie

I like this a lot.

Once everyone finetunes for the first time, they quickly see it's all about the data.

Mixing private and public HF datasets in the same config would be fantastic as well.

fblissjr avatar Jun 11 '24 13:06 fblissjr