alignment-handbook
Clarification on dataset mixer
From the README in /scripts:
datasets_mixer:
  dataset_1: 0.5   # Use 50% of the training examples
  dataset_2: 0.66  # Use 66% of the training examples
  dataset_3: 0.10  # Use 10% of the training examples
dataset_splits:
  - train_xxx  # The training splits to mix
  - test_xxx   # The test splits to mix
From the comments, it looks like ONLY the training samples from dataset_1, dataset_2, and dataset_3 are considered. There is no explanation of how each dataset contributes to the test_xxx split.
However, the actual implementation appears to look for the test_xxx split in every dataset specified:
https://github.com/huggingface/alignment-handbook/blob/70769f9e9ba41c7f08ba6c4ff3725441b68b7ca3/src/alignment/data.py#L225-L230
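To make the question concrete, here is a minimal sketch of how I read the behavior, not the library's actual code. It assumes each fraction is applied only to splits whose name contains "train", while splits whose name contains "test" are taken in full from every dataset and concatenated. The function name mix_datasets_sketch and the in-memory dict layout are hypothetical stand-ins for the real dataset loading.

```python
def mix_datasets_sketch(datasets, mixer, splits):
    """Hypothetical stand-in for the mixer.

    datasets: {dataset_name: {split_name: list_of_examples}}
    mixer:    {dataset_name: fraction of training examples to keep}
    splits:   split names to mix; assumed present in every dataset
    """
    train, test = [], []
    for name, frac in mixer.items():
        for split in splits:
            examples = datasets[name][split]
            if "train" in split:
                # Assumption: the fraction only subsamples training splits.
                train.extend(examples[: int(frac * len(examples))])
            elif "test" in split:
                # Assumption: test splits are kept whole, ignoring the fraction.
                test.extend(examples)
    return train, test


# Toy data in place of real datasets.
toy = {
    "dataset_1": {"train_xxx": list(range(10)), "test_xxx": ["a", "b"]},
    "dataset_2": {"train_xxx": list(range(100, 103)), "test_xxx": ["c"]},
}
train, test = mix_datasets_sketch(
    toy, {"dataset_1": 0.5, "dataset_2": 0.66}, ["train_xxx", "test_xxx"]
)
print(len(train), test)
```

If this reading is right, the fractions in datasets_mixer have no effect on the evaluation set, which is what the README comments leave unstated.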
Could you please explain the relationships between multiple datasets and splits? Thank you.