Clarification on dataset mixer

Open deep-diver opened this issue 10 months ago • 5 comments

From the README in /scripts:

datasets_mixer:
    dataset_1: 0.5  # Use 50% of the training examples
    dataset_2: 0.66 # Use 66% of the training examples
    dataset_3: 0.10 # Use 10% of the training examples
dataset_splits:
- train_xxx         # The training splits to mix
- test_xxx          # The test splits to mix

From the comments, it looks like the fractions apply ONLY to the training samples from dataset_1, dataset_2, and dataset_3. There is no explanation of how each dataset contributes to the test_xxx split.

However, the actual implementation seems to look up the test_xxx split in every dataset specified:

https://github.com/huggingface/alignment-handbook/blob/70769f9e9ba41c7f08ba6c4ff3725441b68b7ca3/src/alignment/data.py#L225-L230
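To make the question concrete, here is a minimal sketch of how I read that code. It is plain Python with a hypothetical `mix_datasets_sketch` helper and toy lists standing in for real datasets, not the handbook's implementation. In particular, taking test splits in full and skipping missing splits are my assumptions.

```python
import random


def mix_datasets_sketch(datasets, mixer, splits, seed=42):
    """Hypothetical stand-in for the mixing step (not the library's code).

    datasets: {dataset_name: {split_name: list_of_examples}}
    mixer:    {dataset_name: fraction of the training split to keep}
    splits:   split names looked up in every dataset
    """
    train, test = [], []
    for name, frac in mixer.items():
        for split in splits:
            if split not in datasets[name]:
                continue  # assumption: a missing split is simply skipped
            examples = datasets[name][split]
            if "train" in split:
                # The fraction is applied to training splits only.
                train.extend(examples[: int(frac * len(examples))])
            elif "test" in split:
                # Assumption: test splits are concatenated in full.
                test.extend(examples)
    random.Random(seed).shuffle(train)  # shuffle the mixed training set
    return {"train": train, "test": test}


toy = {
    "dataset_1": {"train_xxx": list(range(10)), "test_xxx": ["a", "b"]},
    "dataset_2": {"train_xxx": list(range(100, 103)), "test_xxx": ["c"]},
}
mixed = mix_datasets_sketch(
    toy,
    {"dataset_1": 0.5, "dataset_2": 0.66},
    ["train_xxx", "test_xxx"],
)
```

If this reading is right, the fractions never touch the test side: every dataset's test_xxx split ends up in the mixed test set in full.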

Could you please explain the relationships between multiple datasets and splits? Thank you.

deep-diver, Apr 18 '24 09:04