The fine-tuning script run_speech_recognition_seq2seq_streaming.py uses interleave_datasets, which will truncate the train split
The fine-tuning script run_speech_recognition_seq2seq_streaming.py uses the interleave_datasets function to combine the train and validation splits. But I think what we really want is concatenate_datasets, because according to the docs, the result of interleave_datasets ends when one of the source datasets runs out of examples (in the default mode). For example, if the train split has 100 entries and the validation split has 10 entries, the result would contain only 10 entries from the validation split and 10 from the train split. That means we waste most of the existing train split.
As an example:
>>> from datasets import Dataset, interleave_datasets, concatenate_datasets
>>> d1 = Dataset.from_dict({"a": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})
>>> d2 = Dataset.from_dict({"a": [10, 11, 12]})
>>> print(interleave_datasets([d1, d2])['a'])
[0, 10, 1, 11, 2, 12]
>>> print(concatenate_datasets([d1, d2])['a'])
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
We need to use interleave_datasets for streaming datasets: here we do not know the length of each dataset a priori, so we mix them on the fly based on the sampling probabilities that we define, potentially truncating individual datasets when we completely iterate over one of them (see "stopping strategies" in the docs). Whereas we use concatenate_datasets for non-streaming datasets, since we know the length of each dataset a priori and so can mix them in their entirety. See the docs.
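For instance, the streaming case looks like this (a sketch; the dataset name and sampling weights are illustrative, not taken from the script):
>>> from datasets import load_dataset, interleave_datasets
>>> train = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train", streaming=True)
>>> val = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="validation", streaming=True)
>>> mixed = interleave_datasets([train, val], probabilities=[0.9, 0.1], seed=42)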
Ideally, this is the kind of logic that we want to implement, borrowed from the Distil-Whisper training code: https://github.com/huggingface/distil-whisper/blob/914dcdf3919552d5a3826a9d5db99b059ddcc16e/training/run_distillation.py#L600
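A minimal sketch of that branching (combine_splits and its parameters are hypothetical names for illustration, not taken from the linked code):

from datasets import concatenate_datasets, interleave_datasets

def combine_splits(splits, streaming, probabilities=None, seed=None):
    if streaming:
        # Streaming: lengths are unknown a priori, so mix on the fly.
        # The default "first_exhausted" stopping strategy truncates once
        # the smallest split runs out of examples.
        return interleave_datasets(splits, probabilities=probabilities, seed=seed)
    # Non-streaming: lengths are known a priori, so keep every example.
    return concatenate_datasets(splits)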