The fine-tuning script run_speech_recognition_seq2seq_streaming.py uses interleave_datasets, which will truncate the train split
The fine-tuning script run_speech_recognition_seq2seq_streaming.py uses the interleave_datasets function to combine the train and validation splits. But I think what we really want is concatenate_datasets, because according to the docs, the result of interleave_datasets ends when one of the source datasets runs out of examples (in the default mode). For example, if the train split has 100 entries and the validation split has 10 entries, the result would contain only 10 entries from the validation split and 10 from the train split. That means we waste most of the existing train split.
As an example:
>>> from datasets import Dataset, interleave_datasets, concatenate_datasets
>>> d1 = Dataset.from_dict({"a": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})
>>> d2 = Dataset.from_dict({"a": [10, 11, 12]})
>>> print(interleave_datasets([d1, d2])['a'])
[0, 10, 1, 11, 2, 12]
>>> print(concatenate_datasets([d1, d2])['a'])
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
We need to use interleave_datasets for streaming datasets: here we do not know the length of each dataset a priori, so we mix them on the fly based on the sampling probabilities that we define, potentially truncating individual datasets when we completely iterate over one of them (see "stopping strategies" in the docs). Whereas we use concatenate_datasets for non-streaming datasets, since we know the length of each dataset a priori and so can mix them in their entirety. See the docs.
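For instance, the streaming case looks like this (a sketch; the dataset name and sampling weights are illustrative, not taken from the script):
>>> from datasets import load_dataset, interleave_datasets
>>> train = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train", streaming=True)
>>> val = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="validation", streaming=True)
>>> mixed = interleave_datasets([train, val], probabilities=[0.9, 0.1], seed=42)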
Ideally, this is the kind of logic that we want to implement, borrowed from the Distil-Whisper training code: https://github.com/huggingface/distil-whisper/blob/914dcdf3919552d5a3826a9d5db99b059ddcc16e/training/run_distillation.py#L600
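A minimal sketch of that branching (combine_splits and its parameters are hypothetical names for illustration, not taken from the linked code):

from datasets import concatenate_datasets, interleave_datasets

def combine_splits(splits, streaming, probabilities=None, seed=None):
    if streaming:
        # Streaming: lengths are unknown a priori, so mix on the fly.
        # The default "first_exhausted" stopping strategy truncates once
        # the smallest split runs out of examples.
        return interleave_datasets(splits, probabilities=probabilities, seed=seed)
    # Non-streaming: lengths are known a priori, so keep every example.
    return concatenate_datasets(splits)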