MixingDataset needed
It would be similar to ConcatDataset in that it combines (concatenates) multiple datasets over the sequences (unlike e.g. MetaDataset or ConcatSeqsDataset, which combine data within a sequence, or combine multiple sequences into one).
Some datasets can already combine data themselves, e.g. HDFDataset or OggZipDataset, and you can duplicate the files there to get different ratios. But this can be too inefficient for huge amounts of data.
DistributeFilesDataset can also be helpful for dealing with huge amounts of data, and again you can duplicate files to achieve different ratios. But the proposed MixingDataset here would allow more fine-grained control over the mixing.
Example case I'm considering here (a rough config sketch follows after the list):
- LibriSpeech data, 960h. A single OggZip file. I want to go 100 times over this, i.e. see approx. 100,000h of it during training.
- TTS data generated from LibriSpeech LM text, ~75,000h. 75 OggZip files, each ~1000h. I want to see maybe ~100,000h of it during the whole training, to have a 1:1 ratio with the LibriSpeech data.
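A rough sketch of how such a config could look. Note that the MixingDataset class and its `datasets`/`mixing_ratios` options are hypothetical here (this is exactly what is being proposed); only OggZipDataset and its usual options exist, and the paths are placeholders:

```python
# Hypothetical config sketch; "MixingDataset", "datasets" and "mixing_ratios" are made-up names.
train = {
    "class": "MixingDataset",  # proposed dataset, not implemented yet
    "datasets": {
        "librispeech": {
            "class": "OggZipDataset",
            "path": "librispeech-960h.ogg.zip",  # placeholder path
            # ... audio/targets options ...
        },
        "tts": {
            "class": "OggZipDataset",
            "path": ["tts-libri-lm-%03i.ogg.zip" % i for i in range(75)],  # placeholder paths
            # ... audio/targets options ...
        },
    },
    # 1:1 ratio in terms of data seen over the whole training
    "mixing_ratios": {"librispeech": 0.5, "tts": 0.5},  # hypothetical option
}
```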
There is one preliminary implementation in i6_experiments.users.dorian_koch.datasets.MixingDataset.MixingDataset, but it is problematic for various reasons: conceptually, it expects easy random access in the child datasets, which is not true in general; it also expects to know the exact num_seqs of each child; and it is a bit hacky overall.
If possible, or maybe as an option, the seqs could alternate exactly in the requested ratio, such that you really get the requested ratio in each mini batch.
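As a minimal illustration of how such exact alternation could be scheduled (plain Python, all names made up; this is not part of any existing dataset):

```python
import itertools
from typing import Dict, Iterator


def interleave_exact(ratios: Dict[str, float]) -> Iterator[str]:
    """Yield dataset keys such that every prefix follows the given ratios as closely as possible
    (error-diffusion / Bresenham-style scheduling)."""
    total = sum(ratios.values())
    credit = {key: 0.0 for key in ratios}
    while True:
        for key in credit:
            credit[key] += ratios[key] / total
        key = max(credit, key=credit.get)  # the dataset most "owed" a seq right now
        credit[key] -= 1.0
        yield key


# Ratio 1:2 gives a pattern like B A B B A B B A B ...
print(list(itertools.islice(interleave_exact({"A": 1, "B": 2}), 9)))
```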
So far, this issue is mostly here to collect some thoughts on how to implement this, and potentially which features/options it should have.
I'm currently wondering whether we can achieve that simply by modifying the partition_epoch of the child datasets. This would be up to the user. The MixingDataset would then simply alternate the seqs given the configured ratio. But I'm not sure how to handle any remaining left-overs at the end. We could just skip those.
Maybe we can also use get_complete_frac of the child datasets to check that we progress approximately evenly through both datasets?
(cc @dorian-K @NeoLegends)
I have begun working on a second implementation of a MixingDataset in i6_experiments that doesn't require num_seqs and is able to mix more than two datasets. I'll lay out some of my thoughts below:
MixingDataset should have the option to consider the sequence length while mixing:
Let's say Dataset A has only sequences of length 2 and Dataset B has only sequences of length 1.
With a specified ratio of 1:1, the MixingDataset should not simply alternate between the two datasets like this:
['AA', 'B', 'AA', 'B', 'AA', 'B', ...]
but rather return data like this:
['AA', 'B', 'B', 'AA', 'B', 'B', 'AA', ...]
Of course there are valid use cases for both methods, but I believe the choice should be left up to the user.
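A minimal sketch of such a length-aware selection rule, assuming the seq length of each produced seq is reported back to the mixer (all names are made up for illustration, this is not the i6_experiments implementation):

```python
from typing import Dict


class LengthAwareMixer:
    """Picks the next child dataset such that the accumulated *sequence length*
    per dataset follows the requested ratios, not just the seq counts."""

    def __init__(self, ratios: Dict[str, float]):
        total = sum(ratios.values())
        self.target_share = {k: v / total for k, v in ratios.items()}
        self.consumed_len = {k: 0 for k in ratios}  # total length taken from each child so far

    def pick_next(self) -> str:
        total_len = sum(self.consumed_len.values()) or 1
        # choose the dataset that is furthest below its target share of the total length
        return min(self.consumed_len, key=lambda k: self.consumed_len[k] / total_len - self.target_share[k])

    def report(self, key: str, seq_len: int):
        self.consumed_len[key] += seq_len


# Dataset A has only seqs of length 2, dataset B only seqs of length 1, ratio 1:1:
mixer = LengthAwareMixer({"A": 1, "B": 1})
order = []
for _ in range(9):
    key = mixer.pick_next()
    mixer.report(key, 2 if key == "A" else 1)
    order.append(key)
print(order)  # pattern like ['A', 'B', 'B', 'A', 'B', 'B', ...]
```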
The child datasets are all always in the same epoch: It is unlikely that the mixing ratio is chosen such that all child datasets finish at exactly the same time, so it would be nice if each dataset could progress at its own pace (with its own individual epoch number). But of course that would make it unreasonably more difficult to restart training at a specific epoch, so it is not feasible. Instead we keep all of them at the same epoch and progress to the next epoch when some condition is met.
When to progress to the next epoch should also be up to the user: When the first child dataset finishes, all others will (likely) still have data left, so the MixingDataset needs to decide whether to end the epoch right there, or to reset that specific dataset to the beginning (seq index back to 0) and only end the epoch once all datasets have finished at least once.
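To make the two behaviors concrete, this could be exposed as a single option, for example (the option name is made up, just for illustration):

```python
# Hypothetical sketch of such an option; "epoch_end_condition" is not an existing option name.
mixing_dataset_extra_opts = {
    # "first": end the outer epoch as soon as the first child dataset is exhausted;
    #          the remaining data of the other children is not seen in this epoch.
    # "all":   restart exhausted children from seq index 0 and end the outer epoch
    #          only once every child has been fully seen at least once.
    "epoch_end_condition": "all",  # or "first"
}
```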
Maybe I comment here for public visibility:
MixingDataset should have the option to consider the sequence length while mixing
I see your reasoning here, but I wonder whether this is really needed. For our current use case, and for many other use cases, this situation is likely not important enough to care about. (Do you agree? If not, let's collect some stats on the average seq lengths of our data to see how much they really differ.) So, if it is not so important, I would not care about it for now. Only if you think handling it would not really add any complexity, then it's fine.
The child datasets are all always in the same epoch
That makes sense. That certainly makes it somewhat simpler.
Or at least, the real requirement is: the beginning of an outer epoch (an epoch of the MixingDataset) must also be the beginning of an epoch in the child datasets. There is no clean way to start somewhere in the middle of a child epoch.
However, it would be possible to cover multiple child epochs for one outer epoch, if that makes the logic simpler for us.
Instead we keep all of them at the same epoch and progress to the next epoch when some condition is met.
So that means some of the sub datasets could be finished with the current epoch while others are not, possibly having covered only half of their epoch or so? This is problematic: it means we have no guarantee that we really cover all the data.
E.g., instead of epoch-based processing through the datasets, you could also each time randomly sample some seq index and use that seq. But that means: after processing e.g. 10 times as many seqs as your dataset has, in the first case you have the guarantee that every seq was processed exactly 10 times, while in the second case you don't know: some seqs might have been used 20 times, some not at all. In the case of the LibriSpeech text data, we usually do only 5 epochs in total. On even larger datasets, it is not uncommon to do only a single epoch. It is usually undesired to not use all the data. I have never really done exact comparisons, but I'm quite sure that it is better to cover all seqs evenly (i.e. epoch-based processing, i.e. the guarantee that we visit every seq exactly once per epoch).
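To make the coverage argument concrete, a tiny standalone simulation (illustrative only, not related to any dataset code):

```python
import random
from collections import Counter

random.seed(42)
num_seqs = 1000
num_visits = 5 * num_seqs  # corresponds to 5 epochs worth of seqs

# Epoch-based processing: by construction, every seq is visited exactly 5 times.
# Random sampling with replacement instead:
counts = Counter(random.randrange(num_seqs) for _ in range(num_visits))
never_seen = num_seqs - len(counts)
print("seqs never seen:", never_seen, "/", num_seqs)
print("max visits of a single seq:", max(counts.values()))
# Expectation: roughly num_seqs * exp(-5) ~= 7 seqs are never seen at all,
# while the most-sampled seqs are seen well over 10 times.
```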
I would try to minimize the amount of skipped data as much as possible. If we skip a few seqs at the very end, that is maybe still tolerable (if we don't find a better way).
Btw, I have this idea in mind:
The user controls the mixing ratio via partition_epoch in the sub datasets. To demonstrate this on an example: I have two sub datasets, one is the real LS data with 1000h, the other is some TTS data with 1000h, so about the same size. I can set partition_epoch 1 for the LS data and partition_epoch 2 for the TTS data, so one outer epoch would cover about 1500h of speech, of which 1000h is real speech and 500h is TTS.
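As a config sketch for this example (the MixingDataset entry and its "datasets" key are hypothetical; partition_epoch is the existing per-dataset option, and the paths are placeholders):

```python
train = {
    "class": "MixingDataset",  # proposed, not implemented yet
    "datasets": {
        # real LibriSpeech data, ~1000h: one full pass per outer epoch
        "ls": {"class": "OggZipDataset", "path": "ls-1000h.ogg.zip", "partition_epoch": 1},
        # TTS data, ~1000h: half of it per outer epoch
        "tts": {"class": "OggZipDataset", "path": "tts-1000h.ogg.zip", "partition_epoch": 2},
    },
}
# -> one outer epoch covers ~1000h real speech + ~500h TTS = ~1500h.
```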
The MixingDataset would try to keep the progress through the (sub-)epochs of the sub datasets in sync, so that all sub datasets reach the end of their epoch at approximately the same time. We can do this via get_complete_frac: the MixingDataset would always take the next seq from the sub dataset with the lower complete-frac value.
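A minimal sketch of that selection rule, using get_complete_frac of the existing Dataset API (the surrounding bookkeeping, epoch handling, left-over handling etc. is omitted; this is not the actual implementation):

```python
from typing import Dict
from returnn.datasets.basic import Dataset


def pick_next_child(child_datasets: Dict[str, Dataset], next_seq_idx: Dict[str, int]) -> str:
    """Pick the child dataset that is furthest behind in its (sub-)epoch,
    so that all children reach the end of their epoch at about the same time."""
    return min(
        child_datasets,
        key=lambda name: child_datasets[name].get_complete_frac(next_seq_idx[name]),
    )
```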