
Dataset batching like ESPnet support

albertz opened this issue 1 year ago

ESPnet basically does it like this:

  • Sort the whole dataset. (The dataset could maybe be directly stored in a sorted way. This would speed up the random access later.)
  • Create batches for the whole dataset (or just some meta information on which sequences would go together in a batch). This has only minimal padding now, and we also know the number of batches in advance.
  • Shuffle the order of the batches, i.e. randomly sample from the batches.

It's not clear whether this is really better than what we do, but in any case, it would be good if we could support this scheme as well, just for comparison.
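
For concreteness, here is a minimal sketch of this scheme (plain Python; neither ESPnet nor RETURNN code, all names are made up), assuming the length of every sequence is known up front:

```python
import random
from typing import Dict, List


def make_shuffled_batches(
    seq_lens: Dict[str, int], max_frames_per_batch: int, seed: int = 42
) -> List[List[str]]:
    """seq_lens: seq tag -> length in frames. Returns a list of batches (lists of seq tags)."""
    # 1. Sort the whole dataset by length.
    tags_sorted = sorted(seq_lens, key=lambda tag: seq_lens[tag])
    # 2. Create batches over the whole sorted dataset. Padding is minimal because
    #    neighboring seqs have similar lengths, and the number of batches is known now.
    batches: List[List[str]] = []
    cur: List[str] = []
    for tag in tags_sorted:
        cur.append(tag)
        # Padded size = num seqs * longest seq in the batch (ascending order, so the current seq is the longest).
        if len(cur) * seq_lens[tag] >= max_frames_per_batch:
            batches.append(cur)
            cur = []
    if cur:
        batches.append(cur)
    # 3. Shuffle the order of the batches only; seqs within a batch stay together.
    random.Random(seed).shuffle(batches)
    return batches
```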

For reference, the ESPnet code:

  • https://github.com/espnet/espnet/blob/master/egs2/librispeech/asr1/run.sh
  • https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/asr1/asr.sh
    • Stage 1: Calls local/data.sh, to download and extract data, create wav.scp
      • https://github.com/espnet/espnet/blob/master/egs2/librispeech/asr1/local/data.sh
      • https://github.com/espnet/espnet/blob/master/egs2/librispeech/asr1/local/data_prep.sh
    • Stage 2: Speed perturbation (factors 0.9, 1.0, 1.1)
    • Stage 3: (We have feats_type == raw.) Copy data, format_wav_scp.sh.
      • https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/asr1/scripts/audio/format_wav_scp.sh
      • https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/asr1/pyscripts/audio/format_wav_scp.py
      • utt2num_samples: fnum_samples.write(f"{uttid} {len(wave)}\n") (no feat dim)
    • Stage 4: Filter data by min_wav_duration=0.1 (sec), max_wav_duration=30 (sec).
    • Stage 5: Train BPE.
    • "Data preparation is done here."
    • Stage 11: ASR training
  • In ASR training:
    • Example train config uses: batch_type: numel, batch_bins: 140000000
    • build_batch_sampler creates NumElementsBatchSampler: https://github.com/espnet/espnet/blob/master/espnet2/samplers/num_elements_batch_sampler.py
      • Sort utterances ("samples") in ascending order of length, given by the first data key (the keys in the code are the seq tags).
      • When bins > batch_bins and len(current_batch_keys) >= min_batch_size, start a new batch (see the short sketch after this list).
      • Iterate through the whole dataset (just a list of seq tags with seq lengths) to create the list of batches for the whole dataset.
      • This logic is performed once at training startup.
    • With the above settings and 16 kHz audio, a batch covers 140,000,000 / 16,000 = 8,750 seconds. But they also use 8 GPUs (num_workers: 8), and the batch is distributed across the workers (in AbsTask.build_sequence_iter_factory), i.e. each GPU gets 1,093.75 seconds. (For comparison, our usual batch size is more like 40,000 frames / 100 frames per second = 400 seconds per GPU.)
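
To make the quoted batch_bins condition concrete, here is a simplified sketch of that loop (only one data key, no feature-dim accounting; the real NumElementsBatchSampler handles all data keys):

```python
from typing import List, Tuple


def build_batches_numel(
    sorted_lens: List[Tuple[str, int]], batch_bins: int, min_batch_size: int = 1
) -> List[List[str]]:
    """sorted_lens: (seq tag, length in samples) pairs, sorted ascending by length."""
    batches: List[List[str]] = []
    current_batch_keys: List[str] = []
    for tag, length in sorted_lens:
        current_batch_keys.append(tag)
        # Padded size of the current batch: num seqs * longest seq (the current one, due to sorting).
        bins = len(current_batch_keys) * length
        if bins > batch_bins and len(current_batch_keys) >= min_batch_size:
            batches.append(current_batch_keys)
            current_batch_keys = []
    if current_batch_keys:
        batches.append(current_batch_keys)
    return batches


# With batch_bins = 140_000_000 samples at 16 kHz:
#   140_000_000 / 16_000 = 8_750 seconds per batch,
#   8_750 / 8 workers = 1_093.75 seconds per GPU.
```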

The question is, how to implement this in RETURNN? Our shuffling logic currently happens before the batching, not after.

Some options:

  • Some offline processing of the dataset, which builds such a table for the list of batches.
    • The dataset sequence order would use this table, and shuffle only across batches, i.e. keep seqs of the same batch together.
    • The batching logic would also use this table.
  • The dataset sequence ordering has another mode specifically for this. But then it must know exactly about the batching logic/parameters (batch size etc.). In the batching afterwards, we should probably also have some sanity checks that the batching is as the sequence ordering expected it.
    • We can also introduce a new data key batch_idx, specifically designed for this, to tell the later parts of the pipeline (e.g. BatchingIterDataPipe in PT) what to put together in one batch. When the dataset is ordered in this special way, it can make use of this special data key (see the sketch after this list).
  • The dataset sequence ordering has another mode for this, and we also introduce a new way that the dataset sequence ordering can already prepare the batches and store them in some new list _seq_order_prepared_batches or so.
  • Maybe some new way for the user to provide custom sequence ordering + batching in a combined way, via some custom user function, where it is all up to the user.
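
To illustrate the batch_idx idea (second option above), here is a hypothetical sketch of what the sequence-ordering side could produce; none of these names are existing RETURNN API, and the prepared batches are assumed to come from a scheme like the one sketched further up:

```python
import random
from typing import List, Tuple


def seq_order_with_batch_idx(batches: List[List[int]], epoch: int) -> Tuple[List[int], List[int]]:
    """
    batches: prepared list of batches, each a list of seq indices (e.g. built once offline).
    Returns (seq_order, batch_idx): the flat seq order for this epoch, shuffled across batches
    but keeping the seqs of one batch together, plus a parallel per-position batch index.
    """
    batches = list(batches)
    random.Random(epoch).shuffle(batches)  # shuffle across batches only
    seq_order = [seq for batch in batches for seq in batch]
    batch_idx = [b for b, batch in enumerate(batches) for _ in batch]
    return seq_order, batch_idx
```

The later batching step would then only need to cut a new batch whenever batch_idx changes, and could sanity-check the resulting batches against the configured batch size limits.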

Related:

Via online_shuffle_batches + laplace seq ordering, you can get similar behavior in a (technically) much easier way. However, it adds some delay at the beginning of every epoch, it increases CPU memory consumption, and it still has a bit more zero padding than the method proposed here. Note that I had a case with large batch sizes (80k frames, 2k seqs), and thus also large laplace seq ordering bucket sizes (100k seqs), where training was very unstable/bad without this online batch shuffling. So there are cases where this can be important.
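
As a rough config sketch of this easier variant, roughly matching the numbers from the case above; the exact type and semantics of online_shuffle_batches are an assumption here and should be checked against the RETURNN docs:

```python
# Hypothetical RETURNN config fragment; dataset details are placeholders.
train = {
    "class": "HDFDataset",
    "files": ["train.hdf"],
    # laplace sorting with ~100k seqs per bucket, as in the case described above
    "seq_ordering": "laplace:.100000",
}
batch_size = 80_000  # frames
max_seqs = 2_000
online_shuffle_batches = 100  # assumption: shuffle-buffer size in batches; verify against RETURNN
```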

albertz commented on Jan 31, 2024

  1. This batching style does not seem very scalable to me. What do they do in ESPnet when the dataset grows really large? Maybe w/ offline preprocessing it can be made scalable, but I still struggle to see how this would work w/ DistributeFilesDataset. Maybe if you ensure the data that is supposed to go into one batch always stays within one HDF/primitive file that is distributed by the dataset... Still, shuffling will be slightly different.
  2. Does any existing dataset currently offer custom sorting of the segments based on a dataset key? This batch_idx could be created in preprocessing, and if used as the sort key, the batching can be done efficiently. Having the key does not solve the batch shuffling issue though, if we want true shuffling and not online shuffling... which brings me to the next question:
  3. If we assume we can build batches w/ minimal padding from the entire sorted dataset, do you think the online shuffling is "enough" and adds sufficient randomness vs. true (full) batch shuffling? Probably to replicate ESPnet fully we need the exact option as well...

NeoLegends commented on Nov 11, 2024

I don't exactly understand your concerns with DistributeFilesDataset. We would do it the same way as we already do now with DistributeFilesDataset: the sorting logic is handled by the inner dataset, not by DistributeFilesDataset, and not across (sub)epochs. I assume this is what you already do as well (e.g. with laplace sorting), and it would be the same for the suggested sorting scheme here. Thus, DistributeFilesDataset does not affect this at all.

This scheme should work for all cases where we currently use get_seq_order_for_epoch (with some sorting, e.g. laplace). It's basically just another extension to get_seq_order_for_epoch.

There is also no point in online shuffling. If you can use get_seq_order_for_epoch, that means you can do it offline, which is much easier and much more efficient.

I think the remaining questions for this issue are purely technical, i.e. how exactly to do it. Should this just be another scheme for get_seq_order_for_epoch, or a separate function which is called afterwards? Where would we add the batch_idx?
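
For illustration, a hypothetical sketch of the consumer side, i.e. a batching step in the data pipeline (in the spirit of BatchingIterDataPipe in the PT backend) that just follows batch_idx instead of applying its own batch-size logic; this is not existing RETURNN code:

```python
from typing import Dict, Iterable, Iterator, List


def batch_by_batch_idx(seqs: Iterable[Dict]) -> Iterator[List[Dict]]:
    """seqs: per-seq dicts in seq order, each containing a scalar "batch_idx" entry."""
    batch: List[Dict] = []
    cur_idx = None
    for seq in seqs:
        if batch and seq["batch_idx"] != cur_idx:
            yield batch  # batch_idx changed -> the previous batch is complete
            batch = []
        cur_idx = seq["batch_idx"]
        batch.append(seq)
    if batch:
        yield batch
```

A sanity check here could verify that each emitted batch stays within the configured batch_size / max_seqs limits, to catch mismatches between the sequence ordering and the batching parameters.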

(Note, for the case where you cannot or do not want to use get_seq_order_for_epoch, or specifically not the sorting there, you really need to do it on the fly, but this is a bit off topic here. I think we can discuss this in a separate issue. It would not really be the batching that we discuss here in this issue; some sort of batch bucketing makes more sense then.)

albertz commented on Nov 11, 2024