
Support dataset batching like ESPnet


ESPnet basically does it like this:

  • Sort the whole dataset. (The dataset could maybe be directly stored in a sorted way. This would speed up the random access later.)
  • Create batches for the whole dataset (or just some meta information on which sequences go together in a batch). Since neighboring sequences have similar lengths, this needs only minimal padding. Also, the number of batches is known in advance.
  • Shuffle the order of the batches, i.e. randomly sample from the batches.

It's not clear whether this is really better than what we do, but in any case, it would be good if we could support this scheme as well, just for comparison.
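
To make the scheme concrete, here is a minimal sketch in Python. All names are hypothetical (this is neither ESPnet nor RETURNN code), and for simplicity it packs a fixed number of seqs per batch, whereas ESPnet's actual criterion is by element count (see the second sketch further below):

```python
# Minimal sketch of the scheme (hypothetical names, not ESPnet or RETURNN code).
# For simplicity, this packs a fixed number of seqs per batch; ESPnet's actual
# criterion is by element count (see the second sketch further below).
import random


def sort_batch_shuffle(seq_lens: dict, seqs_per_batch: int, rng: random.Random):
    # 1. Sort the whole dataset by seq length (ascending).
    sorted_tags = sorted(seq_lens, key=lambda tag: seq_lens[tag])
    # 2. Create batches for the whole dataset. Neighboring seqs have similar
    #    lengths, so padding is minimal, and the number of batches is known
    #    in advance.
    batches = [sorted_tags[i:i + seqs_per_batch]
               for i in range(0, len(sorted_tags), seqs_per_batch)]
    # 3. Shuffle only the order of the batches, not the seqs within a batch.
    rng.shuffle(batches)
    return batches


# Example with made-up seq lengths (in samples):
rng = random.Random(42)
lens = {f"seq-{i}": rng.randint(50, 3000) for i in range(100)}
print(sort_batch_shuffle(lens, seqs_per_batch=8, rng=rng)[0])
```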

For reference, the ESPnet code:

  • https://github.com/espnet/espnet/blob/master/egs2/librispeech/asr1/run.sh
  • https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/asr1/asr.sh
    • Stage 1: Calls local/data.sh, to download and extract data, create wav.scp
      • https://github.com/espnet/espnet/blob/master/egs2/librispeech/asr1/local/data.sh
      • https://github.com/espnet/espnet/blob/master/egs2/librispeech/asr1/local/data_prep.sh
    • Stage 2: Speed perturbation (factors 0.9, 1.0, 1.1)
    • Stage 3: (We have feats_type == "raw".) Copy data, format_wav_scp.sh.
      • https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/asr1/scripts/audio/format_wav_scp.sh
      • https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/asr1/pyscripts/audio/format_wav_scp.py
      • utt2num_samples: fnum_samples.write(f"{uttid} {len(wave)}\n") (no feat dim)
    • Stage 4: Filter data by min_wav_duration=0.1 (sec), max_wav_duration=30 (sec).
    • Stage 5: Train BPE.
    • "Data preparation is done here."
    • Stage 11: ASR training
  • In ASR training:
    • Example train config uses: batch_type: numel, batch_bins: 140000000
    • build_batch_sampler creates NumElementsBatchSampler: https://github.com/espnet/espnet/blob/master/espnet2/samplers/num_elements_batch_sampler.py
      • Sort utterances ("samples") in ascending order of length (taking the length from the first data key; the keys in the code are the seq tags).
      • When bins > batch_bins and len(current_batch_keys) >= min_batch_size, make a new batch.
      • Iterate through the whole dataset (just a list of seq tags with seq lengths) to create the list of batches for the whole dataset.
      • This logic is performed once at training startup.
    • With the above settings and 16 kHz audio, a batch covers 140,000,000 / 16,000 = 8,750 seconds. But they also use 8 GPUs (num_workers: 8), and the batch is distributed across the workers (in AbsTask.build_sequence_iter_factory), i.e. each GPU gets 1,093.75 seconds. (For comparison, our usual batch size is more like 40,000 / 100 = 400 seconds per GPU.) See also the sketch after this list.
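
As an illustration, here is a simplified sketch of that batching loop, mirroring the condition described above, together with the batch-size arithmetic. The real NumElementsBatchSampler sums element counts over all data keys; counting only the raw samples here is a simplification:

```python
# Simplified NumElementsBatchSampler-like loop. Input is assumed to be
# (seq_tag, num_samples) pairs, already sorted ascending by length.
def build_batches(sorted_seqs, batch_bins: int, min_batch_size: int = 1):
    batches, current_batch_keys, bins = [], [], 0
    for tag, num_samples in sorted_seqs:
        current_batch_keys.append(tag)
        bins += num_samples
        # Condition as described above: close the batch once it is full enough.
        if bins > batch_bins and len(current_batch_keys) >= min_batch_size:
            batches.append(current_batch_keys)
            current_batch_keys, bins = [], 0
    if current_batch_keys:
        batches.append(current_batch_keys)
    return batches


# Usage with made-up (seq_tag, num_samples) pairs:
seqs = [("a", 16_000), ("b", 32_000), ("c", 48_000), ("d", 80_000)]
print(build_batches(seqs, batch_bins=60_000))  # -> [['a', 'b', 'c'], ['d']]

# Batch-size arithmetic from above, for 16 kHz raw audio and 8 workers:
print(140_000_000 / 16_000)      # 8750.0 seconds of audio per batch
print(140_000_000 / 16_000 / 8)  # 1093.75 seconds per GPU
print(40_000 / 100)              # ~400 seconds per GPU with a typical RETURNN setting
```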

The question is, how to implement this in RETURNN? Our shuffling logic currently happens before the batching, not after.

Some options:

  • Some offline processing of the dataset, which builds such a table for the list of batches (see the sketch after this list).
    • The dataset sequence order would use this table, and shuffle only across batches, i.e. keep seqs of the same batch together.
    • The batching logic would also use this table.
  • The dataset sequence ordering gets another mode specifically for this. However, this mode must then know exactly about the batching logic/parameters (batch size etc.). The batching afterwards should probably also have some sanity checks that the batches come out as the sequence ordering expected.
  • The dataset sequence ordering gets another mode for this, and additionally we introduce a new way for the dataset sequence ordering to already prepare the batches and store them in some new list _seq_order_prepared_batches or so.
  • Maybe some new way for the user to provide custom sequence ordering + batching in a combined way, via some custom user function, where it is all up to the user.
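
For the first option, here is a hypothetical sketch of the intended data flow: an offline-built batch table drives both the dataset sequence order (shuffling only across batches) and the batching, which can then be sanity-checked against the table. None of the names below are existing RETURNN API:

```python
# Hypothetical sketch for the first option (none of these names are existing
# RETURNN API; this only illustrates the intended data flow).
import random


def seq_order_from_batch_table(batch_table, rng: random.Random):
    """Shuffle across batches only, keeping seqs of the same batch together.

    batch_table: list of batches, each a list of seq indices, built offline
    (e.g. by sorting by length and packing up to some size budget).
    """
    batch_order = list(range(len(batch_table)))
    rng.shuffle(batch_order)
    seq_order = [seq_idx for batch_idx in batch_order
                 for seq_idx in batch_table[batch_idx]]
    # The batching logic afterwards would consume the same table. A sanity
    # check could verify that the batch boundaries it produces match the
    # expected batch sizes, in the shuffled order.
    expected_batch_sizes = [len(batch_table[i]) for i in batch_order]
    return seq_order, expected_batch_sizes


# Usage with a made-up batch table:
table = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]
order, sizes = seq_order_from_batch_table(table, random.Random(0))
print(order, sizes)
```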

albertz · Jan 31 '24 12:01