
Lengthy dataloader restoration with NLP datasets

Open alextrott16 opened this issue 2 years ago • 2 comments

In trainer._train_loop, getting the dataloader back to the proper minibatch index is very time-consuming when using giant NLP datasets (e.g., for training BERT). The dataset is so large that normal training only requires one epoch.

Perhaps we can get the same functionality as the below lines of code without having to cycle through a potentially huge amount of data?

for batch_idx, self.state.batch in enumerate(self._iter_dataloader()):

    # if resuming, skip dataloader forward to the minibatch index
    if batch_idx < int(self.state.timestamp.batch_in_epoch):
        # Restore the RNG state immediately before the next batch is yielded from the dataloader
        if batch_idx + 1 == int(self.state.timestamp.batch_in_epoch) and self._rng_state is not None:
            reproducibility.load_rng_state(self._rng_state)
            self._rng_state = None
        continue
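
One possible direction to avoid the replay (a minimal sketch, not Composer's API): rebuild the dataloader with a sampler that starts at the resume offset, so already-seen samples are never fetched. This ignores shuffling and the RNG restoration handled above, and the names SkipFirstSampler / make_resumed_loader are hypothetical.

from torch.utils.data import DataLoader, Sampler

class SkipFirstSampler(Sampler):
    """Yield map-style dataset indices in order, skipping the first `start_index` samples."""

    def __init__(self, dataset_length: int, start_index: int = 0):
        self.dataset_length = dataset_length
        self.start_index = start_index

    def __iter__(self):
        return iter(range(self.start_index, self.dataset_length))

    def __len__(self):
        return self.dataset_length - self.start_index

def make_resumed_loader(dataset, batch_size: int, batches_seen: int) -> DataLoader:
    # Skip whole batches of samples so the first yielded batch is the one that
    # would have come next before the interruption, without fetching the rest.
    sampler = SkipFirstSampler(len(dataset), start_index=batches_seen * batch_size)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)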

alextrott16 · May 26 '22 01:05

(Note: I think part of the slowness is that it streams the entire dataset in, instead of fast-forwarding without downloading every sample.)

moinnadeem · May 26 '22 01:05

The fast-forwarding behavior is required for CV, since it needs to replay the random augmentations to set the RNG state properly. We could add a specific shortcut for NLP to our streaming dataloader, but it would lose the reproducibility guarantees if your dataloader invokes any RNG.
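
For context, a rough sketch (not Composer's implementation) of the kind of RNG checkpointing that reproducibility.load_rng_state performs; because CV augmentations consume RNG draws per sample, the saved state only matches the original run if the dataloader is replayed up to the batch where the state was captured.

import random
import torch

def capture_rng_state():
    # Snapshot the generators that data augmentations typically draw from.
    return {
        'python': random.getstate(),
        'torch': torch.get_rng_state(),
    }

def restore_rng_state(state):
    # Restore the generators so subsequent augmentation draws match the original run.
    random.setstate(state['python'])
    torch.set_rng_state(state['torch'])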

hanlint · May 31 '22 23:05

Closing since this isn't really something we can fix in Composer -- it's a fundamental dataloader problem. I hear https://github.com/mosaicml/streaming will have instantaneous mid-epoch resumption soon though O_O

mvpatel2000 · Nov 03 '22 03:11