Lengthy dataloader restoration with NLP datasets
In `trainer._train_loop`, getting the dataloaders back to the proper minibatch index is very time-consuming when using giant NLP datasets (e.g., for training BERT). The dataset is so large that normal training only requires a single epoch.
Perhaps we can get the same functionality as the lines of code below without having to cycle through a potentially huge amount of data?
```python
for batch_idx, self.state.batch in enumerate(self._iter_dataloader()):
    # if resuming, skip dataloader forward to the minibatch index
    if batch_idx < int(self.state.timestamp.batch_in_epoch):
        # Restore the RNG state immediately before the next batch is yielded from the dataloader
        if batch_idx + 1 == int(self.state.timestamp.batch_in_epoch) and self._rng_state is not None:
            reproducibility.load_rng_state(self._rng_state)
            self._rng_state = None
        continue
```
(Note: I think part of the problem is that it streams the entire dataset in, instead of fast-forwarding without downloading every sample.)
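One rough sketch of the kind of shortcut I mean (not Composer code; the `SkipFirstSampler` name and the map-style-dataset assumption are mine): do the skip at the sampler level, where only integer indices are consumed, instead of pulling every skipped batch through the dataloader.

```python
from torch.utils.data import Sampler

class SkipFirstSampler(Sampler):
    """Hypothetical sampler that drops the first `num_skip` indices of a base
    sampler, so resumption never materializes the skipped samples."""

    def __init__(self, base_sampler: Sampler, num_skip: int):
        self.base_sampler = base_sampler
        self.num_skip = num_skip

    def __iter__(self):
        for i, index in enumerate(self.base_sampler):
            # Indices are plain ints, so this loop is cheap -- no samples are
            # downloaded, decoded, or collated for the skipped portion.
            if i >= self.num_skip:
                yield index

    def __len__(self):
        return max(len(self.base_sampler) - self.num_skip, 0)

# e.g. on resume, skip batch_in_epoch * batch_size samples (map-style datasets only):
# sampler = SkipFirstSampler(base_sampler, num_skip=batch_in_epoch * batch_size)
# dataloader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```

This only addresses data fetching; as the reply below points out, it does not replay any RNG draws that dataset-side augmentations would have consumed.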
The fast-forwarding behavior is required for CV, since it needs to replay the random augmentations to set the RNG state properly. We could add an NLP-specific shortcut to our streaming dataloader, but it would lose the reproducibility guarantees if your dataloader invokes the RNG at all.
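A toy illustration of why that matters (illustrative only, not Composer's augmentation code): every augmented sample consumes RNG state, so skipping samples without replaying the draws leaves the generator in a different state than an uninterrupted run would have.

```python
import random

def augment(sample, rng=random):
    # Stand-in for a random crop/flip: each call consumes one RNG draw.
    return sample + rng.random()

random.seed(0)
full_run = [augment(x) for x in range(6)]               # uninterrupted epoch

random.seed(0)
_ = [augment(x) for x in range(3)]                      # replayed draws before the resume point
resumed_with_replay = [augment(x) for x in range(3, 6)]

random.seed(0)
resumed_with_skip = [augment(x) for x in range(3, 6)]   # jumped ahead, no replay

assert resumed_with_replay == full_run[3:]              # matches the original run
assert resumed_with_skip != full_run[3:]                # augmentations diverge
```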
Closing since this isn't really something we can fix in Composer -- it's a fundamental dataloader problem. I hear https://github.com/mosaicml/streaming will have instantaneous mid-epoch resumption soon though O_O
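For what it's worth, a rough sketch of how instantaneous mid-epoch resumption can work in principle (this is not the streaming library's actual API): make the per-epoch shuffle a pure function of (seed, epoch), checkpoint only a global sample offset, and map that offset back to a shard position on resume so none of the skipped data has to be fetched.

```python
import random
from typing import List, Tuple

def shard_order(seed: int, epoch: int, num_shards: int) -> List[int]:
    # Deterministic per-epoch shuffle: recomputable on resume with no I/O.
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(range(num_shards))
    rng.shuffle(order)
    return order

def resume_position(sample_offset: int, samples_per_shard: int) -> Tuple[int, int]:
    # Map a checkpointed global sample offset to (shard slot, index in shard),
    # so only the shard actually needed at the resume point gets downloaded.
    return divmod(sample_offset, samples_per_shard)

order = shard_order(seed=42, epoch=0, num_shards=8)
shard_slot, index_in_shard = resume_position(sample_offset=12_345, samples_per_shard=2_048)
shard_to_fetch = order[shard_slot]
```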