torchtitan
IBM experimental dataloaders
This PR introduces an experimental PyTorch-native dataloader from IBM that is distributed, stateful, checkpointable, composable, and rescalable. It is intended for large-scale model pretraining, particularly in research settings where rapid iteration across datasets may be required. It automatically and transparently handles data sharding, shuffling, subdataset weighting, checkpoint saving and loading, and more, with minimal overhead and high throughput.
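To make the stateful/checkpointable behavior concrete, here is a minimal sketch of the kind of interface involved. The class and method names (`CheckpointableDataset`, `state_dict`, `load_state_dict`) and the strided sharding scheme are illustrative assumptions, not the PR's actual implementation:

```python
# Illustrative sketch only: the PR's real classes and method names may differ.
from typing import Any, Dict, Iterator, List

from torch.utils.data import IterableDataset


class CheckpointableDataset(IterableDataset):
    """Toy stateful dataset: shards documents across ranks and can
    save/restore its read position, so training can resume mid-epoch."""

    def __init__(self, docs: List[str], rank: int, world_size: int) -> None:
        # Each rank sees a disjoint, strided shard of the documents.
        self.shard = docs[rank::world_size]
        self.position = 0  # index of the next document to emit

    def __iter__(self) -> Iterator[str]:
        while self.position < len(self.shard):
            doc = self.shard[self.position]
            self.position += 1  # advance before yielding so state is resumable
            yield doc

    def state_dict(self) -> Dict[str, Any]:
        # Small, rank-local state: enough to resume exactly where we left off.
        return {"position": self.position}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self.position = state["position"]


if __name__ == "__main__":
    docs = [f"doc-{i}" for i in range(8)]
    ds = CheckpointableDataset(docs, rank=0, world_size=2)
    it = iter(ds)
    print(next(it), next(it))       # consume two documents
    saved = ds.state_dict()         # checkpoint the loader state

    resumed = CheckpointableDataset(docs, rank=0, world_size=2)
    resumed.load_state_dict(saved)  # restore the saved position
    print(list(iter(resumed)))      # remaining documents, no repeats
```

Saving the loader state alongside the model checkpoint is what lets training resume mid-epoch without repeating or skipping data.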
- Add experimental dataset source file
- Add experimental dataloader builder, hooked into the torchtitan cfg
- Update the torchtitan cfg with additional dataset arg fields
- Update the train script to build the experimental dataloader instead of the Hugging Face one, depending on cfg flags (see the sketch after this list)
- Replace the existing C4-mini example dataset with one that matches the expected formatting for the experimental dataloader
- TODO: port over unit tests as well
- TODO: preprocessing script(s) for the new dataset format
- TODO: further cleanup/iteration
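A rough sketch of the cfg-gated builder switch mentioned in the list above, under assumed names; the flag `use_experimental_loader` and the builder function names are hypothetical, not the PR's actual identifiers:

```python
# Hypothetical sketch of the cfg-driven dataloader switch; the actual flag
# and builder names in the PR may differ.
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    dataset: str = "c4_mini"
    batch_size: int = 8
    # Assumed new cfg field gating the experimental IBM dataloader:
    use_experimental_loader: bool = False


def build_hf_dataloader(cfg: TrainingConfig):
    ...  # existing Hugging Face-based loader


def build_experimental_dataloader(cfg: TrainingConfig):
    ...  # new IBM distributed, stateful, checkpointable loader


def build_dataloader(cfg: TrainingConfig):
    # The train script selects the loader implementation from cfg flags.
    if cfg.use_experimental_loader:
        return build_experimental_dataloader(cfg)
    return build_hf_dataloader(cfg)
```

Gating the new loader behind a cfg flag keeps the existing HF path as the default, so the experimental code can be iterated on without disrupting current training runs.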