
IBM experimental dataloaders

Open · daviswer opened this pull request 8 months ago · 5 comments

This PR introduces an experimental PyTorch-native dataloader from IBM that is distributed, stateful, checkpointable, composable and rescalable. It is intended for use in large-scale model pretraining, particularly in research settings where rapid iteration between datasets may be required. It automatically and invisibly handles data sharding, shuffling, subdataset weighting, checkpoint saving and loading, and more, with minimal overhead and high throughput.
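The core property described above is a loader that can checkpoint and restore its position mid-epoch while sharding data across ranks. A minimal sketch of that interface, assuming a `state_dict`/`load_state_dict` convention like PyTorch modules use (class and method names here are illustrative, not the actual IBM API; a real version would subclass `torch.utils.data.IterableDataset`):

```python
class StatefulShardedDataset:
    """Illustrative sketch: yields this rank's shard of the data and can
    save/restore its position for checkpointing. Not the actual IBM API."""

    def __init__(self, data, rank=0, world_size=1):
        # Shard by striding: rank r sees items r, r+world_size, ...
        self.data = data
        self.rank = rank
        self.world_size = world_size
        self.position = 0  # index into this rank's shard

    def __iter__(self):
        shard = self.data[self.rank::self.world_size]
        while self.position < len(shard):
            item = shard[self.position]
            self.position += 1
            yield item

    def state_dict(self):
        # Everything needed to resume mid-epoch after a checkpoint.
        return {"position": self.position}

    def load_state_dict(self, state):
        self.position = state["position"]


# Resuming mid-epoch: consume two items, checkpoint, restore elsewhere.
ds = StatefulShardedDataset(list(range(10)), rank=0, world_size=2)
it = iter(ds)
first_two = [next(it), next(it)]  # rank 0's shard is [0, 2, 4, 6, 8]
ckpt = ds.state_dict()

resumed = StatefulShardedDataset(list(range(10)), rank=0, world_size=2)
resumed.load_state_dict(ckpt)
rest = list(iter(resumed))  # continues from the saved position
```

Rescalability would additionally require remapping saved positions when `world_size` changes between runs, which this sketch omits.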

  • Add experimental dataset source file
  • Add experimental dataloader builder, hooked into torchtitan cfg
  • Update torchtitan cfg with additional dataset arg fields
  • Update train script to build the experimental dataloader instead of the HF one, depending on cfg flags
  • Replace the existing C4-mini example dataset with one that matches the expected formatting for the experimental dataloader
  • TODO: port over unit tests as well
  • TODO: preprocessing script(s) for the new dataset format
  • TODO: further cleanup/iteration
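The cfg-flag dispatch in the train script could look something like the sketch below. The flag name `use_experimental_dataloader`, the config fields, and the builder names are all assumptions for illustration; they do not reflect the actual torchtitan config schema:

```python
from dataclasses import dataclass


@dataclass
class DataConfig:
    """Hypothetical slice of the torchtitan cfg relevant to data loading."""
    dataset_path: str = "c4_mini"
    use_experimental_dataloader: bool = False  # assumed flag name


def build_hf_dataloader(cfg):
    # Stand-in for the existing HuggingFace-based builder.
    return f"hf loader for {cfg.dataset_path}"


def build_experimental_dataloader(cfg):
    # Stand-in for the new IBM experimental builder.
    return f"experimental loader for {cfg.dataset_path}"


def build_dataloader(cfg):
    # Train script selects the builder based on the cfg flag.
    if cfg.use_experimental_dataloader:
        return build_experimental_dataloader(cfg)
    return build_hf_dataloader(cfg)


default_loader = build_dataloader(DataConfig())
experimental_loader = build_dataloader(DataConfig(use_experimental_dataloader=True))
```

Keeping both builders behind one dispatch point lets the experimental path be adopted incrementally without touching the existing HF flow.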

daviswer, May 31 '24 07:05