
IBM experimental dataloaders

Open · daviswer opened this pull request 8 months ago · 5 comments

This PR introduces an experimental PyTorch-native dataloader from IBM that is distributed, stateful, checkpointable, composable and rescalable. It is intended for use in large-scale model pretraining, particularly in research settings where rapid iteration between datasets may be required. It automatically and invisibly handles data sharding, shuffling, subdataset weighting, checkpoint saving and loading, and more, with minimal overhead and high throughput.
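The core property described above is a loader that can checkpoint and restore its position mid-epoch while sharding data across ranks. A minimal sketch of that interface, assuming a `state_dict`/`load_state_dict` convention like PyTorch modules use (class and method names here are illustrative, not the actual IBM API; a real version would subclass `torch.utils.data.IterableDataset`):

```python
class StatefulShardedDataset:
    """Illustrative sketch: yields this rank's shard of the data and can
    save/restore its position for checkpointing. Not the actual IBM API."""

    def __init__(self, data, rank=0, world_size=1):
        # Shard by striding: rank r sees items r, r+world_size, ...
        self.data = data
        self.rank = rank
        self.world_size = world_size
        self.position = 0  # index into this rank's shard

    def __iter__(self):
        shard = self.data[self.rank::self.world_size]
        while self.position < len(shard):
            item = shard[self.position]
            self.position += 1
            yield item

    def state_dict(self):
        # Everything needed to resume mid-epoch after a checkpoint.
        return {"position": self.position}

    def load_state_dict(self, state):
        self.position = state["position"]


# Resuming mid-epoch: consume two items, checkpoint, restore elsewhere.
ds = StatefulShardedDataset(list(range(10)), rank=0, world_size=2)
it = iter(ds)
first_two = [next(it), next(it)]  # rank 0's shard is [0, 2, 4, 6, 8]
ckpt = ds.state_dict()

resumed = StatefulShardedDataset(list(range(10)), rank=0, world_size=2)
resumed.load_state_dict(ckpt)
rest = list(iter(resumed))  # continues from the saved position
```

Rescalability would additionally require remapping saved positions when `world_size` changes between runs, which this sketch omits.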

  • Add experimental dataset source file
  • Add experimental dataloader builder, hooked into torchtitan cfg
  • Update torchtitan cfg with additional dataset arg fields
  • Update train script to build the experimental dataloader instead of the HF one, depending on cfg flags
  • Replace the existing C4-mini example dataset with one that matches the expected formatting for the experimental dataloader
  • TODO: port over unit tests as well
  • TODO: preprocessing script(s) for the new dataset format
  • TODO: further cleanup/iteration
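The cfg-flag dispatch in the train script could look something like the sketch below. The flag name `use_experimental_dataloader`, the config fields, and the builder names are all assumptions for illustration; they do not reflect the actual torchtitan config schema:

```python
from dataclasses import dataclass


@dataclass
class DataConfig:
    """Hypothetical slice of the torchtitan cfg relevant to data loading."""
    dataset_path: str = "c4_mini"
    use_experimental_dataloader: bool = False  # assumed flag name


def build_hf_dataloader(cfg):
    # Stand-in for the existing HuggingFace-based builder.
    return f"hf loader for {cfg.dataset_path}"


def build_experimental_dataloader(cfg):
    # Stand-in for the new IBM experimental builder.
    return f"experimental loader for {cfg.dataset_path}"


def build_dataloader(cfg):
    # Train script selects the builder based on the cfg flag.
    if cfg.use_experimental_dataloader:
        return build_experimental_dataloader(cfg)
    return build_hf_dataloader(cfg)


default_loader = build_dataloader(DataConfig())
experimental_loader = build_dataloader(DataConfig(use_experimental_dataloader=True))
```

Keeping both builders behind one dispatch point lets the experimental path be adopted incrementally without touching the existing HF flow.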

daviswer, May 31 '24 07:05