[Feature] a very simple hugging-face dataloader

Open sunpengsdu opened this issue 1 year ago • 1 comments

Describe the feature

A very simple on-the-fly dataloader is needed to support most public datasets.

Will you implement it?

  • [X] I would like to implement this feature and create a PR!

sunpengsdu avatar Mar 21 '24 02:03 sunpengsdu

Completed in https://github.com/InternLM/InternEvo/pull/244

  1. Load Hugging Face datasets in streaming mode, i.e. lazily load data samples, with no need to download the whole dataset before training
  2. On-the-fly tokenization
  3. Support auto_resume for the HF dataloader
  4. Support packing for the HF dataloader to improve hardware utilization
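
For readers unfamiliar with the terms, the four points above can be sketched roughly as follows. This is a minimal illustration only, not the code from the PR: `pack_stream`, `toy_tokenize`, `seq_len`, and `skip_samples` are hypothetical names, and the toy tokenizer stands in for a real one. With the Hugging Face `datasets` library, the lazy sample stream would typically come from `load_dataset(..., streaming=True)`; here a plain list is used so the sketch runs without network access.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one token id per character."""
    return [ord(ch) for ch in text]


def pack_stream(
    samples: Iterable[str],
    tokenize: Callable[[str], List[int]],
    seq_len: int,
    skip_samples: int = 0,
) -> Iterator[Dict[str, object]]:
    """Lazily tokenize samples and pack the tokens into fixed-length sequences.

    skip_samples is a simplified form of resuming: samples that were already
    consumed before a restart are read but not tokenized again.
    """
    buffer: List[int] = []
    consumed = 0
    for text in samples:
        consumed += 1
        if consumed <= skip_samples:
            continue  # already seen before the restart
        # Tokenize on the fly, only when the sample is actually needed.
        buffer.extend(tokenize(text))
        # Emit as many full sequences as the buffer currently holds.
        while len(buffer) >= seq_len:
            yield {"input_ids": buffer[:seq_len], "consumed_samples": consumed}
            buffer = buffer[seq_len:]
    # Leftover tokens shorter than seq_len are dropped in this sketch.


if __name__ == "__main__":
    stream = ["abc", "de", "fghi"]  # stands in for a streaming dataset
    for packed in pack_stream(stream, toy_tokenize, seq_len=4):
        print(packed)
```

Note that resuming from `consumed_samples` in this sketch discards any tokens still sitting in the buffer at checkpoint time; a real implementation would likely need to save that state as well.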

zigzagcai avatar Jun 20 '24 13:06 zigzagcai