InternEvo
[Feature] a very simple hugging-face dataloader
Describe the feature
A very simple on-the-fly dataloader is needed to support most public datasets.
Will you implement it?
- [X] I would like to implement this feature and create a PR!
Completed in https://github.com/InternLM/InternEvo/pull/244
- load Hugging Face datasets in `streaming` mode, i.e. lazily load data samples without downloading the whole dataset before training
- on-the-fly tokenization
- support auto_resume for hf dataloader
- support packing for hf dataloader to improve hardware utilization
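
To illustrate how the pieces listed above can fit together, here is a minimal, standard-library-only sketch. It is not the code from the linked PR: the class `PackedStreamLoader`, the helper `toy_tokenize`, and the `state` layout are all hypothetical names made up for this example. It assumes samples arrive as a lazy iterable of strings (in practice this could be what `datasets.load_dataset(..., streaming=True)` yields, mapped to its text field), tokenizes each sample only when it is reached, packs tokens into fixed-length sequences, and exposes a small state dict so iteration can be resumed.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in for a real tokenizer: one id per word (its length)."""
    return [len(word) for word in text.split()]


class PackedStreamLoader:
    """Hypothetical sketch: lazy tokenization + packing + resumable state."""

    def __init__(self, samples: Iterable[str], tokenize: Callable[[str], List[int]],
                 seq_len: int, eos_id: int = 0, state: Optional[Dict] = None):
        self.samples = samples
        self.tokenize = tokenize
        self.seq_len = seq_len
        self.eos_id = eos_id
        # Resume state: how many source samples were consumed, plus leftover tokens.
        self.consumed = state["consumed"] if state else 0
        self.buffer: List[int] = list(state["buffer"]) if state else []

    def state_dict(self) -> Dict:
        """Snapshot that is sufficient to continue from the current position."""
        return {"consumed": self.consumed, "buffer": list(self.buffer)}

    def _drain(self) -> Iterator[List[int]]:
        # Emit every full sequence currently available in the buffer.
        while len(self.buffer) >= self.seq_len:
            packed = self.buffer[:self.seq_len]
            self.buffer = self.buffer[self.seq_len:]
            yield packed

    def __iter__(self) -> Iterator[List[int]]:
        yield from self._drain()  # a restored buffer may already hold full sequences
        skip = self.consumed
        for index, text in enumerate(self.samples):
            if index < skip:
                continue  # fast-forward past samples consumed before the checkpoint
            # Tokenize on the fly and separate documents with an EOS token.
            self.buffer.extend(self.tokenize(text) + [self.eos_id])
            self.consumed = index + 1
            yield from self._drain()
        # Trailing tokens shorter than seq_len are dropped in this sketch.


texts = ["a bb ccc", "dddd", "ee f ggg hhhh"]

# Full pass over the stream.
full = list(PackedStreamLoader(texts, toy_tokenize, seq_len=4))

# Interrupted pass: take one packed sequence, checkpoint, then resume.
loader = PackedStreamLoader(texts, toy_tokenize, seq_len=4)
first = next(iter(loader))
checkpoint = loader.state_dict()
resumed = list(PackedStreamLoader(texts, toy_tokenize, seq_len=4, state=checkpoint))
```

In this sketch the resumed loader fast-forwards by re-reading and discarding already consumed samples, which is simple but costs time proportional to the checkpoint position; a real implementation may be able to skip more cheaply. Saving the leftover token buffer together with the sample count is what keeps the packed sequences identical before and after a resume.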