DeepSpeedExamples
Question: update preprocessing scripts to use HuggingFace datasets for pretraining?
Collecting the datasets needed for pretraining is a bit of work, especially when downloading from lots of different URLs behind a firewall.
https://github.com/microsoft/DeepSpeedExamples/tree/25d73cf73fb3dc66faefa141b7319526555be9fc/Megatron-LM-v1.1.5-ZeRO3#datasets
I see that some versions of these seem to be available in the HuggingFace datasets repo, like openwebtext:
https://huggingface.co/datasets/openwebtext
For the above, it's especially nice since @stas00 has a small subset one can use for testing:
https://huggingface.co/datasets/stas/openwebtext-10k
It's pretty straightforward to extend the preprocessing script to use an HF dataset as the source rather than a JSON file. Would something like that be acceptable as a PR?
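As a rough illustration of what the extension could look like, here is a minimal sketch that dumps a HuggingFace dataset into the loose JSON-lines format the Megatron preprocessing script consumes (one `{"text": ...}` object per line). The function name, the `text_column` parameter, and the output path are all my own placeholders, not anything from the repo:

```python
import json

def records_to_jsonl(records, output_path, text_column="text"):
    """Write records out as JSON lines of the form {"text": ...},
    the loose-JSON input format the Megatron preprocessing script reads.
    `records` is any iterable of dict-like rows, e.g. an HF dataset split."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps({"text": record[text_column]}) + "\n")
            count += 1
    return count

# Hypothetical usage with HF datasets (requires `pip install datasets`),
# using the small test subset mentioned above:
# from datasets import load_dataset
# ds = load_dataset("stas/openwebtext-10k", split="train")
# records_to_jsonl(ds, "openwebtext-10k.jsonl")
```

A nicer variant would skip the intermediate JSON file entirely and have the preprocessing script iterate the dataset directly, but writing JSON lines keeps the existing script untouched.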