Question: update preprocessing scripts to use HuggingFace datasets for pretraining?

Open adammoody opened this issue 2 years ago • 12 comments

Collecting the datasets needed for pretraining is a bit of work, especially when downloading from lots of different URLs behind a firewall.

https://github.com/microsoft/DeepSpeedExamples/tree/25d73cf73fb3dc66faefa141b7319526555be9fc/Megatron-LM-v1.1.5-ZeRO3#datasets

I see that versions of some of these seem to be available in the HuggingFace datasets repo, such as openwebtext.

https://huggingface.co/datasets/openwebtext

For the above, it's especially nice since @stas00 has a small subset one can use for testing:

https://huggingface.co/datasets/stas/openwebtext-10k

It's pretty straightforward to extend the preprocessing script to use HF datasets as a source rather than a JSON file. Would something like that be acceptable as a PR?
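To make the idea concrete, here is a rough sketch of one way this could look. It is not the actual preprocessing script: it only dumps an HF dataset to the one-JSON-object-per-line format that the existing script already reads. The helper names (`iter_documents`, `write_loose_json`, `load_hf_records`) and the output filename are made up for illustration, and it assumes the dataset exposes its documents in a `text` column.

```python
import json
from typing import Iterable, Iterator, Mapping


def iter_documents(records: Iterable[Mapping], text_key: str = "text") -> Iterator[str]:
    """Yield the text of each record, skipping empty or whitespace-only documents."""
    for record in records:
        text = record[text_key]
        if text.strip():
            yield text


def write_loose_json(records: Iterable[Mapping], path: str, text_key: str = "text") -> int:
    """Write one {"text": ...} JSON object per line; return the number of documents written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for text in iter_documents(records, text_key):
            f.write(json.dumps({"text": text}) + "\n")
            count += 1
    return count


def load_hf_records(name: str = "stas/openwebtext-10k", split: str = "train"):
    """Load a dataset from the HF hub (needs the `datasets` package and network access)."""
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(name, split=split)


# Hypothetical usage (downloads the small test subset):
#   records = load_hf_records()
#   n = write_loose_json(records, "openwebtext-10k.json")
#   print(f"wrote {n} documents")
```

The resulting file could then be passed to the existing preprocessing script unchanged. A real PR would more likely read from the dataset directly and skip the intermediate file.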

adammoody, Aug 03 '21 20:08