DeepSpeedExamples
Which Wikipedia and BookCorpus datasets should be used for the BERT pre-training example?
I am trying to follow the example here:
https://www.deepspeed.ai/tutorials/bert-pretraining/
The section on getting the datasets only says: "Note: Downloading and pre-processing instructions are coming soon."
I tried googling, but those datasets seem tricky to find. Even if I found them, I would not be sure they are the correct versions to use with the script.
Have you found a procedure for preprocessing the data yet?