DeepSpeedExamples

Which Wikipedia and BookCorpus datasets to use for the BERT pre-training example?

Open · SantoshGuptaML opened this issue 4 years ago · 1 comment

I am trying to follow the example here:

https://www.deepspeed.ai/tutorials/bert-pretraining/

The section on getting the datasets says: 'Note: Downloading and pre-processing instructions are coming soon.'

I tried searching online, but those datasets seem tricky to find. Even if I found them, I'm not sure they would be the correct versions to use with the script.

SantoshGuptaML, Apr 21 '21 10:04

Have you already found the procedure to perform data preprocessing?

zyz0000, Sep 02 '21 00:09