LMOps
RoBERTa Corpus
RoBERTa Corpus is a combination of multiple sources. Did you perform any form of filtering? The BookCorpus dataset alone has 74M rows, but I saw that your RoBERTa folder is named 20M. May I ask what rules you used to filter the final data? I hope you can give a detailed description or, if possible, publicly release your RoBERTa training data. Thank you for your help.
We didn't perform any data filtering on the corpus. We construct the data as follows:
- Combine these sources.
- Shuffle the documents.
- Tokenize them into chunks of 512 tokens.
- Take the first 20M chunks for training (in practice, we stopped tokenization once the tokenized data contained 20M chunks).
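The steps above can be sketched in a few lines. This is a minimal illustration, not the actual preprocessing script: `build_chunks`, its parameters, and the idea of passing the tokenizer as a callable are all assumptions for the sake of the example.

```python
import random

def build_chunks(documents, tokenize, chunk_len=512, max_chunks=20_000_000, seed=0):
    """Shuffle documents, tokenize them, and pack the tokens into
    fixed-length chunks.

    `tokenize` is any callable mapping a document string to a list of
    tokens (e.g. a RoBERTa tokenizer's encode method; hypothetical here).
    Stops as soon as `max_chunks` chunks exist, mirroring the
    "stop tokenization once we have 20M chunks" step.
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)        # shuffle the combined sources
    chunks, buffer = [], []
    for doc in docs:
        buffer.extend(tokenize(doc))         # tokenize one document at a time
        while len(buffer) >= chunk_len:
            chunks.append(buffer[:chunk_len])
            buffer = buffer[chunk_len:]
            if len(chunks) >= max_chunks:    # early stop: skip the remaining docs
                return chunks
    return chunks
```

Because chunking stops at `max_chunks`, documents near the end of the shuffled order may never be tokenized at all, which matches the description of halting tokenization early rather than filtering.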