LMOps
RoBERTa Corpus
RoBERTa Corpus is a combination of multiple sources. Did you perform any form of filtering? The BookCorpus dataset alone has 74M rows, but I saw that your RoBERTa folder is named 20M. May I ask what rules you used to filter the final data? I hope you can give a detailed description or, if possible, publicly release your RoBERTa training data. Thank you for your help.
We didn't perform any data filtering on the corpus. We construct the data as follows:
- Combine these sources.
- Shuffle the documents.
- Tokenize them into chunks of 512 tokens.
- Take the first 20M chunks for training (in practice, we stopped tokenization once the tokenized data contained 20M chunks).
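The steps above can be sketched in a few lines. This is a minimal illustration, not the actual preprocessing script: `build_chunks`, its parameters, and the idea of passing the tokenizer as a callable are all assumptions for the sake of the example.

```python
import random

def build_chunks(documents, tokenize, chunk_len=512, max_chunks=20_000_000, seed=0):
    """Shuffle documents, tokenize them, and pack the tokens into
    fixed-length chunks.

    `tokenize` is any callable mapping a document string to a list of
    tokens (e.g. a RoBERTa tokenizer's encode method; hypothetical here).
    Stops as soon as `max_chunks` chunks exist, mirroring the
    "stop tokenization once we have 20M chunks" step.
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)        # shuffle the combined sources
    chunks, buffer = [], []
    for doc in docs:
        buffer.extend(tokenize(doc))         # tokenize one document at a time
        while len(buffer) >= chunk_len:
            chunks.append(buffer[:chunk_len])
            buffer = buffer[chunk_len:]
            if len(chunks) >= max_chunks:    # early stop: skip the remaining docs
                return chunks
    return chunks
```

Because chunking stops at `max_chunks`, documents near the end of the shuffled order may never be tokenized at all, which matches the description of halting tokenization early rather than filtering.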