LMOps

RoBERTa Corpus

Open • stephencurry-web opened this issue 10 months ago • 1 comment

The RoBERTa corpus is a combination of multiple sources. Did you not perform any form of filtering? The bookcorpus dataset alone has 74M rows, but I saw that your RoBERTa folder is named 20M. May I ask what rules you used to filter the final data? I would appreciate a detailed description or, if possible, a public release of your RoBERTa training data. Thank you for your help.

stephencurry-web avatar Apr 01 '24 07:04 stephencurry-web

We didn't perform any data filtering on the corpus. We constructed the data as follows:

  1. Combine these sources.
  2. Shuffle the documents.
  3. Tokenize them into chunks of 512 tokens.
  4. Take the first 20M chunks for training (in practice, we stopped tokenizing once the tokenized data contained 20M chunks).

t1101675 avatar Apr 02 '24 02:04 t1101675