cc_net
The statmt version of CC-100 differs from the paper
Hi, first of all, thank you for your great work on multilingual NLP.
I'm trying to replicate XLM-R in my own research, and I found that the corpus from statmt differs considerably from the description in the XLM-R paper.
For example, for Esperanto the paper reports 157M tokens, but the statmt version actually contains about 290M tokens.
I tokenized with both sentencepiece + fairseq-preprocess and the transformers tokenizer (xlm-roberta-base) for double-checking.
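
For reference, this is roughly how I did the transformers-side count (a minimal sketch; `eo.txt` is a placeholder for the decompressed Esperanto shard, and the exact counting details may differ from my actual script):

```python
from transformers import AutoTokenizer

# Load the same tokenizer used by XLM-R
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

total_tokens = 0
with open("eo.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        # Count subword tokens per line, excluding the <s>/</s> special tokens
        total_tokens += len(tokenizer.encode(line, add_special_tokens=False))

print(f"Total tokens: {total_tokens:,}")
```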
I would guess the content of the two corpora is similar (given that CC-100 is based on web scraping), since they have similar file sizes (about 0.9 GiB), so what makes the token counts so different?