cc_net
The statmt version of CC-100 differs from the paper
Hi, first of all, thank you for your great work on multilingual NLP.
I'm trying to replicate XLM-R in my own research, and I found that the corpus from statmt differs considerably from the description in the XLM-R paper.
For example, for Esperanto the paper reports 157M tokens, but the statmt version actually contains about 290M tokens.
I tokenized with both sentencepiece + fairseq-preprocess and the transformers tokenizer (xlm-roberta-base) for double-checking.
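
For reference, this is roughly how I did the transformers-side count (a minimal sketch; `eo.txt` is a placeholder for the decompressed Esperanto shard, and the exact counting details may differ from my actual script):

```python
from transformers import AutoTokenizer

# Load the same tokenizer used by XLM-R
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

total_tokens = 0
with open("eo.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        # Count subword tokens per line, excluding the <s>/</s> special tokens
        total_tokens += len(tokenizer.encode(line, add_special_tokens=False))

print(f"Total tokens: {total_tokens:,}")
```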
I would guess the content of the two corpora is similar (given that CC-100 is based on web scraping), since they have similar file sizes (about 0.9 GiB), so what makes the token counts so different?