RedPajama-Data

How are the 5 dumps of Common Crawl selected?

Open Stanislas0 opened this issue 1 year ago • 1 comments

When exploring the RedPajama dataset, I found that you selected the following five Common Crawl dumps:

common_crawl/2023-06
common_crawl/2020-05
common_crawl/2021-04
common_crawl/2022-05
common_crawl/2019-30

What were the criteria for selection, given that many more dumps are available in Common Crawl? Could you please provide more information? Thanks a lot!

Stanislas0 avatar Apr 26 '23 09:04 Stanislas0

Hi @Stanislas0! Great question. We tried to cover five different years (similar to the LLaMA recipe). In addition, we aimed to minimize overlap between the different dumps. Here's an overview of the monthly overlaps: https://commoncrawl.github.io/cc-crawl-statistics/plots/crawloverlap

mauriceweber avatar Apr 29 '23 11:04 mauriceweber

Has there been any attempt to use data from the same interval (2017 to 2020) used in the Llama paper?

nbcc avatar May 30 '23 09:05 nbcc