RedPajama-Data
How are the 5 dumps of Common Crawl selected?
While exploring the RedPajama dataset, I found that you selected the following five Common Crawl dumps:
common_crawl/2023-06 common_crawl/2020-05 common_crawl/2021-04 common_crawl/2022-05 common_crawl/2019-30
What were the criteria for this selection, given that Common Crawl offers many more dumps? Could you please provide more information? Thanks a lot!
Hi @Stanislas0! Great question. We tried to cover five different years (similar to the LLaMA recipe). In addition, we also aimed to minimize the overlap between the different dumps. Here's an overview of the monthly overlaps: https://commoncrawl.github.io/cc-crawl-statistics/plots/crawloverlap
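For illustration only, the idea of picking one dump per year while keeping pairwise overlap low could be sketched roughly as below. This is not the procedure actually used for RedPajama. The function `pick_dumps`, the candidate list, and every overlap value are hypothetical placeholders; real overlap figures would have to be read from the Common Crawl statistics page linked above.

```python
from itertools import combinations, product


def pick_dumps(candidates_by_year, overlap, default=0.0):
    """Pick one dump per year so that the summed pairwise overlap is minimal.

    Brute force over all combinations, which is fine for a handful of dumps.
    """
    years = sorted(candidates_by_year)
    best_choice, best_score = None, float("inf")
    for choice in product(*(candidates_by_year[y] for y in years)):
        # Sum the overlap of every unordered pair in this candidate selection.
        score = sum(
            overlap.get(frozenset(pair), default)
            for pair in combinations(choice, 2)
        )
        if score < best_score:
            best_choice, best_score = choice, score
    return list(best_choice), best_score


# Hypothetical candidates (three years only, to keep the example small).
candidates_by_year = {
    2019: ["2019-30", "2019-35"],
    2020: ["2020-05", "2020-10"],
    2021: ["2021-04"],
}

# Made-up overlap scores in [0, 1]; NOT real Common Crawl statistics.
toy_overlap = {
    frozenset(("2019-30", "2020-05")): 0.10,
    frozenset(("2019-30", "2020-10")): 0.20,
    frozenset(("2019-35", "2020-05")): 0.30,
    frozenset(("2019-35", "2020-10")): 0.25,
    frozenset(("2019-30", "2021-04")): 0.05,
    frozenset(("2019-35", "2021-04")): 0.05,
    frozenset(("2020-05", "2021-04")): 0.15,
    frozenset(("2020-10", "2021-04")): 0.35,
}

selected, score = pick_dumps(candidates_by_year, toy_overlap)
print(selected, round(score, 2))
```

With these toy numbers the sketch selects one dump for each of the three years. With real overlap data the result would depend entirely on the measured values.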
Has there been any attempt to use data from the same interval (2017 to 2020) used in the Llama paper?