mimir
Original dataset?
Hi,
Thank you for the great work! I was not able to find the original Pile subcategory datasets (arXiv, GitHub, etc.) in the Hugging Face data repo. There are only the processed ones (7-gram, 13-gram). Could you share the original ones as well?
Thank you!
Hey @kirklandWater1,
These n-gram filtered datasets are indeed the only ones available, since they are the ones we use in our work. You can use the ngram_13_0.8 split if you do not want heavy non-member filtering. If you want the original data subsets, you can get them from the Pile directly.
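For anyone following along, here is a minimal sketch of selecting that split with the Hugging Face `datasets` library. It is not the maintainers' code: the `ngram_{n}_{threshold}` naming pattern is inferred only from the `ngram_13_0.8` name mentioned above, the helper names `split_name` and `load_filtered_subset` are hypothetical, and `repo_id` and `subset` are placeholders to fill in from the dataset card.

```python
def split_name(n: int, threshold: float) -> str:
    """Build a split name such as 'ngram_13_0.8'.

    Assumption: splits follow the pattern ngram_{n}_{threshold},
    inferred from the single split name mentioned in this thread.
    """
    return f"ngram_{n}_{threshold}"


def load_filtered_subset(repo_id: str, subset: str, n: int = 13, threshold: float = 0.8):
    """Load one n-gram filtered split of one subset.

    repo_id and subset are placeholders; take the real values from
    the dataset card. Requires `pip install datasets` and network access.
    """
    # Imported lazily so split_name() works without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, subset, split=split_name(n, threshold))


print(split_name(13, 0.8))  # prints "ngram_13_0.8"
```

Calling `load_filtered_subset(repo_id, subset)` with the repository id and subset name from the dataset card should return the lightly filtered split, provided the naming assumption holds.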
Thank you for the prompt response! Does the Table 1 result come from the ngram_13_0.8 split as well?