Original dataset?

Open kirklandWater1 opened this issue 1 year ago • 2 comments

Hi,

Thank you for the great work! I was not able to find the original Pile subcategory datasets (arXiv, GitHub, etc.) in the Hugging Face data repo. There are only the processed ones (7-gram, 13-gram). Could you share the original ones as well?

Thank you!

kirklandWater1 avatar Apr 25 '24 03:04 kirklandWater1

Hey @kirklandWater1,

These n-gram-filtered datasets are indeed the only ones available, since they are the ones we use in our work. You can use the ngram_13_0.8 split if you do not want heavy non-member filtering. If you want the original data subsets, you can get them from the Pile directly.
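For anyone landing here, a rough sketch of loading that split with the Hugging Face `datasets` library. The repo id `iamgroot42/mimir` and the subset name `arxiv` are assumptions, not confirmed in this thread, and `split_name` and `load_mimir_split` are hypothetical helpers; check the dataset card for the exact names.

```python
def split_name(n: int, threshold: float) -> str:
    """Build a split name such as 'ngram_13_0.8' from the n-gram size and threshold."""
    return f"ngram_{n}_{threshold}"


def load_mimir_split(subset: str = "arxiv", n: int = 13, threshold: float = 0.8,
                     repo_id: str = "iamgroot42/mimir"):
    """Load one n-gram-filtered split.

    repo_id and subset are assumed values; adjust them to match the dataset card.
    """
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, subset, split=split_name(n, threshold))


# Example (needs network access and the `datasets` package):
# data = load_mimir_split("arxiv", n=13, threshold=0.8)
```

The same helper covers the 7-gram splits by changing `n` and `threshold`.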

iamgroot42 avatar Apr 25 '24 03:04 iamgroot42

Thank you for the prompt response! Does the Table 1 result come from the ngram_13_0.8 split as well?

kirklandWater1 avatar Apr 25 '24 04:04 kirklandWater1