Pile Dataset

Open mshukor opened this issue 2 years ago • 1 comment

Hello,

Could you share your filtered Pile dataset (180 GB)? The paper mentions only truncation as preprocessing; could you provide more details about the filtering step? Also, did you use specific subsets of the Pile (Pile-CC, Wiki, ArXiv, ...)?

Thanks in advance

Feb 16 '23 11:02 mshukor

We are not able to release the processed data. You can download the original Pile from its official website.

Mar 15 '23 06:03 JustinLin610