OFA

Pile Dataset

Open mshukor opened this issue 2 years ago • 1 comment

Hello,

Can you share your filtered Pile (180 GB) dataset? The paper mentions only truncation as preprocessing. Can you provide more details about your filtering step? Also, did you use specific subsets of the Pile (Pile-CC, Wiki, ArXiv...)?

Thanks in advance

mshukor Feb 16 '23 11:02

We are not able to release the processed data. You can try downloading the Pile from its official website.

JustinLin610 Mar 15 '23 06:03