OFA
Pile Dataset
Hello,
Can you share your filtered Pile dataset (180 GB)? The paper mentions only truncation as preprocessing. Can you provide more details about your filtering step? Also, did you use specific subsets of the Pile (Pile-CC, Wiki, ArXiv...)?
Thanks in advance
We are not able to release the processed data. You can try downloading the Pile from the official website.