the-pile
In the readme, there is a link that if followed produces a 404. See below. THIS REPO IS PROBABLY NOT WHAT YOU ARE LOOKING FOR. A copy of the Pile...
Hello, I sincerely appreciate your excellent work. Due to copyright issues with the Pile, it's no longer possible to download the 825 GB of data, so I can't use the decontamination...
Hi, thank you very much for releasing this great dataset. I am wondering whether the **original PILE dataset** (with 30 chunks) has already been shuffled. Or do we still need to...
Hi there, I followed the [GitHub downloader](https://github.com/EleutherAI/github-downloader/tree/master) repository and executed the [download_repo_text.py](https://github.com/EleutherAI/github-downloader/blob/master/download_repo_text.py) script. I obtained a total of 27,819,203 documents, just half of the documents reported here: https://github.com/EleutherAI/the-pile/blob/df97f8651ae3da658b19659b3ceaa6a34b0fc014/the_pile/datasets.py#L704 I fixed...
Hi, apologies if this is not the right place to note this, but after downloading and exploring the preprocessed GitHub part of The Pile I've noticed the metadata `file_name` are...
The Pile is too big for me. I only want to download the "Github" code data, but the Pile train set is split into 30 files. I would like...
Thanks for the great work! The download link for Books3 is not available. Will it be updated later?
Thank you for your contribution. I was trying to access the source data of GitHub, but suddenly https://the-eye.eu/public/AI/pile_preliminary_components/ is no longer accessible. A few days ago, I was able to...
Is it possible to get Books3 metadata, specifically which books are fiction and which are not?
Hi, are there any tools like the [ROOTS search tool implemented in BigScience Workshop](https://huggingface.co/spaces/bigscience-data/roots-search) that let you query the dataset efficiently and count the frequencies of certain items? Thanks.