Open-Assistant
Download Ebooks from Project Gutenberg
https://www.gutenberg.org/ has an extensive collection of ebooks in multiple languages and formats that would make great training data.
Project Gutenberg provides detailed legal information on which books are in the public domain and which are copyrighted. It would be great if someone could go through it and decide which books are okay to crawl and use as training data. My understanding is that scraping the contents is okay because they are publicly available in a browser, but I would like to be sure.
https://www.gutenberg.org/ebooks/feeds.html
https://www.gutenberg.org/policy/robot_access.html
https://www.gutenberg.org/help/copyright.html
https://www.gutenberg.org/policy/permission.html
I am currently working on a crawler notebook for this and will upload the datasets in multiple languages to Hugging Face.
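One step such a crawler would probably need is removing the Project Gutenberg header and license footer from each plain-text file before using it as training data. Below is a minimal sketch, not the notebook's actual code. It assumes the files carry the usual `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF ... ***` marker lines; the wording has varied over the years, so the patterns are deliberately loose, and the function name `strip_pg_boilerplate` is made up for illustration.

```python
import re

# Assumed marker lines; older files say "THIS" instead of "THE",
# and some omit the space after the asterisks.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_pg_boilerplate(text: str) -> str:
    """Return the text between the START and END markers (hypothetical helper).

    If a marker is missing, fall back to the start or end of the file,
    so unmarked text is returned unchanged apart from trimmed whitespace.
    """
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end and end.start() >= begin else len(text)
    return text[begin:finish].strip()


sample = (
    "Title and license header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "First line of the book.\n"
    "Second line of the book.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License footer\n"
)
body = strip_pg_boilerplate(sample)
print(body)
```

Files without recognizable markers pass through untouched, so it is worth logging those cases and checking them by hand.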
The pg19 dataset could work well enough, though it doesn't contain the more modern books whose copyright wasn't renewed (and I'm not sure whether non-renewal makes them PD outside of the US).
Wikisource also seems promising, especially in terms of multilinguality.
Additionally, the Australian branch of Project Gutenberg may be of interest, as it collects books that are PD outside of the US and thus can't be posted on the main PG website.
Thanks!
Should I filter out content that isn't PD (in the US)? What is the correct way of handling such content?
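One conservative option is to keep only the books whose catalog metadata marks them as public domain in the US and set everything else aside for manual review. This is a rough sketch of that idea, not legal advice and not a confirmed project policy. The record layout (dicts with `id`, `language`, and `rights` keys) is hypothetical, and the rights string `Public domain in the USA.` is my assumption about what the Project Gutenberg catalog uses, so check it against the current catalog before relying on it.

```python
from typing import Iterable

# Assumed rights label for US public-domain books; verify against the catalog.
PD_US_PREFIX = "public domain in the usa"


def is_public_domain_us(record: dict) -> bool:
    """True if the record's rights field starts with the assumed PD label."""
    rights = (record.get("rights") or "").strip().lower()
    return rights.startswith(PD_US_PREFIX)


def split_by_rights(records: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (keep, review): PD in the US versus everything else."""
    keep, review = [], []
    for record in records:
        (keep if is_public_domain_us(record) else review).append(record)
    return keep, review


# Hypothetical example records, not real catalog entries.
records = [
    {"id": 1, "language": "en", "rights": "Public domain in the USA."},
    {"id": 2, "language": "de", "rights": "Copyrighted. Read the copyright notice inside this book for details."},
    {"id": 3, "language": "fr", "rights": None},
]
keep, review = split_by_rights(records)
print([r["id"] for r in keep], [r["id"] for r in review])
```

Records with a missing or unrecognized rights field land in the review list rather than being silently kept, which errs on the side of excluding doubtful content.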
Following, as this is relevant for multilingual data as well.