
Download Ebooks from Project Gutenberg

Open sedthh opened this issue 2 years ago • 5 comments

https://www.gutenberg.org/ has an extensive collection of ebooks in multiple languages and formats that would make great training data.

sedthh avatar Feb 04 '23 18:02 sedthh

Project Gutenberg provides detailed legal information on which books are in the public domain and which are copyrighted. It would be great if someone could go through these pages and decide which books are okay to crawl and use as training data. (My understanding is that scraping the contents is okay since they are publicly available in a browser, but it is worth confirming.)

https://www.gutenberg.org/ebooks/feeds.html
https://www.gutenberg.org/policy/robot_access.html
https://www.gutenberg.org/help/copyright.html
https://www.gutenberg.org/policy/permission.html

sedthh avatar Feb 04 '23 18:02 sedthh

I am currently working on a crawler notebook for this and will upload the datasets in multiple languages to huggingface.
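
A rough sketch of how such a crawler could stay within the robot access policy linked above: it checks each URL against a robots.txt body and pauses between requests. The function name `crawl_politely`, the `fetch` callable, the user agent string, and the two-second delay are all assumptions for illustration, not part of the actual notebook, and the robots.txt text in the usage example is made up rather than copied from gutenberg.org.

```python
import time
from typing import Callable, Dict, Iterable
from urllib import robotparser


def crawl_politely(
    urls: Iterable[str],
    robots_txt: str,
    fetch: Callable[[str], str],
    user_agent: str = "example-crawler",    # hypothetical user agent
    delay_seconds: float = 2.0,              # assumed delay, tune to the site's policy
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch only the URLs that robots_txt allows, pausing after each request."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())

    results: Dict[str, str] = {}
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # skip anything the robots rules disallow
        results[url] = fetch(url)
        sleep(delay_seconds)  # rate-limit so the server is not hammered
    return results


if __name__ == "__main__":
    # Made-up robots.txt content, for demonstration only.
    sample_robots = "User-agent: *\nDisallow: /ebooks/search/\n"
    pages = crawl_politely(
        ["https://www.gutenberg.org/ebooks/1342",
         "https://www.gutenberg.org/ebooks/search/"],
        sample_robots,
        fetch=lambda url: "contents of " + url,  # stand-in for a real HTTP call
        sleep=lambda seconds: None,               # no real waiting in the demo
    )
    print(list(pages))
```

Passing `fetch` and `sleep` in as arguments keeps the policy logic separate from the network code, so it can be tried out without touching the real site.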

sedthh avatar Feb 04 '23 18:02 sedthh

The pg19 dataset could work well enough, though it doesn't contain the more modern books whose copyright wasn't renewed (and I'm not sure whether non-renewal makes them PD outside of the US).

Wikisource also seems promising, especially in terms of multilinguality.

Additionally, the Australian branch of Project Gutenberg may be of interest, as it collects books that are PD outside of the US and thus can't be posted on the main PG website.

hecko-yes avatar Feb 05 '23 00:02 hecko-yes

Thanks!

Should I filter out content that isn't PD (in the US)? What is the correct way of handling such content?
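
If filtering turns out to be the right call, it could look roughly like the sketch below. It assumes each crawled book is a dict with a `rights` string taken from the book's metadata, and that US public domain books carry wording like "Public domain in the USA." in that field. Both the field name and the exact wording are assumptions that should be checked against the real metadata before relying on this.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed marker text; verify against the actual metadata before use.
PD_US_MARKER = "public domain in the usa"


def split_by_us_public_domain(
    records: Iterable[Dict[str, str]],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Return (kept, dropped): records marked PD in the US, and all the others."""
    kept: List[Dict[str, str]] = []
    dropped: List[Dict[str, str]] = []
    for record in records:
        # A missing rights field is treated as "not confirmed PD" and dropped.
        rights = record.get("rights", "").lower()
        if PD_US_MARKER in rights:
            kept.append(record)
        else:
            dropped.append(record)
    return kept, dropped


if __name__ == "__main__":
    # Hypothetical records, for demonstration only.
    books = [
        {"id": "1", "language": "en", "rights": "Public domain in the USA."},
        {"id": "2", "language": "de", "rights": "Copyrighted."},
        {"id": "3", "language": "fr"},
    ]
    kept, dropped = split_by_us_public_domain(books)
    print([b["id"] for b in kept], [b["id"] for b in dropped])
```

Keeping the dropped records in a separate list, rather than discarding them, leaves room to review them later once the question of non-US public domain status is settled.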

sedthh avatar Feb 05 '23 00:02 sedthh

Following, as this is relevant for multilingual data as well.

pruksmhc avatar Feb 05 '23 04:02 pruksmhc