Create Project Gutenberg importer
Feature Request
Classics are some of the most used books on Open Library, as shown by how often they appear in the trending carousel. Many patrons are looking for easy-to-read formats for e-readers and phones, and Project Gutenberg is a fantastic open-source, open-collaboration initiative that provides these books.
We currently have ~1.4k Project Gutenberg editions in Open Library, which have been manually added by librarians/patrons. Including records and read links for the over 75k editions available in Project Gutenberg would help make this resource more discoverable, as well as help patrons find accessible copies of the books they are looking for.
Creating an importer that we can hook up to https://openlibrary.org/import/preview would be a good first step; it lets us work through the logic of how to fetch/marshal the data into the correct import format.
Data sources
- We do have a backup of Project Gutenberg books in the Archive. Here's a sample record: https://archive.org/details/tomsawyerdetecti00093gut
- This collection also contains a big listing of Gutenberg books, but it's not clear whether it's complete, up to date, or free of duplicates: https://archive.org/details/gutenberg
- Maybe their official website has a data dump? https://www.gutenberg.org/
- Each Gutenberg ebook has an ID number, e.g. 93 (see the URL helper sketched below):
  - HTML page URL: https://www.gutenberg.org/ebooks/93
  - Archive page: can maybe be found via this query: https://archive.org/search?query=call_number%3A%22gutenberg+etext%23+93%22
  - Each ebook has a "zip" folder of files, with (I think) no consistent structure/naming: https://www.gutenberg.org/files/93/
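To make those patterns concrete, here's a tiny helper that derives the candidate pages from an ebook ID. This is a sketch based only on the example URLs above; no guarantee these patterns hold for every ebook.

```python
from urllib.parse import quote_plus

def gutenberg_urls(ebook_id: int) -> dict:
    """Candidate URLs for a Gutenberg ebook, using the patterns above."""
    ia_query = quote_plus(f'call_number:"gutenberg etext# {ebook_id}"')
    return {
        "html_page": f"https://www.gutenberg.org/ebooks/{ebook_id}",
        "files_dir": f"https://www.gutenberg.org/files/{ebook_id}/",  # contents not consistently structured
        "archive_search": f"https://archive.org/search?query={ia_query}",
    }

print(gutenberg_urls(93))
```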
- Phase 1: Given a Gutenberg ebook ID, fetch metadata (from somewhere) and create the Open Library ImportRecord format that can be imported via /import/preview (see the sketch after this list).
- Phase 2: Look around for a bulk metadata dump that we could use either to just get a large number of ebook IDs and then reuse the Phase 1 process, or ideally to import directly.
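For Phase 1, a minimal sketch, assuming the per-book RDF endpoint (https://www.gutenberg.org/ebooks/{id}.rdf, which serves the same RDF/XML vocabulary as PG's catalog dumps). The output field names, and the `gutenberg:` source_records prefix, are illustrative guesses, not the confirmed ImportRecord schema:

```python
import requests
import xml.etree.ElementTree as ET

# RDF namespaces used by Project Gutenberg's catalog records.
NS = {
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

def fetch_gutenberg_record(ebook_id: int) -> dict:
    """Fetch one ebook's RDF metadata and marshal it into an
    import-record-shaped dict (field names are illustrative)."""
    resp = requests.get(f"https://www.gutenberg.org/ebooks/{ebook_id}.rdf")
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    return {
        "title": root.findtext(".//dcterms:title", namespaces=NS),
        "authors": [
            {"name": name.text}
            for creator in root.findall(".//dcterms:creator", NS)
            for name in creator.findall(".//pgterms:name", NS)
        ],
        # PG's "release date", NOT the original print publication date.
        "publish_date": root.findtext(".//dcterms:issued", namespaces=NS),
        "source_records": [f"gutenberg:{ebook_id}"],  # prefix is a guess
        "identifiers": {"project_gutenberg": [str(ebook_id)]},
    }

print(fetch_gutenberg_record(93))
```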
Breakdown
Related files
Refer to this map of common endpoints:
- The `/import/preview` endpoint: https://github.com/internetarchive/openlibrary/blob/ffddccf5c88b3f3549f8870f512b7f3d0acaf7ed/openlibrary/plugins/importapi/import_ui.py
Requirements Checklist
Checklist of requirements that need to be satisfied in order for this issue to be closed:
- [ ]
Stakeholders
- @pidgezero-one
Instructions for Contributors
- Before creating a new branch or pushing up changes to a PR, please first run these commands to ensure your repository is up to date, as the pre-commit bot may add commits to your PRs upstream.
@pidgezero-one does this look like one you'd be interested in?
@cdrini Yessir 🫡
Oh also this might have some useful info on gutenberg metadata formats/options! https://www.gutenberg.org/help/bibliographic_record.html#Table:_Bibliographic_Record
Note that, as detailed on the metadata page:
> Original publication metadata entries are only available for items published by Project Gutenberg since approximately 2022.
Before that, PG claimed to be "edition-less", so those ebooks basically need to be treated as their own unique editions. Post-2022 ebooks are more like a (proofread) OCR'd digital version of a print edition. The two cases probably need to be handled differently.
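A loose sketch of that split; the 2022 cutoff comes from the quote above, but everything else here is assumption:

```python
def import_strategy(release_date: str) -> str:
    """Pick an import strategy from a PG release date like '2023-05-01'."""
    if int(release_date[:4]) >= 2022:
        # May carry original-publication metadata: treat as a digital
        # version of an identifiable print edition.
        return "digital_version_of_print_edition"
    # Older "edition-less" ebooks: treat each as its own unique edition.
    return "standalone_gutenberg_edition"
```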
Gutenberg is a clearer case where an edition separate from the source publication should be created. Their own documentation says as much:
> Project Gutenberg metadata does not include the original print source publication date(s). Because Project Gutenberg eBooks are substantially different from the source book(s), we track the Project Gutenberg publication date ("release date"), but do not include print source information in the metadata. Differences almost always include dehyphenation, removing page headers/footers, changes to typography during markup, and sometimes relocation of images, footnotes, captions, etc. In addition, Project Gutenberg eBooks sometimes come from multiple print editions.
Actually, this URL looks interesting for a bulk download for Phase 2! https://www.gutenberg.org/ebooks/offline_catalogs.html
That paragraph refers to the long-time historical methodology that was in place from 1971 to 2022, but it has since changed, which is why I suggested that two different import strategies are needed for the two types of editions.
That offline catalog page lists a MARC dump. Why not just have @hornc run that through the MARC importer rather than creating something custom?
Importing Project Gutenberg metadata records from bulk MARC would be easy, and as @tfmorris mentions, the functionality already exists.
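For anyone wanting to eyeball that dump locally, a quick sketch with pymarc; the filename is hypothetical, and a real bulk import would go through the existing MARC importer rather than a script like this:

```python
from pymarc import MARCReader

with open("gutenberg.marc", "rb") as fh:  # hypothetical local dump filename
    for record in MARCReader(fh):
        if record is None:  # pymarc yields None for unparseable records
            continue
        titles = record.get_fields("245")
        title = " ".join(titles[0].get_subfields("a")) if titles else None
        # 856 fields typically carry electronic locations, e.g. the PG URL.
        urls = [u for f in record.get_fields("856") for u in f.get_subfields("u")]
        print(title, urls)
```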
What are the requirements for making a Project Gutenberg record 'Readable' for the purposes of this feature?
The Project Gutenberg identifier, and any URLs in the MARC record, will likely produce links through the existing import processes, but I suspect the motivation behind this request is to make the READ button show up in the UI.
I'm not sure of the current terminology for the READ / borrow options, or what criteria trigger each of them, and I don't know where to look for documentation.
The feature description above isn't clear on whether the value is in adding metadata records in general (which we can do with a MARC dump) or in integrated READ button access.
When I look at https://openlibrary.org/books/OL26581665M/Martin_Eden, the 'Read' button appears to direct to Project Gutenberg already, and I can't see anything else in the record other than the project_gutenberg identifier. It looks like this Read link-up feature is already implemented via the identifier?
Oh sweet! Just adding the project gutenberg identifier is all that's needed for the record to be readable, everything else is already wired up.
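If so, the minimal import payload could be as small as this. The field names are assumed from the import format rather than confirmed, and ID 93 is the Tom Sawyer, Detective example from earlier in the thread:

```python
# Assumed-minimal record for a readable PG edition: the
# project_gutenberg identifier alone should light up the Read button.
record = {
    "title": "Tom Sawyer, Detective",
    "authors": [{"name": "Mark Twain"}],
    "source_records": ["gutenberg:93"],  # prefix is a guess
    "identifiers": {"project_gutenberg": ["93"]},
}
```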
Long term, we want to push our system to allow for (1) one-off imports via the /import/preview page by identifier or URL, (2) one-off bulk imports via /import/batch, and (3) regularly scheduled bulk imports to keep our catalog up to date. Ideally, these three would share as much code as possible for each MetadataProvider.
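A hypothetical shape for that shared interface; "MetadataProvider" is the only name taken from the comment above, and the methods are assumptions:

```python
from abc import ABC, abstractmethod
from typing import Iterator

class MetadataProvider(ABC):
    @abstractmethod
    def fetch_one(self, identifier: str) -> dict:
        """(1) One-off import via /import/preview, by identifier or URL."""

    @abstractmethod
    def fetch_batch(self, source: str) -> Iterator[dict]:
        """(2)/(3) Bulk import via /import/batch or a scheduled job."""

class GutenbergProvider(MetadataProvider):
    def fetch_one(self, identifier: str) -> dict:
        ...  # e.g. the per-ID RDF fetch sketched in Phase 1

    def fetch_batch(self, source: str) -> Iterator[dict]:
        ...  # e.g. iterate the offline MARC/RDF catalog dump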
@hornc What would the process look like to do a bulk import of the Gutenberg MARC file? I'd love to make that process more accessible, perhaps by exposing it via the /import/batch UI. If there's a way we can work with @pidgezero-one to have more people comfortable doing that kind of work, that would be amazing!
One thing to note: I reckon we'll need to add a similar exception to prevent these records from resolving to unrelated existing records via e.g. ISBNs, like we did for Standard Ebooks (see https://github.com/internetarchive/openlibrary/issues/9372#issuecomment-2902087473 ). @pidgezero-one would you be able to tackle that piece?
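Something like the following guard during record matching, mirroring the Standard Ebooks exception linked above; the names here are hypothetical, not the actual openlibrary code:

```python
# Sources whose records should never be merged into an existing edition
# via ISBN-style matching, since each ebook is its own distinct edition.
NO_ISBN_MATCH_SOURCES = {"standard_ebooks", "gutenberg"}

def skip_isbn_matching(import_record: dict) -> bool:
    """True if this record must not resolve to an unrelated existing
    record via ISBN lookup (it represents its own distinct edition)."""
    prefixes = {s.split(":", 1)[0] for s in import_record.get("source_records", [])}
    return bool(prefixes & NO_ISBN_MATCH_SOURCES)
```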