IA import errors with 500 when Independently published
Problem
E.g. https://testing.openlibrary.org/import/preview?source=ia:seventhmanlargep0000maxb
In general we don't want to import Independently Published books into OL because, since ~2022, that label has been an indicator of low-quality records that are potentially print-on-demand and might never have existed! But once a copy has been digitized on IA, I reckon we want to be able to import it.
Reproducing the bug
Run while logged in with an account that has import perms:
await fetch("https://openlibrary.org/api/import/ia?identifier=richestmaninbaby0000geor_d9k8&require_marc=false&debug=***", {
"credentials": "include",
"referrer": "https://openlibrary.org/books/ia:richestmaninbaby0000geor_d9k8/richestmaninbaby0000geor_d9k8",
"method": "POST",
"mode": "cors"
});
- Expected behavior: The book is imported
- Actual behavior: The import fails with a 500 error because the record is Independently published.
Context
- Browser (Chrome, Safari, Firefox, etc):
- OS (Windows, Mac, etc):
- Logged in (Y/N):
- Environment (prod, dev, local): prod
Hi everyone, may I please work on this issue? Thank you!
I think by the time this code section is hit, the rec will already have a source_records value starting with ia:, so maybe this check shouldn't apply when one is present: https://github.com/internetarchive/openlibrary/blob/3a6883a047309fbf323bf9e01fb77357c26d056d/openlibrary/catalog/add_book/__init__.py#L814-L815
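A minimal sketch of that idea, to make the suggestion concrete. Everything here is an assumption for illustration: `BLOCKED_PUBLISHERS`, `IndependentlyPublishedError`, `has_ia_source`, and `check_publisher_block` are hypothetical names, not the functions in the linked file, and the real check may match publishers differently.

```python
# Hypothetical sketch only: these names are illustrative, not Open Library's actual API.
BLOCKED_PUBLISHERS = {"independently published"}  # assumed blocklist contents


class IndependentlyPublishedError(Exception):
    """Raised when a record is rejected by the publisher block (hypothetical)."""


def has_ia_source(rec: dict) -> bool:
    """Return True if any source_records entry starts with the 'ia:' prefix."""
    return any(src.startswith("ia:") for src in rec.get("source_records", []))


def check_publisher_block(rec: dict) -> None:
    """Reject independently published records unless they have an IA source."""
    if has_ia_source(rec):
        return  # a digitized copy exists on IA, so the block is skipped
    publishers = rec.get("publishers", [])
    if any(p.strip().casefold() in BLOCKED_PUBLISHERS for p in publishers):
        raise IndependentlyPublishedError(rec.get("title", "<untitled>"))
```

With this shape, a rec carrying `"source_records": ["ia:richestmaninbaby0000geor_d9k8"]` would pass even if its publisher is "Independently published", while the same rec with only a non-IA source would still be rejected.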
> But once a copy has been digitized on IA, I reckon we want to be able to import it.
Why? Does a copyfraud edition gain special privileges because it was able to avoid detection and get uploaded to the Internet Archive? How is the edition above different from:
https://openlibrary.org/books/OL41089351M/The_Seventh_Man https://openlibrary.org/books/OL59281637M
or any of the hundreds of other "independently published" editions that this restriction was put in place to block? Does digitizing a print-on-demand book grant it a special cachet? Why not just use the original?
Why doesn't the text of a public domain version, rendered with a larger font, suffice? Why doesn't the Project Gutenberg pseudo-edition suffice?
This particular work (https://openlibrary.org/works/OL8075019W/) has 191 "editions" on Open Library, of which fewer than a dozen are actual editions. Why should the poor librarians have to curate hundreds of low-quality fake "editions"?
Most of our concern with Independently Published materials is that they are largely books that do not exist, like those terrible examples of "Seventh Man" you've listed:
These are from when we allowed imports of "Independently Published" from Amazon. And Amazon itself has since deleted these fraudulent books. Our strategy with these right now is to organize them into works to avoid confusing/misleading search results, but long term we also need to come up with a strategy to remove these.
But note our block on "Independently Published" books is very much a case of casting an over-wide net. There are many valid Independently Published books, but the signal-to-noise ratio was so low that importing them was not worth it, so to save our catalogue, we blocked them all.
Books on Internet Archive which are "Independently Published" are a very small subset of books which demonstrably do exist, and many of these aren't reprints but entirely independent works which we can't classify/import into OL because this block also applies to IA. Here's a random subset of such books. Note that many of these are in Open Library because they predate the block; here are books that are missing. I'm punting on the question of how to handle facsimiles/reprints, since the vast majority of these don't fall into that case.