IA import errors with 500 when Independently published

Open cdrini opened this issue 7 months ago • 2 comments

Problem

Eg https://testing.openlibrary.org/import/preview?source=ia:seventhmanlargep0000maxb

In general we don't want to import Independently Published books into OL because, since ~2022, that label has been an indicator of low-quality records that are potentially print-on-demand and might never have existed! But once a copy has been digitized on IA, I reckon we want to be able to import it.

Reproducing the bug

Run while logged in with an account that has import perms:

await fetch("https://openlibrary.org/api/import/ia?identifier=richestmaninbaby0000geor_d9k8&require_marc=false&debug=***", {
    "credentials": "include",
    "referrer": "https://openlibrary.org/books/ia:richestmaninbaby0000geor_d9k8/richestmaninbaby0000geor_d9k8",
    "method": "POST",
    "mode": "cors"
});
  • Expected behavior: The book is imported
  • Actual behavior: It errors due to Independently published:

Image

Context

  • Browser (Chrome, Safari, Firefox, etc):
  • OS (Windows, Mac, etc):
  • Logged in (Y/N):
  • Environment (prod, dev, local): prod

cdrini avatar May 11 '25 14:05 cdrini

Hi everyone, may I please work on this issue? Thank you!

DebbieSan avatar Jun 10 '25 14:06 DebbieSan

I think by the time this code section is hit, the rec will have a source_records value for ia:, so maybe if that's present, this check shouldn't apply: https://github.com/internetarchive/openlibrary/blob/3a6883a047309fbf323bf9e01fb77357c26d056d/openlibrary/catalog/add_book/__init__.py#L814-L815
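A minimal sketch of that idea, not the actual Open Library code: the function names below are hypothetical, and it assumes `rec` is a dict whose `source_records` is a list of strings and whose `publishers` is a list of publisher names.

```python
# Hypothetical helpers illustrating the proposed exemption; the real check
# in openlibrary/catalog/add_book may be structured differently.

def is_independently_published(publishers):
    """Return True if any publisher name is 'Independently Published' (case-insensitive)."""
    return any(p.strip().casefold() == "independently published" for p in publishers)


def has_ia_source(rec):
    """Return True if the record carries at least one `ia:` source record."""
    return any(sr.startswith("ia:") for sr in rec.get("source_records", []))


def should_block_import(rec):
    """Block Independently Published records unless they come from an IA scan."""
    if has_ia_source(rec):
        # Assumption: a digitized copy on IA demonstrates the book exists.
        return False
    return is_independently_published(rec.get("publishers", []))
```

With this shape, a record with `source_records` of `["ia:seventhmanlargep0000maxb"]` would pass the check, while the same publisher on a record from any other source would still be blocked.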

scottbarnes avatar Jun 10 '25 16:06 scottbarnes

> But once a copy has been digitized on IA, I reckon we want to be able to import it.

Why? Does a copyfraud edition gain special privileges because it was able to avoid detection and get uploaded to the Internet Archive? How is the edition above different from:

https://openlibrary.org/books/OL41089351M/The_Seventh_Man https://openlibrary.org/books/OL59281637M

or any of the hundreds of other "independently published" editions that this restriction was put in place to keep out? Does digitizing a print-on-demand book grant it a special cachet? Why not just use the original?

Why doesn't the text of a public domain version, rendered with a larger font, suffice? Why doesn't the Project Gutenberg pseudo-edition suffice?

This particular work (https://openlibrary.org/works/OL8075019W/) has 191 "editions" on Open Library, of which fewer than a dozen are actual editions. Why should the poor librarians have to curate hundreds of low-quality fake "editions"?

tfmorris avatar Aug 27 '25 06:08 tfmorris

Most of our concern with Independently Published materials is that they are largely books that do not exist, like those terrible examples of "Seventh Man" you've listed:

Image

These are from when we allowed imports of "Independently Published" from Amazon. And Amazon itself has since deleted these fraudulent books. Our strategy with these right now is to organize them into works to avoid confusing/misleading search results, but long term we also need to come up with a strategy to remove these.

But note our block on "Independently Published" books is very much a casting of an over-wide net. There are many valid Independently Published books, but the signal-to-noise ratio was not at all worth it, so to save our catalogue, we blocked them all.

Books on Internet Archive which are "Independently Published" are a very small subset of books which demonstrably do exist, and many of these aren't reprints but entirely independent works which we can't classify/import into OL because this block also applies to IA. Here's a random subset of such books. Note many of these are in Open Library because they were imported before the block; here's books that are missing. I'm punting on the question of handling facsimiles/reprints, since the vast majority of these don't fall into that case.

cdrini avatar Aug 28 '25 20:08 cdrini