openlibrary Switch covers importing IA endpoint

Switch the covers import endpoint to, e.g., https://archive.org/services/img/aldosfantastical0000frie/full/pct:600/0/default.jpg

This appears to choose more accurately between the cover and the title page, avoiding blank covers but using them when necessary.

Some evaluation would be useful to see if this is indeed a better switch.

https://github.com/internetarchive/openlibrary/blob/2a36c54ed941449d0bf4ad84497f5bc5f3c747b0/openlibrary/core/ia.py#L117-L123

Stakeholders

@scottbarnes @hornc

Mar 09 '23 01:03 cdrini

The short of this is that the endpoint used to display covers on IA searches seems substantially better than the current endpoint, though the resolution is lower on the proposed newer endpoint.

I made a Google Colab that enables a very simplistic cover comparison between the current and proposed 'new' cover endpoints for around 25 books that were on the import list of IA items that appeared book-like, but lacked a MARC record, and those that appeared book-like and had a MARC record: https://colab.research.google.com/drive/18AT-Hdu7j9dyrVaSR3VAEPTceC9qVpBp#scrollTo=yS1E0hh00KFT

The script simply iterates through the list of ocaids and, for each one, fetches the cover via the current endpoint and the 'new' one.
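For anyone who wants to reproduce the comparison outside Colab, here is a minimal sketch of the URL-pairing step only (no fetching or rendering). The 'new' URL pattern is the one from the issue description; the 'current' pattern (`/download/<ocaid>/page/title.jpg`) and the names `cover_url_pairs`, `CURRENT_TEMPLATE`, and `PROPOSED_TEMPLATE` are my assumptions for illustration, not what the notebook actually uses.

```python
from __future__ import annotations

from typing import Iterable

# Assumed pattern for the current endpoint (title page image of an item).
CURRENT_TEMPLATE = "https://archive.org/download/{ocaid}/page/title.jpg"
# Pattern of the proposed endpoint, taken from the issue description.
PROPOSED_TEMPLATE = (
    "https://archive.org/services/img/{ocaid}/full/pct:600/0/default.jpg"
)


def cover_url_pairs(ocaids: Iterable[str]) -> list[dict[str, str]]:
    """Return one row per ocaid with the current and proposed cover URLs."""
    rows = []
    for ocaid in ocaids:
        rows.append(
            {
                "ocaid": ocaid,
                "current": CURRENT_TEMPLATE.format(ocaid=ocaid),
                "proposed": PROPOSED_TEMPLATE.format(ocaid=ocaid),
            }
        )
    return rows


# Print the pairs for two of the items mentioned in this thread.
for row in cover_url_pairs(["bestlovedpoems0000henr", "operationjanus0000cros"]):
    print(row["ocaid"], row["current"], row["proposed"], sep="\n  ")
```

Each pair can then be fetched and displayed side by side, labeled by endpoint.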

The script takes about two minutes to run (for some reason...), but the results should be pretty easy to scroll through and analyze. Each image is labeled at the top as to which endpoint it comes from.

The 'new' endpoint seems to have either the same covers or 'better' ones. However, the covers on the proposed 'new' endpoint are usually a lower resolution. I didn't resize any of the image output, so that it's easier to see what is fetched, even if it makes it a bit annoying to view.

@cdrini @hornc

Mar 27 '23 03:03 scottbarnes

@cdrini @scottbarnes I think there's a deeper problem here than just switching the endpoints.

My understanding ~is~? / was that when a book was scanned, its best representative "title" page was marked: if the title appeared clearly on the cover, that's what would be chosen; if there was an internal "title page", as is often the case with older cloth or leather bound books, that's what would be at title.jpg.

It's possible that something has changed in the scanning process. I see archive.org items have a bookplateleaf page number, which seems related, but I can't see how it relates to any of the results in the Colab.

The bestlovedpoems0000henr example shows what the current system is trying to protect against: it shows an internal page instead of the cloth cover. The 'new' method just shows blank cloth.

operationjanus0000cros shows the current system fetching the correct image cover without pulling up an internal title page, which isn't needed.

vtenitvoikhsnovr0000hoop seems like it has bad data for title.jpg... and many of the other examples seem to have internal title pages marked up unnecessarily when their cover should do.

My feeling is the OL code is currently doing the right thing, but archive.org has data issues with how some title / covers are marked up.

Checking whether the items that have bad results under the current scheme were scanned recently or a long time ago will confirm whether OL is in line with current scanning practice.

If recently scanned items are inconsistent, or consistently show title pages when we'd expect covers, we'll need to feed that back to the scanning process.

If recent scans are consistent and there's a different current way of picking the best 'title' image, we'll need to know what that is.

Mar 27 '23 04:03 hornc

Ah, I see what you are saying about the cloth cover issue, @hornc. For some reason I wasn't seeing the whole list, even though I created the thing. :)

I will look at trying to get some more recent data for further discussion.

Mar 27 '23 04:03 scottbarnes

> My understanding ~is~? / was that when a book was scanned, its best representative "title" page was marked: if the title appeared clearly on the cover, that's what would be chosen; if there was an internal "title page", as is often the case with older cloth or leather bound books, that's what would be at title.jpg.

That appears to no longer be the case; now it appears that title.jpg is always available regardless of the cover, and set to the title page. Note the code snippet above defaults to /title.jpg! It only uses cover.jpg as a fallback if title.jpg 404s. So effectively this is always returning title pages!
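To make the fallback behavior concrete, here is a minimal sketch of the logic as described above: try title.jpg first and only fall back to cover.jpg on a 404. It is not a copy of the linked code. The base URL pattern is an assumption, and the status lookup is passed in as a hypothetical `head_status` callable so the logic can be exercised without network access.

```python
from typing import Callable

# Assumed base path for an item's page images; treat as illustrative.
PAGE_BASE_TEMPLATE = "https://archive.org/download/{ocaid}/page/"


def get_cover_url(ocaid: str, head_status: Callable[[str], int]) -> str:
    """Prefer title.jpg; use cover.jpg only if title.jpg returns 404.

    head_status is a hypothetical helper that returns the HTTP status code
    for a URL, e.g. (with the requests library installed):
        lambda url: requests.head(url, allow_redirects=True).status_code
    """
    base_url = PAGE_BASE_TEMPLATE.format(ocaid=ocaid)
    if head_status(base_url + "title.jpg") == 404:
        return base_url + "cover.jpg"
    return base_url + "title.jpg"


# If title.jpg always exists, the cover.jpg branch is never taken.
always_ok = lambda url: 200
print(get_cover_url("operationjanus0000cros", always_ok))
```

With title.jpg now apparently always present, every item takes the first branch, which is why the result is effectively always a title page.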

Here is a comparison of the three endpoints, based on the most recent 100 importbot edit OCAIDs. It seems like default.jpg has been identical to cover.jpg in these cases, but I think that might just be because the items appear to be a little old. I know Jude recently added new logic for cloth cover detection, but I'm not sure which endpoint uses that.

Comparison


Feb 01 '24 22:02 cdrini

@cdrini The feature / user story here as I see it is:

In order to get a visual indication of what the book is, (as a library patron) I want to see a book image with the title, author, and cover artwork (if any).

This means for most modern style books we want cover_url, but for cloth bound books and similar, we need title_url, since looking at the cloth does not satisfy the above.
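A minimal sketch of that selection rule, assuming we somehow already know whether the cover itself shows the title. The `cover_shows_title` flag and the `pick_display_image` name are hypothetical; how to obtain that signal reliably is exactly the open question in this thread.

```python
def pick_display_image(cover_url: str, title_url: str, cover_shows_title: bool) -> str:
    """Choose the image that best identifies the book to a patron.

    cover_shows_title is a hypothetical input: True for most modern books
    with printed covers, False for blank cloth or leather bindings.
    """
    if cover_shows_title:
        return cover_url
    # Blank cloth tells the patron nothing, so fall back to the title page.
    return title_url


# A cloth bound item should resolve to its title page image.
print(pick_display_image("cover.jpg", "title.jpg", cover_shows_title=False))
```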

AFAICT default_url is always the same image as cover_url?

There used to be either a manual process or automated smarts to figure this out correctly, and it looks like it's no longer working. archive.org is probably going to have the same issue -- maybe there is another way to determine cloth bound and pick the correct and useful image?

Feb 01 '24 23:02 hornc
