openlibrary
Switch covers importing IA endpoint
Switch the covers import endpoint to one of the form https://archive.org/services/img/aldosfantastical0000frie/full/pct:600/0/default.jpg
This appears to choose more accurately between the cover and the title page, avoiding blank covers but still using them when necessary.
Some evaluation would be useful to see if this is indeed a better switch.
https://github.com/internetarchive/openlibrary/blob/2a36c54ed941449d0bf4ad84497f5bc5f3c747b0/openlibrary/core/ia.py#L117-L123
Stakeholders
@scottbarnes @hornc
The short of this is that the endpoint used to display covers on IA searches seems substantially better than the current endpoint, though the resolution is lower on the proposed newer endpoint.
I made a Google Colab that enables a very simplistic cover comparison between the current and proposed 'new' cover endpoints for around 25 books. The books come from the import list of IA items that appeared book-like, both those that lacked a MARC record and those that had one: https://colab.research.google.com/drive/18AT-Hdu7j9dyrVaSR3VAEPTceC9qVpBp#scrollTo=yS1E0hh00KFT
The script simply iterates through the list of `ocaid`s and, for each one, fetches the cover via the current endpoint and via the 'new' one.
The script takes about two minutes to run (for some reason...), but the results should be pretty easy to scroll through and analyze. Each image is labeled at the top as to which endpoint it comes from.
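For readers without access to the Colab, the loop described above can be sketched roughly as follows. This is not the Colab's actual code. The 'new' URL pattern is the one given at the top of this issue; the 'current' pattern (`/download/<ocaid>/page/title.jpg`) is an assumption about what the linked `ia.py` snippet requests, and the names `comparison_urls` and `fetch` are made up for illustration.

```python
import urllib.request

# Pattern taken from the example URL in the issue description.
NEW_ENDPOINT = "https://archive.org/services/img/{ocaid}/full/pct:600/0/default.jpg"
# ASSUMPTION: the current endpoint serves page images under /download/<ocaid>/page/.
CURRENT_ENDPOINT = "https://archive.org/download/{ocaid}/page/title.jpg"


def comparison_urls(ocaids):
    """Return (ocaid, current_url, new_url) tuples for a list of OCAIDs."""
    return [
        (ocaid, CURRENT_ENDPOINT.format(ocaid=ocaid), NEW_ENDPOINT.format(ocaid=ocaid))
        for ocaid in ocaids
    ]


def fetch(url, timeout=30):
    """Download the raw image bytes for one URL (performs a network request)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    for ocaid, current_url, new_url in comparison_urls(["aldosfantastical0000frie"]):
        print(ocaid)
        print("  current:", current_url)
        print("  new:    ", new_url)
```

Calling `fetch` on each URL in turn and displaying the bytes side by side, each labeled with its endpoint, gives the kind of comparison described here.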
The 'new' endpoint seems to have either the same covers or 'better' ones. However, the covers on the proposed 'new' endpoint are usually lower resolution. I didn't resize any of the image output so that it's easier to see what is fetched, even if it makes it a bit annoying to view.
@cdrini @hornc
@cdrini @scottbarnes I think there's a deeper problem here than just switching the endpoints.
My understanding ~is~? / was that when a book was scanned, its best representative "title" page was marked -- if the title appeared clearly on the cover, that's what would be chosen; if there was an internal "title page", as is often the case with older cloth or leather bound books, that's what would be at `title.jpg`.
It's possible that something has changed in the scanning process. I see archive.org items have a `bookplateleaf` page number, which seems related, but I can't see how it relates to any of the results in the Colab.
The `bestlovedpoems0000henr` example shows what the current system is trying to protect against: it shows an internal page instead of the cloth cover. The 'new' method just shows blank cloth.
`operationjanus0000cros` shows the current system fetching the correct cover image without pulling up an internal title page, which isn't needed.
`vtenitvoikhsnovr0000hoop` seems like it has bad data for `title.jpg`... and many of the other examples seem to have unnecessarily marked-up internal title pages when their cover should do.
My feeling is the OL code is currently doing the right thing, but archive.org has data issues with how some title / covers are marked up.
- Checking whether the items that give bad results under the current scheme were scanned recently or a long time ago will confirm whether OL is in line with current scanning practice.
If recently scanned items are inconsistent, or consistently show title pages when we'd expect covers, we'll need to feed that back to the scanning process.
If recent scans are consistent and there's a different current way of picking the best 'title' image, we'll need to know what that is.
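One rough way to run that check is to bucket the problem items by scan year. The sketch below assumes each item's metadata dictionary has already been retrieved and carries a `scandate` string starting with a four-digit year. That field name and format are assumptions, not something confirmed in this thread, and the item identifiers and dates in the example are placeholders rather than real data.

```python
from collections import Counter
from typing import Optional


def scan_year(metadata: dict) -> Optional[str]:
    """Return the four-digit scan year, or None if it cannot be determined."""
    # ASSUMPTION: the scan date lives in a 'scandate' field shaped like YYYYMMDD...
    scandate = str(metadata.get("scandate") or "")
    if len(scandate) >= 4 and scandate[:4].isdigit():
        return scandate[:4]
    return None


def count_by_scan_year(items: dict) -> Counter:
    """Count items per scan year; items without a usable date go under 'unknown'."""
    return Counter(scan_year(md) or "unknown" for md in items.values())


if __name__ == "__main__":
    # Placeholder identifiers and dates, for illustration only.
    bad_items = {
        "placeholder_item_a": {"scandate": "20110304120000"},
        "placeholder_item_b": {"scandate": "20210515093000"},
        "placeholder_item_c": {},
    }
    print(count_by_scan_year(bad_items))
```

If the bad results cluster in recent years, that would point to a change in scanning practice; if they are spread evenly, a data problem seems more likely.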
Ah, I see what you are saying about the cloth cover issue, @hornc. For some reason I wasn't seeing the whole list, even though I created the thing. :)
I will look at trying to get some more recent data for further discussion.
> My understanding ~is~? / was that when a book was scanned, its best representative "title" page was marked -- if the title appeared clearly on the cover, that's what would be chosen, if there was an internal "title page", as is often the case with older cloth or leather bound books, that's what would be at `title.jpg`
That appears to no longer be the case; now it appears that `title.jpg` is always available regardless of the cover, and set to the title page. Note the code snippet above defaults to `/title.jpg`! It only uses `cover.jpg` as a fallback if `title.jpg` 404s. So effectively this is always returning title pages!
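To make the fallback behaviour concrete, here is a minimal sketch of the logic as described above. It is not a copy of the linked `ia.py` code. The base URL is an assumption, and the status check is passed in as a function (`head_status`, a hypothetical name) so the logic can be exercised without touching the network.

```python
from typing import Callable

# ASSUMPTION: page images are served under /download/<ocaid>/page/.
BASE_URL = "https://archive.org/download/{ocaid}/page/"


def pick_cover_url(ocaid: str, head_status: Callable[[str], int]) -> str:
    """Prefer title.jpg; fall back to cover.jpg only when title.jpg returns 404."""
    base = BASE_URL.format(ocaid=ocaid)
    title_url = base + "title.jpg"
    if head_status(title_url) == 404:
        return base + "cover.jpg"
    return title_url


if __name__ == "__main__":
    # Simulated responses: title.jpg exists, so the cover is never considered.
    print(pick_cover_url("operationjanus0000cros", lambda url: 200))
    # Simulated responses: title.jpg is missing, so cover.jpg is used instead.
    print(pick_cover_url("operationjanus0000cros", lambda url: 404))
```

With this shape, whenever `title.jpg` exists for an item, `cover.jpg` is never used, which matches the observation that title pages are effectively always returned.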
Here is a comparison of the three endpoints, based on the 100 most recent importbot edit OCAIDs. It seems like `default.jpg` has been identical to `covers.jpg` in these cases, but I think that might just be because the items appear to be a little old. I know Jude recently added new logic for cloth cover detection, but I'm not sure which endpoint uses it.
Comparison
@cdrini The feature / user story here as I see it is:
In order to get a visual indication of what the book is, (as a library patron) I want to see a book image with the title, author, and cover artwork (if any).
This means for most modern style books we want `cover_url`, but for cloth bound books and similar, we need `title_url`, since looking at the cloth does not satisfy the above.
AFAICT `default_url` is always the same image as `cover_url`?
There used to be either a manual process or automated smarts to figure this out correctly, and it looks like it's no longer working. archive.org is probably going to have the same issue -- maybe there is another way to determine cloth bound and pick the correct and useful image?