Calculate the height and width of the 3.3 million covers missing them
Problem
Followup to https://github.com/internetarchive/openlibrary/issues/9156#issuecomment-2286294367
Related: #10250.
There are 3,370,502 covers that have a null height/width. We should figure out how to backfill the height and width for these covers.
I don't know how that would work, so @cdrini or someone should update this ticket with more info 👍
#10250 is a follow-up to this.
Context
- Browser (Chrome, Safari, Firefox, etc):
- OS (Windows, Mac, etc):
- Logged in (Y/N): Y
- Environment (prod, dev, local): prod
Instructions for Contributors
- Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to GitHub, because the pre-commit bot may add commits to your PRs upstream.
Hi @RayBB @mekarpeles @scottbarnes ,
I would like to work on this issue regarding calculating the height and width of the 3.3 million missing covers. I've already set up OpenLibrary on my local machine, so I can start working on it right away.
Could you please provide more clarification on how you'd like to handle this? Specifically, I would appreciate more details on how to retrieve and backfill the missing height and width values for the covers. I’m eager to get started and will try to wrap this up as quickly as possible under your guidance.
Thank you!
I am not sure of the best way to handle this, short of examining every image. It could be done remotely via the API, but I think that may take a fairly long time. If we had cover zips publicly available, those could be used, but those aren't currently available.
This is another one where it may be easiest for staff to do. I need to check in with @mekarpeles about time and priorities for how we want to go forward with this one.
Alright @scottbarnes, please discuss with @mekarpeles, and if there's anything in this issue that I can help with, I'd love to contribute!
If possible, I think it would be great for us to have a remediation script living in the scripts directory or the coverstore service. For whatever reason, we'll likely hit cases where we get out of sync and some covers don't have dimensions.
We'd love help writing this script, though ultimately I think it would be best if this is a script that is run by staff as a job on covers.
The script should iterate over the rows in the covers data dump that have no dimensions and, for each row, use the cover API to fetch the cover and get its dimensions, then write the result in a sensible way to an output file (e.g. jsonl, csv) which we can then load to update the dimensions fields in the db.
Bonus: if the script has an option to control how many requests are made in parallel.
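To sketch roughly what such a script could look like, here is a minimal, hypothetical version. The script name, the dump column order (cover id in column 1, width/height in columns 2-3, matching the `cut -f 2-3` usage later in this thread), and the original-image endpoint are all assumptions to be confirmed, not a spec.

```python
# scripts/backfill_cover_dimensions.py (hypothetical name/location)
import gzip
import io
import json
import sys

import requests        # allowed alongside the standard library, per below
from PIL import Image  # Pillow is already a project dependency

# Assumed endpoint for the original-size image; confirm with coverstore staff.
COVER_URL = "https://covers.openlibrary.org/b/id/{cover_id}.jpg"


def rows_missing_dimensions(dump_path):
    """Yield cover ids from dump rows whose width/height columns are empty."""
    with gzip.open(dump_path, "rt") as fh:
        for line in fh:
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 3:
                continue
            cover_id, width, height = cols[0], cols[1], cols[2]
            if not width or not height:
                yield int(cover_id)


def fetch_dimensions(cover_id):
    """Download one cover and return (width, height) via Pillow."""
    resp = requests.get(COVER_URL.format(cover_id=cover_id), timeout=30)
    resp.raise_for_status()
    with Image.open(io.BytesIO(resp.content)) as img:
        return img.width, img.height


def main(dump_path, out_path):
    """Write one JSONL record per backfilled cover."""
    with open(out_path, "a") as out:
        for cover_id in rows_missing_dimensions(dump_path):
            width, height = fetch_dimensions(cover_id)
            out.write(json.dumps(
                {"cover_id": cover_id, "height": height, "width": width}) + "\n")


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```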
Hi @RayBB @mekarpeles @scottbarnes,
I'm ready to help with the task of backfilling the missing height and width values for the 3.3 million covers. Here's my plan:
- First, I will write a script that iterates over the covers with missing dimensions, fetches the cover dimensions using the cover API, and saves the data to an output file (JSONL or CSV).
- I plan to make parallel requests to reduce the processing time, using libraries like asyncio or concurrent.futures to handle multiple requests simultaneously (a minimal sketch follows this list).
- I will first test the script on a small subset of covers to ensure it functions correctly before processing the full dataset.
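To make the parallel-requests idea concrete, here is a minimal sketch using only `concurrent.futures` from the standard library. `fetch_one` stands in for whatever per-cover fetch function the script ends up with, and the worker count is kept deliberately small so the cover server isn't swamped.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_many(cover_ids, fetch_one, max_workers=4):
    """Run fetch_one(cover_id) -> (width, height) over a bounded worker pool."""
    results, failures = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, cid): cid for cid in cover_ids}
        for future in as_completed(futures):
            cid = futures[future]
            try:
                results[cid] = future.result()
            except Exception as exc:  # record failures and keep going
                failures.append((cid, str(exc)))
    return results, failures
```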
Once the script is ready and thoroughly tested, I will push it to the relevant directory and provide updates on my progress. Once assigned, I'll start working on the task and will ask for clarification if I have any doubts about the task or the file structure.
Thank you!
Thanks for offering to work on this, @SharkyBytes. I assigned this to you.
For my part, I am not super focused on performance so please make the concurrency the last step. We don't want to swamp the server so this is going to run slowly, and likely for days, perhaps many days. Time is not of the essence, but not impacting performance for patrons is.
More important than concurrency, the script should be able to handle network failures and retry using exponential backoff. Requests should also have a timeout.
Since this is likely to be long-running, and we're iterating over a dump file, please also, as a stretch goal at least, consider adding a way to keep track of progress (e.g. writing the current line number from the dump to a status file, or perhaps simply printing it to the screen, though that output could be lost in a reboot).
Please also stick to the standard library, with the exception of Pillow (and requests if you want to use that). If it turns out not to make sense to stick to the standard library, please bring it up and we can talk about what to do.
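As a rough illustration of the retry and progress-tracking points above, here is a sketch using only the standard library plus requests; the status-file name and the retry parameters are arbitrary placeholders.

```python
import time

import requests

STATUS_FILE = "backfill_progress.txt"  # arbitrary name for the resume marker


def get_with_backoff(url, max_attempts=5, base_delay=1.0, timeout=30):
    """GET a URL, retrying on network errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s, ...


def save_progress(line_number):
    """Record the last processed dump line so a restart can resume from it."""
    with open(STATUS_FILE, "w") as fh:
        fh.write(str(line_number))


def load_progress():
    """Return the dump line to resume from (0 if there is no status file)."""
    try:
        with open(STATUS_FILE) as fh:
            return int(fh.read().strip())
    except (FileNotFoundError, ValueError):
        return 0
```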
In terms of output, it's up to you whether to use jsonl or csv. Something similar to either of the following probably makes sense.

JSONL:
```
{"cover_id": 12345, "height": 20, "width": 10}
{"cover_id": 67890, "height": 500, "width": 400}
```

CSV:
```
cover_id,height,width
12345,20,10
67890,500,400
```
Finally, please include some sort of integration test so we can run the script on a sample of lines from the covers dump (say 10 lines), including both lines with height and width and lines without, and verify that the output written to the file is correct.
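A hypothetical shape for that test, using pytest's `tmp_path` and `monkeypatch`; the module name `backfill_cover_dimensions` and its `fetch_dimensions`/`main` functions follow the earlier sketch and would need to match the real script.

```python
# test_backfill_cover_dimensions.py (sketch only)
import gzip
import json

import backfill_cover_dimensions as script


def test_backfill_on_small_sample(tmp_path, monkeypatch):
    # A tiny dump: two rows with dimensions, two without.
    dump = tmp_path / "covers_sample.txt.gz"
    rows = [
        "1\t333\t500",  # already has dimensions -> should be skipped
        "2\t\t",        # missing -> should be backfilled
        "3\t315\t500",
        "4\t\t",
    ]
    with gzip.open(dump, "wt") as fh:
        fh.write("\n".join(rows) + "\n")

    # Stub the network call so the test never hits the covers API.
    monkeypatch.setattr(script, "fetch_dimensions", lambda cover_id: (100, 150))

    out = tmp_path / "out.jsonl"
    script.main(str(dump), str(out))

    records = [json.loads(line) for line in out.read_text().splitlines()]
    assert {r["cover_id"] for r in records} == {2, 4}
    assert all(r["width"] == 100 and r["height"] == 150 for r in records)
```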
@scottbarnes Is it feasible to calculate cover sizes on upload so that we don't need to keep running this regularly? Seems like it wouldn't require much effort to make this problem go away for good but I'm not so familiar with the cover service.
I think calculating the cover sizes on upload/import probably makes more sense than running this script more than once. We already have Pillow as a requirement for the project so it should be pretty easy, though a separate issue from this one. I can make an issue later once I've had time to look at it a bit more.
Edit: I made #10250 and edited the issue description to include that related issue. My preference would be to address that only if we know it's a problem (which perhaps someone with more knowledge can simply make a definitive statement about).
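For reference, since Pillow is already a dependency, the calculation itself at upload time could be as small as the sketch below; where it would hook into the coverstore upload path is the open question for #10250.

```python
import io

from PIL import Image


def dimensions_from_upload(image_bytes):
    """Return (width, height) for an uploaded cover image, computed at ingest."""
    with Image.open(io.BytesIO(image_bytes)) as img:
        return img.width, img.height
```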
@scottbarnes is this still open?
I believe so.
To move this along, I assigned it to myself and am running the script to populate the height and width values via the API for now.
Note: Most of these have now been resolved thanks to @scottbarnes! A smattering (~70k) errored for various reasons; Scott's investigating these.
The new sizes were included for the first time in the 2025-05-31 dump. That dump includes a couple of million suspicious-looking cover sizes (mostly pre-existing, not from the new calculations) which could be Amazon swooshes or some other type of thumbnail. Here are the top 15 dimension pairs (the first column is the occurrence count):
gzcat ol_dump_covers_metadata_2025-05-31.txt.gz | cut -f 2-3 | sort | uniq -c | sort -r -n | head -n 15
891985 333 500
217135 315 500
139645 324 500
119204 313 500
104094 386 500
96187 500 500
95204 375 500
86755 128 192
84451 331 500
83904 329 500
82928 323 500
80714 350 500
78253 330 500
77789
72749 328 500
See also #9737 for the Amazon swooshes.