Calculate the height and width of the 3.3 million covers missing them
Problem
Followup to https://github.com/internetarchive/openlibrary/issues/9156#issuecomment-2286294367
Related: #10250.
There are 3,370,502 covers that have a null height/width. We should figure out how to backfill the height and width for these covers.
I don't know how that would work, so @cdrini or someone should update this ticket with more info 👍
#10250 is a follow-up to this.
Context
- Browser (Chrome, Safari, Firefox, etc):
- OS (Windows, Mac, etc):
- Logged in (Y/N): Y
- Environment (prod, dev, local): prod
Instructions for Contributors
- Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to GitHub, because the pre-commit bot may add commits to your PRs upstream.
Hi @RayBB @mekarpeles @scottbarnes ,
I would like to work on this issue regarding calculating the height and width of the 3.3 million missing covers. I've already set up OpenLibrary on my local machine, so I can start working on it right away.
Could you please provide more clarification on how you'd like to handle this? Specifically, I would appreciate more details on how to retrieve and backfill the missing height and width values for the covers. I’m eager to get started and will try to wrap this up as quickly as possible under your guidance.
Thank you!
I am not sure of the best way to handle this, short of examining every image. It could be done remotely via the API, but I think that may take a fairly long time. If we had cover zips publicly available, those could be used, but those aren't currently available.
This is another one where it may be easiest for staff to do. I need to check in with @mekarpeles about time and priorities for how we want to go forward with this one.
Alright @scottbarnes, please discuss with @mekarpeles, and if there's anything in this issue that I can help with, I'd love to contribute!
If possible, I think it would be great for us to have a remediation script living in the scripts directory or the coverstore service. For whatever reason, we'll likely hit cases where we get out of sync and some covers don't have dimensions.
We'd love help writing this script, though ultimately I think it would be best if this is a script that is run by staff as a job on covers.
The script should iterate over the rows in the covers data dump that have no dimensions and, for each row, use the cover API to fetch the cover and get its dimensions, then write the result in a sensible way to an output file (e.g. jsonl, csv) which we can then load to update the dimensions fields in the db.
Bonus: if the script has an option to control how many requests are made in parallel.
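To sketch roughly what such a script could look like, here is a minimal, hypothetical version. The script name, the dump column order (cover id in column 1, width/height in columns 2-3, matching the `cut -f 2-3` usage later in this thread), and the original-image endpoint are all assumptions to be confirmed, not a spec.

```python
# scripts/backfill_cover_dimensions.py (hypothetical name/location)
import gzip
import io
import json
import sys

import requests        # allowed alongside the standard library, per below
from PIL import Image  # Pillow is already a project dependency

# Assumed endpoint for the original-size image; confirm with coverstore staff.
COVER_URL = "https://covers.openlibrary.org/b/id/{cover_id}.jpg"


def rows_missing_dimensions(dump_path):
    """Yield cover ids from dump rows whose width/height columns are empty."""
    with gzip.open(dump_path, "rt") as fh:
        for line in fh:
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 3:
                continue
            cover_id, width, height = cols[0], cols[1], cols[2]
            if not width or not height:
                yield int(cover_id)


def fetch_dimensions(cover_id):
    """Download one cover and return (width, height) via Pillow."""
    resp = requests.get(COVER_URL.format(cover_id=cover_id), timeout=30)
    resp.raise_for_status()
    with Image.open(io.BytesIO(resp.content)) as img:
        return img.width, img.height


def main(dump_path, out_path):
    """Write one JSONL record per backfilled cover."""
    with open(out_path, "a") as out:
        for cover_id in rows_missing_dimensions(dump_path):
            width, height = fetch_dimensions(cover_id)
            out.write(json.dumps(
                {"cover_id": cover_id, "height": height, "width": width}) + "\n")


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```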
Hi @RayBB @mekarpeles @scottbarnes,
I'm ready to help with the task of backfilling the missing height and width values for the 3.3 million covers. Here's my plan:
- First, I will write a script that iterates over the covers with missing dimensions, fetches the cover dimensions using the cover API, and saves the data to an output file (JSONL or CSV).
- I plan to make parallel requests to reduce the processing time, using libraries like asyncio or concurrent.futures to handle multiple requests simultaneously (a minimal sketch follows this list).
- I will first test the script on a small subset of covers to ensure it functions correctly before processing the full dataset.
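To make the parallel-requests idea concrete, here is a minimal sketch using only `concurrent.futures` from the standard library. `fetch_one` stands in for whatever per-cover fetch function the script ends up with, and the worker count is kept deliberately small so the cover server isn't swamped.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_many(cover_ids, fetch_one, max_workers=4):
    """Run fetch_one(cover_id) -> (width, height) over a bounded worker pool."""
    results, failures = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, cid): cid for cid in cover_ids}
        for future in as_completed(futures):
            cid = futures[future]
            try:
                results[cid] = future.result()
            except Exception as exc:  # record failures and keep going
                failures.append((cid, str(exc)))
    return results, failures
```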
Once the script is ready and thoroughly tested, I will push it to the relevant directory and provide updates on my progress. Once assigned, I'll start working on the task and will ask for clarification if I have any doubts about the task or the file structure.
Thank you!
Thanks for offering to work on this, @SharkyBytes. I assigned this to you.
For my part, I am not super focused on performance so please make the concurrency the last step. We don't want to swamp the server so this is going to run slowly, and likely for days, perhaps many days. Time is not of the essence, but not impacting performance for patrons is.
More important than concurrency, the script should be able to handle network failures and retry using exponential backoff. Requests should also have a timeout.
Since this is likely to be long-running, and we're iterating over a dump file, please also, as a stretch goal at least, consider adding a way to keep track of progress (e.g. writing the current line number from the dump to a status file, or perhaps simply printing it to the screen, though that output could be lost in a reboot).
Please also stick to the standard library, with the exception of Pillow (and requests if you want to use that). If it turns out not to make sense to stick to the standard library, please bring it up and we can talk about what to do.
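As a rough illustration of the retry and progress-tracking points above, here is a sketch using only the standard library plus requests; the status-file name and the retry parameters are arbitrary placeholders.

```python
import time

import requests

STATUS_FILE = "backfill_progress.txt"  # arbitrary name for the resume marker


def get_with_backoff(url, max_attempts=5, base_delay=1.0, timeout=30):
    """GET a URL, retrying on network errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s, ...


def save_progress(line_number):
    """Record the last processed dump line so a restart can resume from it."""
    with open(STATUS_FILE, "w") as fh:
        fh.write(str(line_number))


def load_progress():
    """Return the dump line to resume from (0 if there is no status file)."""
    try:
        with open(STATUS_FILE) as fh:
            return int(fh.read().strip())
    except (FileNotFoundError, ValueError):
        return 0
```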
In terms of output, it's up to you whether to use jsonl or csv. Something similar to either of the following probably makes sense.

JSONL:
```
{"cover_id": 12345, "height": 20, "width": 10}
{"cover_id": 67890, "height": 500, "width": 400}
```

CSV:
```
cover_id,height,width
12345,20,10
67890,500,400
```
Finally, please include some sort of integration test so we can run the script on a sample of lines from the covers dump (say 10 lines), including both lines with height and width and lines without, and verify that the output written to the file is correct.
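A hypothetical shape for that test, using pytest's `tmp_path` and `monkeypatch`; the module name `backfill_cover_dimensions` and its `fetch_dimensions`/`main` functions follow the earlier sketch and would need to match the real script.

```python
# test_backfill_cover_dimensions.py (sketch only)
import gzip
import json

import backfill_cover_dimensions as script


def test_backfill_on_small_sample(tmp_path, monkeypatch):
    # A tiny dump: two rows with dimensions, two without.
    dump = tmp_path / "covers_sample.txt.gz"
    rows = [
        "1\t333\t500",  # already has dimensions -> should be skipped
        "2\t\t",        # missing -> should be backfilled
        "3\t315\t500",
        "4\t\t",
    ]
    with gzip.open(dump, "wt") as fh:
        fh.write("\n".join(rows) + "\n")

    # Stub the network call so the test never hits the covers API.
    monkeypatch.setattr(script, "fetch_dimensions", lambda cover_id: (100, 150))

    out = tmp_path / "out.jsonl"
    script.main(str(dump), str(out))

    records = [json.loads(line) for line in out.read_text().splitlines()]
    assert {r["cover_id"] for r in records} == {2, 4}
    assert all(r["width"] == 100 and r["height"] == 150 for r in records)
```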
@scottbarnes Is it feasible to calculate cover sizes on upload so that we don't need to keep running this regularly? Seems like it wouldn't require much effort to make this problem go away for good but I'm not so familiar with the cover service.
I think calculating the cover sizes on upload/import probably makes more sense than running this script more than once. We already have Pillow as a requirement for the project so it should be pretty easy, though a separate issue from this one. I can make an issue later once I've had time to look at it a bit more.
Edit: I made #10250 and edited the issue description to include that related issue. My preference would be to address that only if we know it's a problem (which perhaps someone with more knowledge can simply make a definitive statement about).
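For reference, since Pillow is already a dependency, the calculation itself at upload time could be as small as the sketch below; where it would hook into the coverstore upload path is the open question for #10250.

```python
import io

from PIL import Image


def dimensions_from_upload(image_bytes):
    """Return (width, height) for an uploaded cover image, computed at ingest."""
    with Image.open(io.BytesIO(image_bytes)) as img:
        return img.width, img.height
```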
@scottbarnes is this still open?
I believe so.
To move this along, I assigned it to myself and am running the script to populate the height and width values via the API for now.
Note: Most of these have now been resolved thanks to @scottbarnes! A smattering (~70k) errored for various reasons; Scott's investigating these.
The new sizes were included for the first time in the 2025-05-31 dump. That dump includes a couple of million suspicious-looking cover sizes (mostly pre-existing, not from the new calculations) which could be Amazon swooshes or some other type of thumbnail. Here are the top 15 dimension pairs (the first column is the occurrence count):
gzcat ol_dump_covers_metadata_2025-05-31.txt.gz | cut -f 2-3 | sort | uniq -c | sort -r -n | head -n 15
891985 333 500
217135 315 500
139645 324 500
119204 313 500
104094 386 500
96187 500 500
95204 375 500
86755 128 192
84451 331 500
83904 329 500
82928 323 500
80714 350 500
78253 330 500
77789
72749 328 500
See also #9737 for the Amazon swooshes.