openlibrary
Remove 40k+ amazon placeholder images
Problem
Followup to https://github.com/internetarchive/openlibrary/issues/9156
We should remove the "no image available" images from Amazon.
Like this one: https://ia800505.us.archive.org/view_archive.php?archive=/5/items/m_covers_0012/m_covers_0012_53.zip&file=0012539916-M.jpg
Here's a CSV of the covers with those dimensions https://github.com/user-attachments/files/16592832/covers_60_40.csv
I did a quick check and most I saw were the amazon image with md5 hash 681e43bb536038b0ecb97ed0c13b5948 but out of the 1000 I loaded I did see one that was different.
Proposal & Constraints
What is the proposed solution / implementation?
- Calculate the md5 hash of all the images in this CSV
- Make a list of those with the Amazon image MD5 hash
- Intersect those cover IDs with the editions and works data dumps to figure out which editions/works need to be edited
- Remove those covers
- Update the import process so covers with the relevant md5sum value are no longer added.
Leads
Related files
Stakeholders
@cdrini feel free to edit/improve this with your thoughts so this is ready for someone to work on 👍
Instructions for Contributors
- Please run these commands to ensure your repository is up to date before creating a new branch for this issue, and again each time you push code to GitHub, because the pre-commit bot may add commits to your PRs upstream.
It'll be nice to see these disappear! If that cover is representative, you could save a little time by filtering for images with a length of 1770 bytes before doing the md5sum. Also, if you're templating URLs for the covers, the starting URL before redirection is https://covers.openlibrary.org/b/id/0012539916-[S|M|L].jpg.
Hi everyone, is this issue still active? Based on the initial comment, it seems the issue was to be prepared for work. Can anyone confirm if it’s ready to be addressed now?
@OutstandingWork, this is ready to be addressed and I have assigned it to you.
@tfmorris, how were you checking the size first? I tried curl -I and curl -LI but couldn't seem to get the Content-Length.
Hi @scottbarnes can you please give some directions on how to start.
- Using this CSV file, where can I look for the images to compute their MD5 hashes?
- Do I need special access to the Open Library databases to perform this?
- Where can I access the database so that I can remove those images?
Good questions, @OutstandingWork. Let's tackle (1) for now.
Maybe someone will have a much better idea, but iterating through covers_60_40.csv to check for matching md5 hashes is probably a good start.
You may be able to do this for just one size (e.g. -M), and that may indicate the other two sizes are/are not this Amazon image. You may also be able to get the image size somehow, as tfmorris suggested, without downloading the image, so you'd only need to download images that might match the md5 hash (and thereby speed this up), but after some brief experimenting I failed at that.
Once you've got all the IDs that match the hash, the next step would be parsing the Open Library data dump as Ray mentioned, to look for this cover ID in editions and works, and extracting the edition and work IDs so we can remove the covers by scripting that process, but for now, let's just get the relevant IDs.
Absent a better way (and there could easily be one that doesn't involve downloading every single cover image), something like this should work, if quite slowly:
for id in $(cat ~/Downloads/covers_60_40.csv); do
[ "$(curl -Ls https://covers.openlibrary.org/b/id/${id}-M.jpg | md5sum | awk '{print $1}')" = "681e43bb536038b0ecb97ed0c13b5948" ] && echo ${id};
done | tee these_match.txt
Testing this out:
❯ for id in $(cat ~/Downloads/covers_60_40.csv); do [ "$(curl -Ls https://covers.openlibrary.org/b/id/${id}-M.jpg | md5sum | awk '{print $1}')" = "681e43bb536038b0ecb97ed0c13b5948" ] && echo ${id}; done | tee these_match.txt
11523496
11523556
11523560
11523581
11523582
11523583
^C
❯ cat these_match.txt
11523496
11523556
11523560
11523581
11523582
11523583
Also, if anyone has any better ideas, please don't hesitate to chime in. :)
Hi @scottbarnes, apologies for the delayed response. I first tried computing the md5sum for each ID in the CSV file against the target md5sum, but the process seemed slow. Then, as suggested by tfmorris, I measured the size of the target image and added a size check first. To speed things up further I wrote a Python script to fetch the images in parallel, but that resulted in error 429 (Too Many Requests). Can anyone suggest whether I should just wait longer to collect the image IDs with matching md5sums, or is there some other method to do this in less time?
How long will you have to wait to load all the images and compute the hash? For 40k images at 2 seconds each would be less than a day which seems fine. Just make sure you're writing idempotent code so if your script crashes you don't lose progress and can continue right where you left off. @OutstandingWork
It takes around 1 second per image for me; I will do it part by part as suggested.
Hi everyone, I have extracted the IDs of the images which match our target image. The count was around 43k (out of 45k total IDs). Can anyone suggest what the next step would be from here?
Please upload the list (of both matching and not matching) here as a CSV and then I'm sure Scott can provide more info on what to do next.
It might also be a good idea to prevent these images from uploading in the future by hardcoding that hash into the upload pipeline?
Sure, here is the CSV file with the matching and non-matching IDs: List.csv
I was just spot checking List.csv, and I wonder if something went wrong. I checked the last non-matching ID (at least if I am reading the CSV correctly), but the md5sum matches:
❯ curl -Ls https://covers.openlibrary.org/b/id/0012340198-M.jpg | md5sum | awk '{print $1}'
681e43bb536038b0ecb97ed0c13b5948
In any event, once that's sorted, you'll want to parse, I imagine, the "all_types_dump" to get the work and edition IDs for the cover IDs that have matching md5sums. E.g., for that cover ID above, OL26392797W (https://openlibrary.org/works/OL26392797W/PANQUEQUES_DE_MANZANAS) is a work with cover ID 12340198.
We'd want to keep track of OL26392797W so we can update it to remove the Amazon cover.
With respect to List.csv, am I understanding the format correctly, insofar as if there is an ID in the second field, that ID isn't a match? I wonder if maybe it would be easier for parsing to just have a file of matches and a file of non-matches, one per line? Or if you'd like to stick with the CSV format, then have the first field always be the cover ID, and the second field be true/false, 1/0, etc. for matching/non-matching.
I just went ahead and ran the shell script I shared above. these_match.txt
There were only a handful that didn't match. Some don't exist in the latest dump (deleted covers?), and at least one has the same appearance as the Amazon placeholder image, but a different md5sum. This is the one I spot checked that had the placeholder with a different hash value: https://openlibrary.org/books/OL11077440M.
Here are the ones that didn't match:
❯ grep -Fvf ~/Downloads/these_match.txt ~/Downloads/covers_60_40.csv
cover_id
6275921
10828140
10828141
10828143
10828150
10828152
10828160
10828178
10828179
10828180
10828187
10828208
10828227
10828237
10828244
Let's not worry about these just now.
The next step would be parsing the Open Library data dump as Ray mentioned, to look for this cover ID in editions and works, and extracting the edition and work IDs so we can remove the covers by scripting that process. The 'all types dump' may be the easiest to work with: https://openlibrary.org/developers/dumps.
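That parsing step could be sketched as below. This assumes the dump's documented shape (tab-separated columns: type, key, revision, last_modified, JSON) and that works/editions list cover IDs under a `covers` key, as in the OL26392797W example above; verify both against a few real lines first.

```python
import json

def records_with_bad_covers(lines, bad_ids):
    """Yield (key, covers) for dump records referencing a placeholder cover.

    lines: iterable of raw dump lines (type \t key \t revision \t
    last_modified \t JSON).
    bad_ids: set of int cover IDs that matched the placeholder hash.
    """
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # skip malformed lines
        record = json.loads(cols[4])
        covers = record.get("covers") or []
        if any(c in bad_ids for c in covers):
            yield cols[1], covers  # e.g. ("/works/OL26392797W", [12340198])
```

Streaming the 33 GB file line by line like this keeps memory flat; the output would be the list of edition/work keys to edit.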
Hi @scottbarnes, I have rechecked the non-matching IDs and found that only 19 did not match. So what we have to do now is take the matching IDs, iterate through the all-types dump, and store the extracted edition and work IDs. It might take time, though, since the all-types dump is a 33 GB text file. Please correct me if my approach is off.
@OutstandingWork I think you should be able to do it with the editions and works dumps. Also, using DuckDB (there's an example on the dump page) you should be able to extract just the work IDs and covers pretty quickly for easier processing against your list in a following step.
@OutstandingWork are you still willing to work on this one?