
Remove 40k+ amazon placeholder images

RayBB opened this issue 1 year ago • 17 comments

Problem

Followup to https://github.com/internetarchive/openlibrary/issues/9156

We should remove the "no image available" images from Amazon, like this one: https://ia800505.us.archive.org/view_archive.php?archive=/5/items/m_covers_0012/m_covers_0012_53.zip&file=0012539916-M.jpg

Here's a CSV of the covers with those dimensions https://github.com/user-attachments/files/16592832/covers_60_40.csv

I did a quick check and most of those I saw were the Amazon image with MD5 hash 681e43bb536038b0ecb97ed0c13b5948, but out of the 1,000 I loaded I did see one that was different.

Proposal & Constraints

What is the proposed solution / implementation?

  1. Calculate the md5 hash of all the images in this CSV
  2. Make a list of those with the Amazon image's MD5 hash
  3. Intersect those cover IDs with the editions and works data dumps to figure out which editions/works need to be edited
  4. Remove those covers
  5. Update the import process so covers with the relevant md5sum value are no longer added.
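Steps 1 and 2 could be sketched in Python like this. This is only a sketch: `fetch_cover` and its URL template are assumptions based on the cover URLs mentioned elsewhere in this thread, and the hash is the one reported above.

```python
import hashlib
from urllib.request import urlopen

# Known MD5 of the Amazon "no image available" placeholder (from this issue).
AMAZON_PLACEHOLDER_MD5 = "681e43bb536038b0ecb97ed0c13b5948"

def is_amazon_placeholder(image_bytes: bytes) -> bool:
    """Step 2: compare an image's MD5 hash against the placeholder's hash."""
    return hashlib.md5(image_bytes).hexdigest() == AMAZON_PLACEHOLDER_MD5

def fetch_cover(cover_id: str) -> bytes:
    """Step 1 helper (assumed URL template for the medium-size cover)."""
    url = f"https://covers.openlibrary.org/b/id/{cover_id}-M.jpg"
    with urlopen(url) as resp:
        return resp.read()
```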

Leads

Related files

Stakeholders

@cdrini feel free to edit/improve this with your thoughts so this is ready for someone to work on 👍


Instructions for Contributors

  • Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue, and each time after pushing code to GitHub, because the pre-commit bot may add commits to your PRs upstream.

RayBB avatar Aug 13 '24 14:08 RayBB

It'll be nice to see these disappear! If that cover is representative, you could save a little time by filtering for images with a length of 1770 bytes before doing the md5sum. Also, if you're templating URLs for the covers, the starting URL before redirection is https://covers.openlibrary.org/b/id/0012539916-[S|M|L].jpg.
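That size filter could look something like this (a sketch only: 1770 bytes is the length tfmorris reports for the placeholder, and the MD5 is the one from the issue above):

```python
import hashlib

PLACEHOLDER_MD5 = "681e43bb536038b0ecb97ed0c13b5948"
PLACEHOLDER_SIZE = 1770  # bytes; reported length of the -M placeholder image

def might_be_placeholder(image_bytes: bytes) -> bool:
    # Cheap length check first, so most images skip the hash entirely.
    if len(image_bytes) != PLACEHOLDER_SIZE:
        return False
    return hashlib.md5(image_bytes).hexdigest() == PLACEHOLDER_MD5
```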

tfmorris avatar Aug 13 '24 20:08 tfmorris

Hi everyone, is this issue still active? Based on the initial comment, it seems the issue still needed to be prepared for work. Can anyone confirm if it's ready to be addressed now?

OutstandingWork avatar Dec 19 '24 18:12 OutstandingWork

@OutstandingWork, this is ready to be addressed and I have assigned it to you.

@tfmorris, how were you checking the size first? I tried curl -I and curl -LI but couldn't seem to get the Content-Length.

scottbarnes avatar Dec 20 '24 14:12 scottbarnes

Hi @scottbarnes, can you please give some direction on how to start?

  1. Using this CSV file, where can I look for the images to compute their MD5 hashes?
  2. Do I need some special access to the Open Library databases to do this?
  3. Where can I access the database so that I can remove those images?

OutstandingWork avatar Dec 22 '24 05:12 OutstandingWork

Good questions, @OutstandingWork. Let's tackle (1) for now.

Maybe someone will have a much better idea, but iterating through covers_60_40.csv to check for matching md5 hashes is probably a good start.

You may be able to do this for just one size (e.g. -M), and that may indicate the other two sizes are/are not this Amazon image. You may also be able to get the image size somehow, as tfmorris suggested, without downloading the image, so you'd only need to download images that might match the md5 hash (and thereby speed this up), but after some brief experimenting I failed at that.

Once you've got all the IDs that match the hash, the next step would be parsing the Open Library data dump, as Ray mentioned, to look for these cover IDs in editions and works and to extract the edition and work IDs so we can script removing the covers. For now, though, let's just get the relevant IDs.

Absent a better way (and there could easily be one that doesn't involve downloading every single cover image), something like this should work, if quite slowly:

for id in $(cat ~/Downloads/covers_60_40.csv); do
  [ "$(curl -Ls https://covers.openlibrary.org/b/id/${id}-M.jpg | md5sum | awk '{print $1}')" = "681e43bb536038b0ecb97ed0c13b5948" ] && echo ${id};
done | tee these_match.txt

Testing this out:

❯ for id in $(cat ~/Downloads/covers_60_40.csv); do [ "$(curl -Ls https://covers.openlibrary.org/b/id/${id}-M.jpg | md5sum | awk '{print $1}')" = "681e43bb536038b0ecb97ed0c13b5948" ] && echo ${id}; done | tee these_match.txt
11523496
11523556
11523560
11523581
11523582
11523583
^C

❯ cat these_match.txt
11523496
11523556
11523560
11523581
11523582
11523583

Also, if anyone has any better ideas, please don't hesitate to chime in. :)

scottbarnes avatar Dec 22 '24 06:12 scottbarnes

Hi @scottbarnes, apologies for the delayed response. I first tried computing the md5sum for each ID in the CSV file and comparing it against the target md5sum, but the process seemed slow. Then, as suggested by tfmorris, I computed the size of the target image and added a size check. To speed things up further I wrote a Python script to fetch the images in parallel, but that resulted in error 429 (too many requests). Can anyone suggest whether I should simply wait longer to collect the image IDs with matching md5sums, or is there some other method to achieve this in less time?

OutstandingWork avatar Dec 26 '24 13:12 OutstandingWork

How long will you have to wait to load all the images and compute the hashes? For 40k images at 2 seconds each, that's less than a day, which seems fine. Just make sure you're writing idempotent code, so if your script crashes you don't lose progress and can continue right where you left off. @OutstandingWork
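One way to make the loop idempotent, sketched here: append each checked ID to a progress file as you go, and skip already-recorded IDs on restart. The file name and record format are made up for illustration.

```python
import os

def remaining_ids(all_ids, progress_path="checked_ids.txt"):
    """Return IDs not yet recorded in progress_path, so a crashed run resumes cleanly."""
    done = set()
    if os.path.exists(progress_path):
        with open(progress_path) as f:
            done = {line.strip().split(",")[0] for line in f if line.strip()}
    return [i for i in all_ids if i not in done]

def record(cover_id, matched, progress_path="checked_ids.txt"):
    # Append-only writes survive a crash between iterations.
    with open(progress_path, "a") as f:
        f.write(f"{cover_id},{int(matched)}\n")
```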

RayBB avatar Dec 26 '24 17:12 RayBB

Each image takes around 1 second for me. I will then do it part by part, as suggested.

OutstandingWork avatar Dec 26 '24 17:12 OutstandingWork

Hi everyone, I have extracted the IDs of the images which match our target image. Their count was around 43k (out of roughly 45k total IDs). Can anyone suggest what the next step would be from here?

OutstandingWork avatar Dec 29 '24 18:12 OutstandingWork

Please upload the list (of both matching and not matching) here as a CSV and then I'm sure Scott can provide more info on what to do next.

RayBB avatar Dec 29 '24 18:12 RayBB

It might also be a good idea to prevent these images from uploading in the future by hardcoding that hash into the upload pipeline?

RayBB avatar Dec 29 '24 18:12 RayBB

Sure, here is the CSV file of matching and non-matching IDs: List.csv

OutstandingWork avatar Dec 30 '24 06:12 OutstandingWork

I was just spot checking List.csv, and I wonder if something went wrong. I checked the last non-matching ID, at least if I am reading the CSV correctly, but the md5sum matches:

❯ curl -Ls https://covers.openlibrary.org/b/id/0012340198-M.jpg | md5sum | awk '{print $1}'
681e43bb536038b0ecb97ed0c13b5948

In any event, once that's sorted, you'll want to parse, I imagine, the "all_types_dump" to get the work and edition IDs of the cover IDs that have matching md5sums. E.g., for that cover ID above, OL26392797W (https://openlibrary.org/works/OL26392797W/PANQUEQUES_DE_MANZANAS) is a work with cover ID 12340198.

We'd want to keep track of OL26392797W so we can update it to remove the Amazon cover.

With respect to List.csv, am I understanding the format correctly: if there is an ID in the second field, that ID isn't a match? I wonder if it would be easier for parsing to just have one file of matches and one of non-matches, one ID per line. Or, if you'd like to stick with the CSV format, have the first field always be the cover ID and the second field be true/false, 1/0, etc. for matching/non-matching.
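The cover-ID-plus-flag format could be produced with something like this (a sketch; the header names are just suggestions):

```python
import csv

def write_results(rows, out):
    """rows: iterable of (cover_id, matched) pairs; writes cover_id plus a 1/0 flag."""
    writer = csv.writer(out)
    writer.writerow(["cover_id", "matches_placeholder"])
    for cover_id, matched in rows:
        writer.writerow([cover_id, int(matched)])
```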

scottbarnes avatar Dec 30 '24 15:12 scottbarnes

I just went ahead and ran the shell script I shared above. these_match.txt

There were only a handful that didn't match. Some don't exist in the latest dump (deleted covers?), and at least one has the same appearance as the Amazon placeholder image, but a different md5sum. This is the one I spot checked that had the placeholder with a different hash value: https://openlibrary.org/books/OL11077440M.

Here are the ones that didn't match:

❯ grep -Fvf ~/Downloads/these_match.txt ~/Downloads/covers_60_40.csv
cover_id
6275921
10828140
10828141
10828143
10828150
10828152
10828160
10828178
10828179
10828180
10828187
10828208
10828227
10828237
10828244

Let's not worry about these just now.

The next step would be parsing the Open Library data dump as Ray mentioned, to look for this cover ID in editions and works, and extracting the edition and work IDs so we can remove the covers by scripting that process. The 'all types dump' may be the easiest to work with: https://openlibrary.org/developers/dumps.
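A sketch of that parsing step, assuming the dump's tab-separated layout in which the fifth column holds the record JSON (worth double-checking against the dump format documentation):

```python
import json

def records_with_bad_covers(dump_lines, bad_cover_ids):
    """Yield (key, matching cover IDs) for dump records whose 'covers' list
    intersects bad_cover_ids."""
    bad = set(bad_cover_ids)
    for line in dump_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # skip malformed rows
        record = json.loads(cols[4])
        hits = [c for c in record.get("covers", []) if c in bad]
        if hits:
            yield record["key"], hits
```

Streaming line by line like this avoids loading the whole multi-gigabyte dump into memory.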

scottbarnes avatar Dec 30 '24 23:12 scottbarnes

Hi @scottbarnes, I have rechecked the non-matching IDs and found that only 19 did not match. So the next step is to take the matching IDs, search for each of them in the all-types dump, and store the extracted edition and work IDs. It might take time, though, since the all-types dump is a 33 GB text file. Please correct me if my next approach is wrong.

OutstandingWork avatar Jan 05 '25 13:01 OutstandingWork

@OutstandingWork I think you should be able to do it with the editions and works dumps. Also, using DuckDB (an example is on the dumps page), you should be able to extract just the work IDs and covers pretty quickly for easier processing against your list in a following step.

RayBB avatar Jan 05 '25 14:01 RayBB

@OutstandingWork are you still willing to work on this one?

RayBB avatar Jun 13 '25 01:06 RayBB