findimagedupes

Quite different images considered 100% similar

Open porridge opened this issue 1 year ago • 5 comments

Here are a couple of images

a b

which findimagedupes 2.20.1-2 (Debian package) considers to be 100% similar:

$ findimagedupes --threshold 100% -v fp a.jpg b.jpg 
/////////////////////z8AAAAAAAAAAAAAAAAAAAA=  .../a.jpg
/////////////////////z8AAAAAAAAAAAAAAAAAAAA=  .../b.jpg
.../a.jpg .../b.jpg

(paths elided for brevity)

I tried analyzing how the program calculates the fingerprint and came up with this shell script:

#!/bin/bash
set -euo pipefail

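# Apply one convert operation to $f (set by the loop below), write the result
# into stepN/, then cd there so that the next step reads this step's output.
function step() {
  local s="$1"; shift
  mkdir -p "step${s}"
  convert "$@" "$f" "step${s}/$f"
  cd "step${s}"
}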

base="$(pwd)"
for f in *.jpg; do
  cd "$base"
  step 1 -sample '160x160!' 
  step 2 -modulate 100,-100
  step 3 -blur 3x99
  step 4 -normalize
  step 5 -equalize
  step 6 -sample '16x16'
  step 7 -threshold 50%
  step 8 -set magick mono
done

This indeed produces very similar images after step 8: a b but they are not identical, so I don't understand why the fingerprint printed in verbose mode by the program is exactly the same 🤔

I was so excited to find this program yesterday, since it's the first that I encountered that lets me automate the process and still customize the criteria for which image from an equivalence set to delete. However I found a few cases where it considers landscapes to be identical, which defeats the purpose :-(

I wonder if I tweaked the fingerprint generation function (for example to append the aspect ratio) - would this break the diffbits function? 🤔

porridge avatar Sep 06 '24 13:09 porridge

The reason for the discrepancy is that you seem to be using imagemagick rather than graphicsmagick, and the two produce different results in this case. Some of the operations have slightly different behaviour.

I have never looked into tweaking the algorithm (this program is a rewrite of an older one by Rob Kudla). I seem to recall I switched to graphicsmagick purely because, at the time, it crashed less than imagemagick. You could switch it back by just replacing Graphics::Magick with Image::Magick in two places in the code, and see if that gives more desirable results for your use case (but I don't expect overall accuracy will change much).

It would certainly be possible to compare additional information but aspect ratios aren't bitmaps so I doubt they can be usefully compared with just popcount (what diffbits does). Perhaps add an additional stage to filter/partition the findimagedupes output using your additional criteria. (If you haven't already found the -script / -program options, those may be helpful.) If I ever do the rewrite to use sqlite for the fingerprint database, storing/manipulating extra information to compare should become more viable.
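Such an extra filter/partition stage could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `partition_by_aspect`, the 2% tolerance, and the idea of passing image dimensions in as a dictionary are all hypothetical and not part of findimagedupes. Reading the dimensions and parsing the findimagedupes output are left to the caller.

```python
from typing import Dict, List, Tuple


def partition_by_aspect(
    group: List[str],
    sizes: Dict[str, Tuple[int, int]],
    tolerance: float = 0.02,  # assumed relative tolerance, pick your own
) -> List[List[str]]:
    """Split one reported duplicate set into subsets with similar aspect ratios."""
    buckets: List[Tuple[float, List[str]]] = []  # (reference ratio, members)
    for path in group:
        width, height = sizes[path]
        ratio = width / height
        for ref, members in buckets:
            if abs(ratio - ref) <= tolerance * ref:
                members.append(path)
                break
        else:
            # No existing bucket is close enough, so start a new one.
            buckets.append((ratio, [path]))
    # A subset with a single member is no longer a duplicate set.
    return [members for _, members in buckets if len(members) > 1]


# Hypothetical dimensions (width, height) for one reported set.
sizes = {"a.jpg": (4000, 3000), "b.jpg": (1600, 1200), "c.jpg": (3000, 4000)}
print(partition_by_aspect(["a.jpg", "b.jpg", "c.jpg"], sizes))
```

Here the portrait image is dropped from the set because its aspect ratio does not match the two landscape images. The fingerprint comparison itself is left untouched.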

By the way, I am personally rather leery of fully automating any kind of destructive operations using findimagedupes since the likelihood of false positives is significant. I'd always do a manual sanity check to exclude obviously wrong results before doing anything irreversible.

jhnc avatar Sep 07 '24 04:09 jhnc

Thanks for the quick response @jhnc! Good point about the different library, this might indeed be the cause. OTOH I agree switching to the other one might not change things a lot.

I have some additional data. Today I ran geeqie's duplicate finder over my collection, and it found a significant number of duplicates. findimagedupes does not 😢

About half were thanks to its "Ignore orientation" feature which findimagedupes does not have. But the other half were indeed the same pictures in the same resolution, just with a different compression ratio. Here is an example.

$ findimagedupes -v fp -t 100% c.jpg d.jpg 
///////v/+f/v/////EAAMMAk4kEBwTHJAaAAgAAAAA=  .../c.jpg
///////v/+f/v/////EAAMMAk4kABwTHJAaAAgAAAAA=  .../d.jpg

And the pictures themselves:

c d

This is sad, because these are exactly the kind of duplicates I want to catch.

To say a bit more about my use case:

  • every few months I dump all pictures from my family's devices, de-dupe them (so far semi-manually with geeqie), split them into a <decade>/<year>/<month>/<day> directory structure (with jhead), and then go through the per-day directories, deleting the ones I don't like. Each such batch contains thousands of files.
  • one thing that has become quite annoying in recent years is that I encounter a lot of duplicates from two sources: pictures that both my wife and I receive from my kids' school, and pictures that one of us takes (high resolution and quality) and the other receives via an instant messaging app (lower resolution/quality). There are hundreds to thousands of such duplicates in each batch. I simply want to always get rid of all but the highest quality duplicates.
  • geeqie does a pretty good job at finding them, but it does not let me automatically select the lower quality ones for deletion.
  • I found findimagedupes's --program option perfect since I can easily encode some name-based heuristics in a Python script to automatically delete some classes of duplicates, and still have it display and ask for confirmation of the tricky cases.
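A minimal sketch of the kind of helper such a script might contain, assuming the program is handed the files of one duplicate set as its command-line arguments. File size is used here as a crude stand-in for "highest quality" in place of name-based heuristics. The names `choose_keeper` and `report` are hypothetical, and the sketch only prints what it would do instead of deleting anything.

```python
import os
from typing import Callable, List, Sequence, Tuple


def choose_keeper(
    paths: Sequence[str],
    size_of: Callable[[str], int] = os.path.getsize,
) -> Tuple[str, List[str]]:
    """Return (file to keep, files to drop), keeping the largest file.

    Assumption: a larger file means higher resolution or quality.
    Ties are broken by path so the result is deterministic.
    """
    ranked = sorted(paths, key=lambda p: (size_of(p), p), reverse=True)
    return ranked[0], ranked[1:]


def report(paths: Sequence[str], size_of: Callable[[str], int] = os.path.getsize) -> List[str]:
    """Build a dry-run report for one duplicate set without deleting anything."""
    if len(paths) < 2:
        return []
    keeper, rest = choose_keeper(paths, size_of)
    lines = [f"keep   {keeper}"]
    lines += [f"delete {p} (dry run)" for p in rest]
    return lines


# Example with made-up sizes in place of real files on disk.
fake_sizes = {"phone/img1.jpg": 3_500_000, "chat/img1.jpg": 240_000}
for line in report(list(fake_sizes), size_of=fake_sizes.get):
    print(line)
```

In a real script, `report` would be called with `sys.argv[1:]`, and the tricky cases would still be shown for manual confirmation before anything is removed.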

I'll try to dig more into how geeqie compares pictures and see if I can reproduce it using imagemagick...

porridge avatar Sep 07 '24 19:09 porridge

You have told findimagedupes not to allow even a single bit of difference between the fingerprints (-t 100%), but there is a 1-bit difference (A = 000000 and E = 000100), perhaps because compression artefacts affect the computation. You could try relaxing the threshold to allow somewhere between 1 and 5 bits (e.g. -t 2b) and see if that helps without increasing false positives too much.
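The bit difference can be checked by hand. The following is a minimal sketch, assuming the fingerprints printed by -v fp are plain base64 strings and that the comparison is a popcount of their XOR, as described for diffbits earlier in the thread. The function name `diff_bits` is illustrative and is not taken from the findimagedupes source.

```python
import base64


def diff_bits(fp_a: str, fp_b: str) -> int:
    """Count the bits that differ between two base64-encoded fingerprints."""
    raw_a = base64.b64decode(fp_a)
    raw_b = base64.b64decode(fp_b)
    if len(raw_a) != len(raw_b):
        raise ValueError("fingerprints have different lengths")
    # XOR leaves a 1 wherever the two fingerprints disagree; count those 1s.
    return sum(bin(x ^ y).count("1") for x, y in zip(raw_a, raw_b))


# Fingerprints of c.jpg and d.jpg as printed earlier in this thread.
fp_c = "///////v/+f/v/////EAAMMAk4kEBwTHJAaAAgAAAAA="
fp_d = "///////v/+f/v/////EAAMMAk4kABwTHJAaAAgAAAAA="

print(diff_bits(fp_c, fp_d))
```

The two strings differ in a single base64 character (E versus A), which corresponds to a single differing bit in the decoded fingerprints.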

jhnc avatar Sep 09 '24 14:09 jhnc

Unfortunately I've been unable to find a value that provides a reasonable balance between false positives and negatives. 😞 In case anyone is interested, my solution was to create image-duplicate-finder (https://github.com/porridge/image-duplicate-finder). It finds a.jpg and b.jpg to be only 85.45% similar, while c.jpg and d.jpg are 99.99% similar.

porridge avatar Sep 14 '24 09:09 porridge

I'm glad you found an algorithm that works for your images.

jhnc avatar Sep 14 '24 12:09 jhnc