duplicate-images icon indicating copy to clipboard operation
duplicate-images copied to clipboard

Added fuzzy image matching with pybktree and HEIF/HEIC support

Open bolshevik opened this issue 6 years ago • 4 comments

  • This adds a fuzzy matching of images based on Hamming distance
  • All dependencies updated
  • HEIF/HEIC support on MacOS and Linux

bolshevik avatar Jan 19 '19 18:01 bolshevik

@bolshevik what is the way to determine the threshold? I've tried 10 with 10 thousand images and only 250 duplicates came out. Whereas with the original version (no pybktree) there were 1200 duplicates found.

phanirithvij avatar Jan 22 '21 03:01 phanirithvij

@phanirithvij I believe that you are having many duplicates and actually you should see 250 groups of duplicates, but each group should contain more images unless there is a bug. Let me explain the idea: the images are hashed and represented by binary strings, with this fuzzy matching it is possible to find not only fully equal strings but also those having a number of bits flipped (10 in your case). At some point this will also lead to false positives (according to us humans, but not the math) but only if you take too many bits as a threshold. I was seeing up to 100 performing well (but again depends on the image set you have).

Take a look at this snippet:

import pybktree

distance = 0
input = [0, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 14]

tree = pybktree.BKTree(pybktree.hamming_distance, input)

for val in input:
    matches = tree.find(val, distance)
    print("{0:b}".format(val) + ": " + str(
        ["{1:b}".format(distance, matched) for distance, matched in matches])
    )

When the distance is 0, it behaves exactly the same as the original implementation, but much slower, therefore I recommend to clean up exact matches using the original flow first, and then continue with fuzzy:

0: ['0', '0']
0: ['0', '0']
1: ['1', '1']
1: ['1', '1']
10: ['10']
11: ['11']
100: ['100']
101: ['101']
110: ['110']
111: ['111']
1000: ['1000']
1001: ['1001']
1010: ['1010']
1011: ['1011']
1100: ['1100']
1101: ['1101']
1110: ['1110', '1110']
1111: ['1111']
1110: ['1110', '1110']

But if you increase the distance to 2:

0: ['0', '0', '1', '1', '10', '100', '1000', '11', '101', '1100', '110', '1010', '1001']
0: ['0', '0', '1', '1', '10', '100', '1000', '11', '101', '1100', '110', '1010', '1001']
1: ['1', '1', '0', '0', '11', '101', '1001', '111', '10', '1011', '100', '1101', '1000']
1: ['1', '1', '0', '0', '11', '101', '1001', '111', '10', '1011', '100', '1101', '1000']
10: ['10', '0', '0', '11', '110', '1010', '1', '111', '1', '1011', '100', '1000', '1110', '1110']
11: ['11', '1', '111', '1', '10', '1011', '0', '0', '1111', '101', '110', '1010', '1001']
100: ['100', '0', '0', '101', '1100', '110', '1', '111', '1', '10', '1101', '1000', '1110', '1110']
101: ['101', '1', '111', '1', '100', '1101', '0', '0', '11', '1111', '1100', '110', '1001']
110: ['110', '111', '10', '100', '1110', '1110', '0', '0', '11', '1111', '101', '1100', '1010']
111: ['111', '11', '1111', '101', '110', '1', '1', '10', '1011', '100', '1101', '1110', '1110']
1000: ['1000', '0', '0', '1100', '1010', '1001', '1', '1', '10', '1011', '100', '1101', '1110', '1110']
1001: ['1001', '1', '1', '1011', '1101', '1000', '0', '0', '11', '1111', '101', '1100', '1010']
1010: ['1010', '10', '1011', '1000', '1110', '1110', '0', '0', '11', '1111', '1100', '110', '1001']
1011: ['1011', '11', '1111', '1010', '1001', '1', '111', '1', '10', '1101', '1000', '1110', '1110']
1100: ['1100', '100', '1101', '1000', '1110', '1110', '0', '0', '1111', '101', '110', '1010', '1001']
1101: ['1101', '1111', '101', '1100', '1001', '1', '111', '1', '1011', '100', '1000', '1110', '1110']
1110: ['1110', '1110', '1111', '1100', '110', '1010', '111', '10', '1011', '100', '1101', '1000']
1111: ['1111', '111', '1011', '1101', '1110', '1110', '11', '101', '1100', '110', '1010', '1001']
1110: ['1110', '1110', '1111', '1100', '110', '1010', '111', '10', '1011', '100', '1101', '1000']

You can see that it grows to the right a lot, then I am actually eliminating duplicated groups by ensuring that all values are appearing only once:

0: {'1001', '1010', '0', '10', '11', '101', '1000', '100', '110', '1100', '1'}
111: {'1111', '1101', '1110', '10', '11', '111', '101', '100', '110', '1011', '1'}

There still might be some cross-references, mind "1" or "11" in each list. This is happening because 0 and 111 are < 2 distant to 1 and 11

I hope it explains the idea, if you think there is a bug somewhere, please let me know.

bolshevik avatar Jan 22 '21 07:01 bolshevik

Thanks for your reply. There were many images in the groups indeed.

I tried to test your pybktree implementation in my fork https://github.com/phanirithvij/duplicate-images because I couldn't directly run your fork on windows (magic module doesn't build on windows) so I had to make some additional changes.

But your pybktree implementation is, as it is copied.

Now there is a bug using zero threshold yields 0 duplicates

$ time py duplicate_finder.py find --threshold=0 --print
Started database...
Finding fuzzy duplicates, it might take a while... 0
100%[]
Number of duplicates: 0
Stopped database...

real    0m7.602s
user    0m0.000s
sys     0m0.016s

this is the original output.

$ time py duplicate_finder.py find --print
Number of duplicates: 1434
Stopped database...

real    0m7.625s
user    0m0.015s
sys     0m0.000s

PyBKTree takes roughly the same time but outputs nothing.

output for multiple thresholds for i in {0..15}; do time py duplicate_finder.py find --threshold=$i --print; done

expand
Rithvij@RithvijL  /d/Projects/fun/pythone/duplicate-images (master)
$ for i in {0..15}; do time py duplicate_finder.py find --threshold=$i --print; done
Started database...
Finding fuzzy duplicates, it might take a while... 0
100%Number of duplicates: 0
Stopped database...

real    0m8.096s
user    0m0.000s
sys     0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 1
100%Number of duplicates: 0
Stopped database...

real    0m8.055s
user    0m0.000s
sys     0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 2
100%Number of duplicates: 124
Stopped database...

real    0m14.636s
user    0m0.000s
sys     0m0.000s
Started database...
Finding fuzzy duplicates, it might take a while... 3
100%Number of duplicates: 124
Stopped database...

real    0m14.597s
user    0m0.000s
sys     0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 4
100%Number of duplicates: 209
Stopped database...

real    0m28.277s
user    0m0.000s
sys     0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 5
100%Number of duplicates: 209
Stopped database...

real    0m28.244s
user    0m0.000s
sys     0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 6
100%Number of duplicates: 241
Stopped database...

real    0m52.954s
user    0m0.000s
sys     0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 7
100%Number of duplicates: 241
Stopped database...

real    0m52.360s
user    0m0.015s
sys     0m0.000s
Started database...
Finding fuzzy duplicates, it might take a while... 8
100%Number of duplicates: 280
Stopped database...

real    2m33.524s
user    0m0.015s
sys     0m0.046s
Started database...
Finding fuzzy duplicates, it might take a while... 9
100%Number of duplicates: 280
Stopped database...

real    2m47.336s
user    0m0.031s
sys     0m0.016s
Started database...
Finding fuzzy duplicates, it might take a while... 10
100%Number of duplicates: 305
Stopped database...

real    4m7.267s
user    0m0.016s
sys     0m0.031s
Started database...
Finding fuzzy duplicates, it might take a while... 11
100%Number of duplicates: 305
Stopped database...

real    4m7.897s
user    0m0.000s
sys     0m0.061s
Started database...
Finding fuzzy duplicates, it might take a while... 12
100%Number of duplicates: 324
Stopped database...

real    5m53.380s
user    0m0.031s
sys     0m0.047s
Started database...
Finding fuzzy duplicates, it might take a while... 13
100%Number of duplicates: 324
Stopped database...

real    5m49.925s
user    0m0.031s
sys     0m0.016s
Started database...
Finding fuzzy duplicates, it might take a while... 14
100%Number of duplicates: 341
Stopped database...

real    7m47.925s
user    0m0.000s
sys     0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 15
100%Number of duplicates: 341
Stopped database...

real    7m47.621s
user    0m0.015s
sys     0m0.031s

Rithvij@RithvijL  /d/Projects/fun/pythone/duplicate-images (master)
$ time py duplicate_finder.py find --print
Started database...
Number of duplicates: 1434
Stopped database...

real    0m5.680s
user    0m0.000s
sys     0m0.046s

phanirithvij avatar Jan 22 '21 14:01 phanirithvij

@phanirithvij I think I have found the problem, could you please check https://github.com/bolshevik/duplicate-images/pull/1 . This should solve the wrong behavior when having many equal images.

bolshevik avatar Jan 22 '21 21:01 bolshevik