duplicate-images
Added fuzzy image matching with pybktree and HEIF/HEIC support
- This adds fuzzy matching of images based on Hamming distance
- All dependencies updated
- HEIF/HEIC support on macOS and Linux
@bolshevik how should the threshold be determined? I tried a threshold of 10 on 10 thousand images and only 250 duplicates came out, whereas the original version (without pybktree) found 1200 duplicates.
@phanirithvij I believe you do have many duplicates: you should be seeing 250 groups of duplicates, but each group should contain more images, unless there is a bug. Let me explain the idea: the images are hashed and represented by binary strings. With this fuzzy matching it is possible to find not only fully equal strings, but also strings with a number of bits flipped (10 in your case). At some point this also leads to false positives (according to us humans, though not to the math), but only if you take too many bits as a threshold. I have seen thresholds of up to 100 performing well (but again, it depends on the image set you have).
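For illustration, the Hamming distance between two integer hashes is just the number of bit positions in which they differ; a minimal pure-Python sketch (not the project's actual code):

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integers differ."""
    # XOR leaves a 1 exactly where the bits disagree; count those ones.
    return bin(a ^ b).count("1")

# 0b1010 and 0b1000 differ only in one bit position
print(hamming_distance(0b1010, 0b1000))  # 1
# identical hashes are at distance 0
print(hamming_distance(0b1111, 0b1111))  # 0
```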
Take a look at this snippet:
import pybktree

distance = 0
values = [0, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 14]
tree = pybktree.BKTree(pybktree.hamming_distance, values)
for val in values:
    matches = tree.find(val, distance)  # list of (distance, item) tuples
    print("{:b}".format(val) + ": "
          + str(["{:b}".format(matched) for _, matched in matches]))
When the distance is 0, it behaves exactly like the original implementation, only much slower, so I recommend cleaning up exact matches with the original flow first and then continuing with the fuzzy matching. The output at distance 0:
0: ['0', '0']
0: ['0', '0']
1: ['1', '1']
1: ['1', '1']
10: ['10']
11: ['11']
100: ['100']
101: ['101']
110: ['110']
111: ['111']
1000: ['1000']
1001: ['1001']
1010: ['1010']
1011: ['1011']
1100: ['1100']
1101: ['1101']
1110: ['1110', '1110']
1111: ['1111']
1110: ['1110', '1110']
But if you increase the distance to 2:
0: ['0', '0', '1', '1', '10', '100', '1000', '11', '101', '1100', '110', '1010', '1001']
0: ['0', '0', '1', '1', '10', '100', '1000', '11', '101', '1100', '110', '1010', '1001']
1: ['1', '1', '0', '0', '11', '101', '1001', '111', '10', '1011', '100', '1101', '1000']
1: ['1', '1', '0', '0', '11', '101', '1001', '111', '10', '1011', '100', '1101', '1000']
10: ['10', '0', '0', '11', '110', '1010', '1', '111', '1', '1011', '100', '1000', '1110', '1110']
11: ['11', '1', '111', '1', '10', '1011', '0', '0', '1111', '101', '110', '1010', '1001']
100: ['100', '0', '0', '101', '1100', '110', '1', '111', '1', '10', '1101', '1000', '1110', '1110']
101: ['101', '1', '111', '1', '100', '1101', '0', '0', '11', '1111', '1100', '110', '1001']
110: ['110', '111', '10', '100', '1110', '1110', '0', '0', '11', '1111', '101', '1100', '1010']
111: ['111', '11', '1111', '101', '110', '1', '1', '10', '1011', '100', '1101', '1110', '1110']
1000: ['1000', '0', '0', '1100', '1010', '1001', '1', '1', '10', '1011', '100', '1101', '1110', '1110']
1001: ['1001', '1', '1', '1011', '1101', '1000', '0', '0', '11', '1111', '101', '1100', '1010']
1010: ['1010', '10', '1011', '1000', '1110', '1110', '0', '0', '11', '1111', '1100', '110', '1001']
1011: ['1011', '11', '1111', '1010', '1001', '1', '111', '1', '10', '1101', '1000', '1110', '1110']
1100: ['1100', '100', '1101', '1000', '1110', '1110', '0', '0', '1111', '101', '110', '1010', '1001']
1101: ['1101', '1111', '101', '1100', '1001', '1', '111', '1', '1011', '100', '1000', '1110', '1110']
1110: ['1110', '1110', '1111', '1100', '110', '1010', '111', '10', '1011', '100', '1101', '1000']
1111: ['1111', '111', '1011', '1101', '1110', '1110', '11', '101', '1100', '110', '1010', '1001']
1110: ['1110', '1110', '1111', '1100', '110', '1010', '111', '10', '1011', '100', '1101', '1000']
You can see that the lists grow a lot to the right, so I then eliminate duplicated groups by ensuring that every value appears only once:
0: {'1001', '1010', '0', '10', '11', '101', '1000', '100', '110', '1100', '1'}
111: {'1111', '1101', '1110', '10', '11', '111', '101', '100', '110', '1011', '1'}
There might still be some cross-references; note "1" or "11" appearing in each list. This happens because 0 and 111 are each within distance 2 of both 1 and 11.
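The group de-duplication described above could be sketched with a set of already-seen values (hypothetical names, not the actual implementation):

```python
def dedupe_groups(groups):
    """Keep each value in at most one group; drop groups that end up
    with fewer than two fresh members (they are not new duplicates)."""
    seen = set()
    result = []
    for group in groups:
        fresh = [v for v in group if v not in seen]
        seen.update(fresh)
        if len(fresh) > 1:
            result.append(fresh)
    return result

# 1 and 10 already belong to the first group, so the second collapses away
groups = [[0, 1, 10], [1, 10, 11], [100, 101]]
print(dedupe_groups(groups))  # [[0, 1, 10], [100, 101]]
```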
I hope this explains the idea. If you think there is a bug somewhere, please let me know.
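The exact-then-fuzzy flow recommended above could be sketched as follows; this is a self-contained illustration, with a dict standing in for the original exact-match flow and a linear scan standing in for the BK-tree (on real data pybktree would replace that scan):

```python
from collections import defaultdict

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def find_duplicates(hashes, threshold):
    # Step 1: group exact matches cheaply with a dict (the original flow).
    exact = defaultdict(list)
    for idx, h in enumerate(hashes):
        exact[h].append(idx)
    # Step 2: fuzzy-match only the unique hashes, so the expensive
    # distance search runs on far fewer items.
    unique = list(exact)
    groups = []
    seen = set()
    for h in unique:
        if h in seen:
            continue
        close = [u for u in unique if hamming(h, u) <= threshold]
        members = [i for u in close for i in exact[u]]
        if len(members) > 1:
            groups.append(members)
            seen.update(close)
    return groups

# indices 0 and 1 are exact matches; index 2 is one bit away
hashes = [0b0000, 0b0000, 0b0001, 0b1111]
print(find_duplicates(hashes, 1))  # [[0, 1, 2]]
```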
Thanks for your reply. There were indeed many images in the groups.
I tried to test your pybktree implementation in my fork https://github.com/phanirithvij/duplicate-images because I couldn't run your fork directly on Windows (the magic module doesn't build there), so I had to make some additional changes.
The pybktree implementation itself is unchanged; it is copied as-is.
Now there is a bug: a threshold of zero yields 0 duplicates.
$ time py duplicate_finder.py find --threshold=0 --print
Started database...
Finding fuzzy duplicates, it might take a while... 0
100%[]
Number of duplicates: 0
Stopped database...
real 0m7.602s
user 0m0.000s
sys 0m0.016s
This is the output of the original implementation:
$ time py duplicate_finder.py find --print
Number of duplicates: 1434
Stopped database...
real 0m7.625s
user 0m0.015s
sys 0m0.000s
pybktree takes roughly the same time but finds nothing.
Output for multiple thresholds (for i in {0..15}; do time py duplicate_finder.py find --threshold=$i --print; done):
Rithvij@RithvijL /d/Projects/fun/pythone/duplicate-images (master)
$ for i in {0..15}; do time py duplicate_finder.py find --threshold=$i --print; done
Started database...
Finding fuzzy duplicates, it might take a while... 0
100%Number of duplicates: 0
Stopped database...
real 0m8.096s
user 0m0.000s
sys 0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 1
100%Number of duplicates: 0
Stopped database...
real 0m8.055s
user 0m0.000s
sys 0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 2
100%Number of duplicates: 124
Stopped database...
real 0m14.636s
user 0m0.000s
sys 0m0.000s
Started database...
Finding fuzzy duplicates, it might take a while... 3
100%Number of duplicates: 124
Stopped database...
real 0m14.597s
user 0m0.000s
sys 0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 4
100%Number of duplicates: 209
Stopped database...
real 0m28.277s
user 0m0.000s
sys 0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 5
100%Number of duplicates: 209
Stopped database...
real 0m28.244s
user 0m0.000s
sys 0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 6
100%Number of duplicates: 241
Stopped database...
real 0m52.954s
user 0m0.000s
sys 0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 7
100%Number of duplicates: 241
Stopped database...
real 0m52.360s
user 0m0.015s
sys 0m0.000s
Started database...
Finding fuzzy duplicates, it might take a while... 8
100%Number of duplicates: 280
Stopped database...
real 2m33.524s
user 0m0.015s
sys 0m0.046s
Started database...
Finding fuzzy duplicates, it might take a while... 9
100%Number of duplicates: 280
Stopped database...
real 2m47.336s
user 0m0.031s
sys 0m0.016s
Started database...
Finding fuzzy duplicates, it might take a while... 10
100%Number of duplicates: 305
Stopped database...
real 4m7.267s
user 0m0.016s
sys 0m0.031s
Started database...
Finding fuzzy duplicates, it might take a while... 11
100%Number of duplicates: 305
Stopped database...
real 4m7.897s
user 0m0.000s
sys 0m0.061s
Started database...
Finding fuzzy duplicates, it might take a while... 12
100%Number of duplicates: 324
Stopped database...
real 5m53.380s
user 0m0.031s
sys 0m0.047s
Started database...
Finding fuzzy duplicates, it might take a while... 13
100%Number of duplicates: 324
Stopped database...
real 5m49.925s
user 0m0.031s
sys 0m0.016s
Started database...
Finding fuzzy duplicates, it might take a while... 14
100%Number of duplicates: 341
Stopped database...
real 7m47.925s
user 0m0.000s
sys 0m0.015s
Started database...
Finding fuzzy duplicates, it might take a while... 15
100%Number of duplicates: 341
Stopped database...
real 7m47.621s
user 0m0.015s
sys 0m0.031s
Rithvij@RithvijL /d/Projects/fun/pythone/duplicate-images (master)
$ time py duplicate_finder.py find --print
Started database...
Number of duplicates: 1434
Stopped database...
real 0m5.680s
user 0m0.000s
sys 0m0.046s
@phanirithvij I think I have found the problem; could you please check https://github.com/bolshevik/duplicate-images/pull/1 ? This should fix the wrong behavior when there are many equal images.