dupeguru
Actually, it won't find duplicate files
I just found a duplicate file on my computer, in a folder I clearly remember scanning with DupeGuru before. I was surprised, so I scanned it again, but it said no duplicates were found. I compared the files with diffnow dot com to make sure they are actually the same, and it confirmed that they are. How come DupeGuru skipped them?
It's not a very common file type, though. It's a .jnlp file from the software "Sweet Home", which is used to design a room in 3D.
But I am pretty sure it should be detected by DupeGuru.
- OS: Windows 10
- Version 4.0.4
Thanks
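The byte-for-byte check described above can also be run locally instead of through a website. The following is a minimal sketch using only Python's standard library; the helper names `sha256_of` and `same_content` are made up for illustration and are not part of DupeGuru.

```python
import filecmp
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files hold exactly the same bytes."""
    # shallow=False forces a content comparison instead of trusting
    # the size and modification time reported by os.stat().
    return filecmp.cmp(path_a, path_b, shallow=False)
```

If `same_content` returns True for two files that a scan does not report, the cause is more likely a scan setting or filter than a difference in the files themselves.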
I have found many, while comparing images by content.
I currently have zero experience with DupeGuru, which I just learned about (from there), so it's a shot in the dark, but:
- Those files could be hard links to each other, and DupeGuru may be configured to ignore hard links.
- There may be a filter based on file attributes. For instance DoubleKiller, which is currently my go-to duplicate detector, has filters activated by default which prevent scanning files and folders with the “System” attribute. It is very streamlined yet very efficient, but it hasn't been updated in a long time and has some major caveats: it can't deal with Unicode characters and doesn't recognize hard-linked files. I also use AllDup, which doesn't have these caveats but has a quite cluttered, unwieldy interface.
- You stated that you “remember” scanning that folder with DupeGuru, but, generally speaking, “remembering” is not a reliable way of assessing how something works or fails to work. You would have to at least run a scan after you noticed a possible malfunction before you can ascertain that there is one to report, ensuring that you made no mistake and that you selected the proper settings to get the intended result.
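The hard-link hypothesis can be checked directly. This is a minimal sketch with Python's standard library; the function names are hypothetical, and it says nothing about how DupeGuru itself treats hard links.

```python
import os


def are_hard_links(path_a, path_b):
    """True if both paths refer to the same underlying file."""
    return os.path.samefile(path_a, path_b)


def link_count(path):
    """Number of directory entries that point at this file's data."""
    return os.stat(path).st_nlink
```

A link count above 1 means the file has at least one other name somewhere on the same volume, so two "duplicates" may in fact be a single file.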
I have the following sample - downloaded from some wallpaper sites:
Those are not recognized as duplicates by a picture content scan with an 80% threshold. Some others are, even though they differ considerably in size (like 1024 vs 1600 px on a side). Actually, I am not sure what the right parameters should be.
@msdobrescu thanks for the report. I tested with these two files you posted with Filter Hardness set at 70%, and I get results back with 83% match. Can you post which options you have enabled? Maybe make sure you have enabled "Match pictures of different dimensions" for example, and check your filtering rules.
Edit: indeed, with Filter Hardness set at 80% there is no result. This might need some investigation.
To note: the fact that these images are very bright overall (with few high-contrast areas) might have something to do with the very low match rate.
Application mode is Picture.
In general I've seen false matches caused by a statistically predominant colour, rather than missed matches.
Another interesting sample, same settings:
The differences are in the snow on the ground (I've used WinMerge to check) and in the colour tints.
Those are not matched either, one is a crop of another:
Anyway, I was rather hoping for an answer from the devs... Here are examples of two files that are the same, but that DupeGuru doesn't detect. Is there any difference? Sweet Home.zip
Oh, maybe I just figured it out...
It ignored files smaller than 10 KB.
I see it now!
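To see in advance which files such a size filter would leave out, something like the following can help. It is only a sketch: the 10 KB threshold is taken from the comment above and is assumed here to mean 10 * 1024 bytes, and `split_by_size` is a made-up helper, not a DupeGuru function.

```python
import os

# Assumption: "10 KB" means 10 * 1024 bytes.
SMALL_FILE_THRESHOLD = 10 * 1024


def split_by_size(paths, threshold=SMALL_FILE_THRESHOLD):
    """Return (scanned, skipped) lists based on file size in bytes."""
    scanned, skipped = [], []
    for path in paths:
        if os.path.getsize(path) < threshold:
            skipped.append(path)  # too small, the filter would ignore it
        else:
            scanned.append(path)
    return scanned, skipped
```

Small text-based files such as .jnlp descriptors would typically end up in the `skipped` list.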
Another pair:
Another pair:
Differences are minimal for those:
I would have expected to find those (small crop):
I've been using it for a few days now and was very happy until I ran into this missed match: one pair out of four missed matches, out of a total of 8 JPG files in the same directory (no need to post them all).
filename: EMEAM005.jpg
filename: EMEAM0051.jpg
Debug log:
2021-03-29 09:46:17,441 - DEBUG - Collected 8 files in folder /home/jedaa/Pictures/OFH/testdupes
2021-03-29 09:46:17,460 - DEBUG - Collected 8 files in folder /home/jedaa/Pictures/OFH/testdupes
2021-03-29 09:46:17,460 - INFO - Scanning 8 files
2021-03-29 09:46:17,461 - INFO - Getting matches. Scan type: 10
2021-03-29 09:46:17,462 - DEBUG - Analyzing picture at /home/jedaa/Pictures/OFH/testdupes/EMEAM004.jpg
2021-03-29 09:46:17,462 - DEBUG - Analyzing picture at /home/jedaa/Pictures/OFH/testdupes/EMEAM001.jpg
2021-03-29 09:46:17,462 - DEBUG - Analyzing picture at /home/jedaa/Pictures/OFH/testdupes/EMEAM002.jpg
2021-03-29 09:46:17,462 - DEBUG - Analyzing picture at /home/jedaa/Pictures/OFH/testdupes/EMEAM006.jpg
2021-03-29 09:46:17,462 - DEBUG - Analyzing picture at /home/jedaa/Pictures/OFH/testdupes/EMEAM0051.jpg
2021-03-29 09:46:17,462 - DEBUG - Analyzing picture at /home/jedaa/Pictures/OFH/testdupes/EMEAM0031.jpg
2021-03-29 09:46:17,463 - DEBUG - Analyzing picture at /home/jedaa/Pictures/OFH/testdupes/EMEAM003.jpg
2021-03-29 09:46:17,463 - DEBUG - Analyzing picture at /home/jedaa/Pictures/OFH/testdupes/EMEAM005.jpg
2021-03-29 09:46:17,485 - INFO - Creating 8 chunks with a chunk size of 100 for 8 pictures
2021-03-29 09:46:17,588 - INFO - Found 0 matches
2021-03-29 09:46:17,588 - INFO - Grouping matches
2021-03-29 09:46:17,588 - INFO - Created 0 groups
It is all about confidence! If it has missed these 4 (as it turns out), it raises the question of how many others there are. If I need to go and check visually, it sort of defeats the purpose :(
-rwxrwxrwx 1 jedaa users 36K May 8 2009 EMEAM001.jpg
-rwxrwxrwx 1 jedaa users 39K May 8 2009 EMEAM002.jpg
-rwxrwxrwx 1 jedaa users 22K May 8 2009 EMEAM0031.jpg
-rwxrwxrwx 1 jedaa users 23K May 8 2009 EMEAM003.jpg
-rwxrwxrwx 1 jedaa users 29K May 8 2009 EMEAM004.jpg
-rwxrwxrwx 1 jedaa users 80K May 8 2009 EMEAM0051.jpg
-rwxrwxrwx 1 jedaa users 78K May 8 2009 EMEAM005.jpg
-rwxrwxrwx 1 jedaa users 31K May 8 2009 EMEAM006.jpg
I am at a loss as to where to go from here ... perhaps somebody can suggest config settings of some sort?
Thanks :)
It would be interesting to know how other duplicate detectors behave with these specific files. Have you tried with https://github.com/qarmin/czkawka ?
Yes, I had already tested them with imgdupes prior to creating this comment. I will also look at the qarmin utility (czkawka).
testdupes $ imgdupes /home/jedaa/Pictures/OFH/testdupes/ phash 4
Building NGT index (dimension=64, num_proc=3)
Approximate neighbor searching using NGT
100%|██████████| 8/8 [00:00<00:00, 28411.88it/s]
/home/jedaa/Pictures/OFH/testdupes/EMEAM002.jpg
/home/jedaa/Pictures/OFH/testdupes/EMEAM001.jpg
.
/home/jedaa/Pictures/OFH/testdupes/EMEAM003.jpg
/home/jedaa/Pictures/OFH/testdupes/EMEAM0031.jpg
.
/home/jedaa/Pictures/OFH/testdupes/EMEAM004.jpg
/home/jedaa/Pictures/OFH/testdupes/EMEAM006.jpg
.
/home/jedaa/Pictures/OFH/testdupes/EMEAM0051.jpg
/home/jedaa/Pictures/OFH/testdupes/EMEAM005.jpg
.
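For readers unfamiliar with hash-based matching, here is a rough sketch of the idea behind a run like the one above, on the assumption that the trailing `4` is the maximum Hamming distance allowed between two 64-bit image hashes. The sketch uses a simple average hash over a grayscale pixel grid rather than the DCT-based phash named in the command, and all function names are hypothetical.

```python
def average_hash(gray_pixels):
    """Hash a small grayscale grid: one bit per pixel, set if above the mean."""
    flat = [value for row in gray_pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which the two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def is_near_duplicate(hash_a, hash_b, max_distance=4):
    """Treat two images as duplicates if their hashes differ in few bits."""
    return hamming_distance(hash_a, hash_b) <= max_distance
```

Because each bit only records whether a pixel is above the image's own mean, a uniformly brightened copy produces the same hash, which is one reason hash-based tools can pair files that a different comparison method misses.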
I have noticed the same problem - dupeguru missing similar duplicates of files.
I have been going through a family archive of ca. 10000 images and managed to weed out most duplicates, with dupeguru doing a GREAT job.
After some months I'm now almost finished (down to around 5000 images), and it is becoming clearer that certain images are "never" matched despite being very similar. In fact they are physical photos from the SAME negatives, with slight differences in their reproduction at the time (the centers of the photos are slightly different, the brightness differs, etc., all judgements made when the photos were being produced) before they were scanned. It also seems to especially affect particular kinds of Kodak photos from the 1980s, with perhaps a limited colour range (I am not an expert).
Here is an example:
The algorithm in dupeguru does not detect a match - even at a filter hardness of 20...
But there are algorithms that do work. I found that imagededup (https://github.com/idealo/imagededup) with a CNN (Convolutional Neural Network) works very well. See the report below:
My question is: could CNN be added (perhaps as an option) to dupeguru?
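As a rough illustration of what a CNN-based approach compares: each image is reduced to a feature vector, and pairs are scored by how closely the vectors point in the same direction. The sketch below assumes the vectors already exist and uses cosine similarity with a 0.9 cut-off. Both are assumptions made for illustration, the vectors in the usage example are invented, and none of this is imagededup's or dupeguru's actual code.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar_pairs(embeddings, min_similarity=0.9):
    """embeddings maps a file name to its vector; returns (name_a, name_b, score)."""
    names = sorted(embeddings)
    pairs = []
    for index, name_a in enumerate(names):
        for name_b in names[index + 1:]:
            score = cosine_similarity(embeddings[name_a], embeddings[name_b])
            if score >= min_similarity:
                pairs.append((name_a, name_b, score))
    return pairs
```

For example, `find_similar_pairs({"a.jpg": [1, 0, 0], "b.jpg": [2, 0, 0.1], "c.jpg": [0, 1, 0]})` reports only the pair `a.jpg` and `b.jpg`, since their vectors are nearly parallel even though their magnitudes differ.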
Quick note: of all the pictures posted in this thread, only the photos posted by @griadooss did not show up in the results with a filter hardness of 1 (minimum). All the other files were successfully detected by Dupeguru with match percentages between 35 and 83.
Ideally the match percentage should probably be higher for all of these.
With 10000 images (as I mentioned), lowering match percentages as you suggest just leads to 1000s of false matches. I have now moved to CNN for all of my dedupe work.
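The scale of the problem is easy to estimate, since every image can in principle be compared with every other one. In the sketch below, the rate of one false match per 10,000 compared pairs is purely hypothetical, chosen only to show the order of magnitude.

```python
def candidate_pairs(image_count):
    """Number of distinct image pairs: n * (n - 1) / 2."""
    return image_count * (image_count - 1) // 2


def expected_false_matches(image_count, pairs_per_false_match):
    """Expected false matches if one in `pairs_per_false_match` pairs is wrong."""
    return candidate_pairs(image_count) / pairs_per_false_match


print(candidate_pairs(10_000))                 # 49995000
print(expected_false_matches(10_000, 10_000))  # 4999.5
```

So even a very small false-match rate yields thousands of spurious results on a collection of this size, which matches the experience described above.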