
Actually, it won't find duplicate files

Open pollux666 opened this issue 4 years ago • 19 comments

I just found a duplicate file on my computer, in a folder I clearly remember scanning with DupeGuru before. I was surprised, so I scanned it again, but it reported no duplicates. I compared the two files with diffnow dot com to make sure they are actually the same, and it confirmed that they are. How come DupeGuru skipped them?

It's not a very common file type, though: it's a .jnlp file from the software "Sweet Home", which is used to design rooms in 3D.

But I am pretty sure it should be detected by DupeGuru.

  • OS: Windows 10
  • Version 4.0.4

Thanks

pollux666 avatar Dec 31 '20 11:12 pollux666

I have found many, while comparing images by content.

msdobrescu avatar Jan 05 '21 07:01 msdobrescu

I currently have zero experience with DupeGuru, which I just learned about (from there), so this is a shot in the dark, but:

  • Those files could be hard links to each other, and DupeGuru may be configured to ignore hard links.
  • There may be a filter based on file attributes. For instance, DoubleKiller, which is currently my go-to duplicate detector, has filters enabled by default that prevent scanning files and folders with the “System” attribute. It is very streamlined yet very efficient, but it hasn't been updated in a long time and has some major caveats: it can't deal with Unicode characters and doesn't recognize hard-linked files. I also use AllDup, which doesn't have these caveats but has a rather cluttered, unwieldy interface.
  • You stated that you “remember” scanning that folder with DupeGuru, but, generally speaking, “remembering” is not a reliable way of assessing how something works or fails to work. You would have to at least run a scan after you noticed a possible malfunction before you can ascertain that there is one to report, making sure that you made no mistake and that you selected the proper settings to get the intended result.
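On the hard-link possibility raised above, one way to check two suspect files by hand is to compare their device and inode numbers. The following is a minimal sketch using only the Python standard library; the helper name `are_hard_linked` is made up for illustration, and it says nothing about how DupeGuru itself treats hard links.

```python
import os


def are_hard_linked(path_a: str, path_b: str) -> bool:
    """Return True when both paths point at the same underlying file.

    Two directory entries are hard links to one file when they share
    the same device number and the same inode number.
    """
    stat_a = os.stat(path_a)
    stat_b = os.stat(path_b)
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)
```

If this returns True for the two files, they are one file with two names rather than two copies, and a tool set to ignore hard links would be expected to skip them.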

abolibibelot1980 avatar Jan 22 '21 22:01 abolibibelot1980

I have the following sample - downloaded from some wallpaper sites:

Aero Wallpaper Pack 15-04 Winter wallpapers 38

Those are not recognized as duplicates by a picture content scan with an 80% threshold. Some others are, even though they differ considerably in size (like 1024 vs 1600 px on a side). Actually, I am not sure what the right settings should be.

msdobrescu avatar Feb 07 '21 20:02 msdobrescu

@msdobrescu thanks for the report. I tested with these two files you posted with Filter Hardness set at 70%, and I get results back with 83% match. Can you post which options you have enabled? Maybe make sure you have enabled "Match pictures of different dimensions" for example, and check your filtering rules.

Edit: indeed with Filter Hardness set at 80% there is no result. Might need some investigations.

To note: the fact that these images are very bright overall (less contrasted areas) might have something to do with the very low match rate.

glubsy avatar Feb 08 '21 16:02 glubsy

image Application mode is Picture.

I've generally seen matches due to statistically predominant colour rather than no matching.

msdobrescu avatar Feb 08 '21 16:02 msdobrescu

Another interesting sample, same settings:

snow-tree-1600 Change of Season, Sommer-Linde

The differences are in the snow on the ground (I've used WinMerge to check) and in the colour tints.

msdobrescu avatar Feb 14 '21 14:02 msdobrescu

Those are not matched either, one is a crop of another: Landscape (13) AG-PhotoCollection-163 (6)

msdobrescu avatar Feb 14 '21 14:02 msdobrescu

Anyway, I was kind of hoping for an answer from the devs... Here are examples of two files that are the same but that DupeGuru doesn't detect. Is there any difference? Sweet Home.zip

pollux666 avatar Feb 14 '21 15:02 pollux666

Oh, maybe I just figured it out...

It ignored files smaller than 10 KB.

I see it now!
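To make the effect of such a size filter concrete, here is a minimal sketch of a content-based duplicate scan that skips small files. It is not DupeGuru's code: the function name, the SHA-256 hashing, and the exact 10 KB cutoff are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Assumed cutoff for illustration, mirroring the 10 KB setting mentioned above.
MIN_SIZE_BYTES = 10 * 1024


def find_content_duplicates(folder, min_size=MIN_SIZE_BYTES):
    """Group files under `folder` by content hash, skipping small files."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        if path.stat().st_size < min_size:
            # Files below the cutoff are never hashed, so they can never match.
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path)
    # Only groups with at least two files are duplicates.
    return [paths for paths in groups.values() if len(paths) > 1]
```

With the default cutoff, two identical files of a few hundred bytes produce no result at all; calling `find_content_duplicates(folder, min_size=0)` on the same folder reports them as one group.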

pollux666 avatar Feb 14 '21 15:02 pollux666

Another pair: 4_hi-res_winter-wallpapers_011 Airena wallapack 148 (24)

msdobrescu avatar Feb 14 '21 16:02 msdobrescu

Another pair: Airena wallapack 78 (16) Amazing Desktop Wallpapers 66 27

msdobrescu avatar Feb 20 '21 08:02 msdobrescu

Differences are minimal for those: Amazing Desktop Wallpapers 66 49 Airena wallapack 130 (18)

msdobrescu avatar Feb 20 '21 09:02 msdobrescu

I would have expected to find those (small crop):

AG-PhotoCollection-151 (10) Amazing Waterscapes Wallpapers (164)

msdobrescu avatar Mar 20 '21 08:03 msdobrescu

I've been using it for a few days now and was very happy until I ran into these missed matches. Here is one pair out of four missed pairs, from a total of 8 JPG files in the same directory (no need to post them all).

EMEAM005 filename: EMEAM005.jpg

EMEAM0051 filename: EMEAM0051.jpg

Debug log:

2021-03-29 09:46:17,441 - DEBUG - Collected 8 files in folder /home/jedaa/Pictures/OFH/testdupes
2021-03-29 09:46:17,460 - DEBUG - Collected 8 files in folder /home/jedaa/Pictures/OFH/testdupes
2021-03-29 09:46:17,460 - INFO - Scanning 8 files
2021-03-29 09:46:17,461 - INFO - Getting matches. Scan type: 10
2021-03-29 09:46:17,462 - DEBUG - Analyzing picture at /home/jedaa/Pictures/OFH/testdupes/EMEAM004.jpg
2021-03-29 09:46:17,462 - DEBUG - Analyzing picture at /home/jedaa/Pictures/OFH/testdupes/EMEAM001.jpg
2021-03-29 09:46:17,462 - DEBUG - Analyzing picture at /home/jedaa/Pictures/OFH/testdupes/EMEAM002.jpg
2021-03-29 09:46:17,462 - DEBUG - Analyzing picture at /home/jedaa/Pictures/OFH/testdupes/EMEAM006.jpg
2021-03-29 09:46:17,462 - DEBUG - Analyzing picture at /home/jedaa/Pictures/OFH/testdupes/EMEAM0051.jpg
2021-03-29 09:46:17,462 - DEBUG - Analyzing picture at /home/jedaa/Pictures/OFH/testdupes/EMEAM0031.jpg
2021-03-29 09:46:17,463 - DEBUG - Analyzing picture at /home/jedaa/Pictures/OFH/testdupes/EMEAM003.jpg
2021-03-29 09:46:17,463 - DEBUG - Analyzing picture at /home/jedaa/Pictures/OFH/testdupes/EMEAM005.jpg
2021-03-29 09:46:17,485 - INFO - Creating 8 chunks with a chunk size of 100 for 8 pictures
2021-03-29 09:46:17,588 - INFO - Found 0 matches
2021-03-29 09:46:17,588 - INFO - Grouping matches
2021-03-29 09:46:17,588 - INFO - Created 0 groups

It is all about confidence! If it has missed these 4 pairs (as it turns out), it raises the question of how many others there are... If I need to check visually, it sort of defeats the purpose :(

-rwxrwxrwx 1 jedaa users 36K May 8 2009 EMEAM001.jpg
-rwxrwxrwx 1 jedaa users 39K May 8 2009 EMEAM002.jpg
-rwxrwxrwx 1 jedaa users 22K May 8 2009 EMEAM0031.jpg
-rwxrwxrwx 1 jedaa users 23K May 8 2009 EMEAM003.jpg
-rwxrwxrwx 1 jedaa users 29K May 8 2009 EMEAM004.jpg
-rwxrwxrwx 1 jedaa users 80K May 8 2009 EMEAM0051.jpg
-rwxrwxrwx 1 jedaa users 78K May 8 2009 EMEAM005.jpg
-rwxrwxrwx 1 jedaa users 31K May 8 2009 EMEAM006.jpg

dupeguru01

I am at a loss as to where to go from here... Perhaps somebody can suggest config settings of some sort?

Thanks :)

griadooss avatar Mar 28 '21 22:03 griadooss

It would be interesting to know how other duplicate detectors behave with these specific files. Have you tried with https://github.com/qarmin/czkawka ?

glubsy avatar Mar 28 '21 23:03 glubsy

Yes, I had already tested them with imgdupes before posting this comment. I will also look at the qarmin utility.

testdupes $ imgdupes /home/jedaa/Pictures/OFH/testdupes/ phash 4
Building NGT index (dimension=64, num_proc=3)
Approximate neighbor searching using NGT
100%|██████████| 8/8 [00:00<00:00, 28411.88it/s]
/home/jedaa/Pictures/OFH/testdupes/EMEAM002.jpg
/home/jedaa/Pictures/OFH/testdupes/EMEAM001.jpg
.
/home/jedaa/Pictures/OFH/testdupes/EMEAM003.jpg
/home/jedaa/Pictures/OFH/testdupes/EMEAM0031.jpg
.
/home/jedaa/Pictures/OFH/testdupes/EMEAM004.jpg
/home/jedaa/Pictures/OFH/testdupes/EMEAM006.jpg
.
/home/jedaa/Pictures/OFH/testdupes/EMEAM0051.jpg
/home/jedaa/Pictures/OFH/testdupes/EMEAM005.jpg
.
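For context, the `phash 4` arguments in the command above presumably select a perceptual hash and a maximum Hamming distance of 4 between hashes, and the `dimension=64` in the output suggests 64-bit hashes. The sketch below only shows that comparison step on integers standing in for hashes; the function names are hypothetical, and computing a real perceptual hash from an image is not covered.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bit positions in which two integer hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def is_near_duplicate(hash_a: int, hash_b: int, max_distance: int = 4) -> bool:
    """Treat two hashes as a match when they differ in at most `max_distance` bits."""
    return hamming_distance(hash_a, hash_b) <= max_distance


# Made-up 64-bit values for illustration, not hashes of the files in this thread.
base = 0xF0F0F0F0F0F0F0F0
slightly_different = base ^ 0b111  # 3 bits flipped
quite_different = base ^ 0xFF      # 8 bits flipped
```

Here `is_near_duplicate(base, slightly_different)` holds while `is_near_duplicate(base, quite_different)` does not, which is the kind of tolerance that lets a hash-based tool pair up images that are not byte-identical.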

griadooss avatar Mar 29 '21 01:03 griadooss

I have noticed the same problem - dupeguru missing similar duplicates of files.

I have been going through a family archive of ca. 10000 images and managed to weed out most duplicates, with dupeguru doing a GREAT job.

After some months I'm now almost finished (down to around 5000 images), and it is becoming clearer that certain images are "never" matched despite being very similar. In fact they are physical photos printed from the SAME negatives, with slight differences in their reproduction at the time (the centers of the photos are slightly different, the brightness is different, etc., all judgements made when the photos were being produced) before they were scanned. It also seems to especially affect particular kinds of Kodak photos from the 1980s, with perhaps a limited colour range (I am not an expert).

Here is an example:

Version (1)

Version (2)

The algorithm in dupeguru does not detect a match - even at a filter hardness of 20...

But there are algorithms that do work. I found that imagededup (https://github.com/idealo/imagededup) with a CNN (Convolutional Neural Network) works very well - see the report below:

imagededup report

My question is: could CNN be added (perhaps as an option) to dupeguru?
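As a rough idea of what a CNN-based comparison involves: each image is turned into a feature vector by the network, and two images count as duplicates when their vectors point in nearly the same direction. The sketch below shows only that last step with cosine similarity on plain lists. The function names and the 0.9 threshold are assumptions for illustration, and the vectors would come from a network in practice.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def is_similar(vec_a, vec_b, min_similarity=0.9):
    """Assumed decision rule: match when similarity reaches the threshold."""
    return cosine_similarity(vec_a, vec_b) >= min_similarity
```

Because cosine similarity ignores vector length, a uniformly brighter or darker print of the same photo can still score high if its features scale together, which may be part of why this approach copes with reproduction differences.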

james-cook avatar May 07 '21 09:05 james-cook

Quick note: of all the pictures posted in this thread, only the photos posted by @griadooss did not show up in the results with a filter hardness of 1 (minimum). All the other files were successfully detected by Dupeguru with match percentages between 35 and 83.

Ideally the match percentage should probably be higher for all of these.

glubsy avatar Jun 22 '21 22:06 glubsy

> Quick note: of all the pictures posted in this thread, only the photos posted by @griadooss did not show up in the results with a filter hardness of 1 (minimum). All the other files were successfully detected by Dupeguru with match percentages between 35 and 83.
>
> Ideally the match percentage should probably be higher for all of these.

With 10000 images (as I mentioned), lowering match percentages as you suggest just leads to 1000s of false matches. I have now moved to CNN for all of my dedupe work.

james-cook avatar Jun 23 '21 16:06 james-cook