Picture mode: RGB threshold mode
I'd like a reliable mode for picture deduplication. With digital images it often comes down to very subtle differences, and any tool that tries to simplify each image by downscaling it or producing some kind of hash inevitably produces false positives. I see no difference between 1% and 99% tolerance where such an option is available. I'm pretty sure the current 15x15 grid is outdated, but even if it could be made user-defined, it still wouldn't be reliable when it comes to very small differences.
I'm talking about differences in RGB values of individual pixels no bigger than (1,1,1) or (2,2,2). Consider two high-resolution images where every pixel is either identical or differs by one RGB unit, like (0,1,0), (-1,0,-1), etc. These two pictures will look practically identical to the eye, yet at the "100" setting the current dupeguru fails to identify them as duplicates. Then, if there is a derivative image with a region (a group of up to, say, ~50000 pixels) of distinctly different colors, the "99" setting erroneously reports it as a duplicate too. An RGB threshold mode would take care of this situation: set to (1,1,1), it would correctly mark the near-identical pair as duplicates and ignore the third image.
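To illustrate, here is a minimal sketch of what such a mode could compute (Pillow + NumPy; the function name and signature are hypothetical, not existing dupeguru code):

```python
import numpy as np
from PIL import Image

def is_rgb_threshold_dupe(path_a, path_b, threshold=(1, 1, 1)):
    """True only if both images have the same resolution and every pixel
    differs by at most the given per-channel threshold."""
    a = Image.open(path_a).convert("RGB")
    b = Image.open(path_b).convert("RGB")
    if a.size != b.size:
        return False
    # int16 avoids uint8 wraparound when subtracting pixel values
    diff = np.abs(np.asarray(a, dtype=np.int16) - np.asarray(b, dtype=np.int16))
    return bool((diff <= np.array(threshold)).all())
```

With threshold=(1, 1, 1) this would accept the near-identical pair above and reject the third image with the visibly different region.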
Not sure if I understand what you're trying to suggest.
Also, I'm curious where you got the 15x15 grid from? I had a quick look at the algorithm a few weeks back, and if my memory serves right, it is mostly based on colour averages. The image is indeed divided into chunks, but it's not the typical perceptual diff you'd expect. Not the most efficient algorithm either, if you ask me; it definitely needs improvement.
I was contemplating improving the whole thing, maybe by pulling in projects like imagedup or pHash, but that will require some work.
I guess it's better to just demonstrate the issue. Here are 3 files: https://mega.nz/folder/RioRgSbB#bml9QJNcEhwcWTR5R3Fb1Q With current dupeguru, you can't make it detect only one dupe here: even at "filter hardness" = 99 it sees the third image as a dupe, even though it's distinctly different to the eye.
Actually, I think the current implementations of settings like "filter hardness" / "threshold" / "tolerance", where the user sets a value between 0 and 100, are bad and misleading. Without a proper understanding of the underlying algorithms, the user can't really know what would best suit their needs: "I want to remove photos that look 50% the same", "I want to remove digital drafts that have less than 10% new lines", or "I want to remove pictures that differ in less than 20% of pixels". The implementations in every single GUI deduplication app I've tried simply cannot satisfy any of these requirements. In almost all cases, a user who is careful about accidentally deleting things they need will have to review the results by hand, and with big collections of images that doesn't save a lot of time.
What I propose covers the case of "I want to remove duplicates of digital images that look exactly the same to my eye, and to my eye, colors that differ by less than (5,5,5) in RGB look exactly the same". If I can set such a threshold and know that the app will compare every pixel of the images, it's a requirement that is easy to understand (and implement), and the result will fully match my expectations and be 100% accurate.
By RGB range I mean the usual (0,0,0)~(255,255,255) color range, and by a (5,5,5) threshold I mean that each pair of corresponding pixels in two images of the same resolution is allowed to differ by any value between (-5,-5,-5) and (5,5,5) for those two images to be identified as duplicates.
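Written as a condition on two images A and B of the same width W and height H, with t = 5 in this example, the rule is simply:

$$\max_{0 \le x < W,\ 0 \le y < H,\ c \in \{R,G,B\}} \left| A(x,y,c) - B(x,y,c) \right| \le t$$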
> Also, I'm curious where you got the 15x15 grid from?

From here: https://github.com/arsenetar/dupeguru/blob/master/help/en/scan.rst#picture-blocks
I wrote a script that does what I need, and it works pretty well after some optimizations. I'll describe the logic below in case it's useful for future implementations.
- We have folder1 (reference) and folder2 (junk with suspected dupes to sort out). Dupes inside an individual folder are not processed by my script.
- Read all images from both folder1 and folder2 and collect the following data into an in-memory chunk for fast analysis (a code sketch of the whole pipeline follows this list):
- filepath
- resolution
- RGB values of N^2 equidistant pixels from each image (N=5 for me), which effectively gives us the equivalent of an NxN (5x5 for me) thumbnail for each image, but without doing any downscaling work.
- The data chunk we get is very compact, <5 MB for a folder with thousands of images. We can save it to a file and load it later without reading all the images again, because the initial reading of all the images is the longest part.
- We now have 2 chunks that we can quickly compare (each against each). For each pair, if the resolution is exactly the same and all N^2 pixels match within the RGB threshold (I chose (9,9,9)), we get a suspected original + duplicate pair.
- For each suspected original + duplicate pair, we compare pixels with a step P, by which I mean we check every P-th pixel (again against the same RGB threshold). Making P=1 would mean checking every single pixel in all cases, and making P=20 would mean checking a number of pixels equal to 100%/20 = 5% of the image resolution. By doing this we:
- accelerate the detailed comparison by a factor of P.
- assume that for each original image, any variation of it we care about has a visible RGB difference over an area bigger than PxP pixels. We don't expect the reference collection to contain variations we care about whose differences are so small that they slip through the PxP sampling grid.
- Mark dupes accordingly. That's all.
- Increasing N will increase the size of the data chunks but decrease false positives in the initial quick comparison.
- Increasing P will increase the speed of the direct comparison but decrease accuracy. For high-resolution images it is reasonable to increase it and still expect accurate results. It could also be viable to calculate P per image based on its resolution, forcing the direct comparison to process the same number of pixels regardless of resolution, which would make the speed consistent, though probably a bit wasteful for lower-resolution images.
- Understanding the RGB threshold is important. Initially I thought (2,2,2) would be enough to filter out most of the dupes, but with PNGs in folder1 and JPEGs in folder2 it turns out JPEG colors can differ quite widely depending on the quality preset used. I ended up using (9,9,9), which worked well for me.
- Also, the RGB threshold could be processed differently. My script just compares each color channel, and if at least one differs by more than 9, it's not a dupe. There may be some merit in combining the differences across all channels and checking the sum against a single numeric threshold. Some kind of statistical study of how much the RGB difference is affected by the JPEG quality preset could help with such implementations.
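A condensed sketch of this two-stage logic could look like the following (Pillow + NumPy; function names, constants and folder handling are illustrative and simplified, not my exact script or anything in dupeguru):

```python
from pathlib import Path

import numpy as np
from PIL import Image

N = 5            # fingerprint grid: N x N sampled pixels per image
P = 20           # step for the detailed pass: check every P-th pixel
THRESHOLD = 9    # maximum allowed per-channel difference

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}

def fingerprint(path):
    """Resolution plus an N x N grid of equidistant pixels (no downscaling)."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    px = np.asarray(img)                      # shape (h, w, 3)
    ys = np.linspace(0, h - 1, N, dtype=int)
    xs = np.linspace(0, w - 1, N, dtype=int)
    return (w, h), px[np.ix_(ys, xs)]         # ((w, h), N x N x 3 samples)

def within_threshold(a, b, t=THRESHOLD):
    """True if every compared channel value differs by at most t."""
    # int16 avoids uint8 wraparound when subtracting pixel values
    return int(np.abs(a.astype(np.int16) - b.astype(np.int16)).max()) <= t

def detailed_match(path_a, path_b, step=P, t=THRESHOLD):
    """Confirm a candidate pair by checking every step-th pixel of the full images."""
    a = np.asarray(Image.open(path_a).convert("RGB"))
    b = np.asarray(Image.open(path_b).convert("RGB"))
    if a.shape != b.shape:
        return False
    # Step through the flattened pixel list, so ~1/step of all pixels are checked
    # and this pass runs roughly step times faster than a full comparison.
    return within_threshold(a.reshape(-1, 3)[::step], b.reshape(-1, 3)[::step], t)

def scan_folder(folder):
    """Build the compact data chunk: (path, resolution, samples) per image."""
    return [(p, *fingerprint(p))
            for p in sorted(Path(folder).iterdir())
            if p.suffix.lower() in IMAGE_SUFFIXES]

def find_dupes(folder1, folder2):
    refs = scan_folder(folder1)   # reference images
    junk = scan_folder(folder2)   # suspected dupes
    for ref_path, ref_res, ref_fp in refs:
        for junk_path, junk_res, junk_fp in junk:
            # Quick pass: same resolution and all N*N sampled pixels within threshold.
            if ref_res == junk_res and within_threshold(ref_fp, junk_fp):
                # Slow pass: strided comparison of the full images.
                if detailed_match(ref_path, junk_path):
                    yield ref_path, junk_path

if __name__ == "__main__":
    for original, duplicate in find_dupes("folder1", "folder2"):
        print(f"{duplicate} looks like a duplicate of {original}")
```

The list of (path, resolution, samples) tuples returned by scan_folder is the compact data chunk mentioned above; it could be pickled or saved with numpy to avoid re-reading all the images on later runs.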