Duplicate/similar image checking

Open IntellectualPotato opened this issue 10 months ago • 2 comments

Is your feature request related to a problem? Please describe. I worry that, as time goes on, I could end up with duplicate uploads. It would be nice to be able to check for similar images too, in case any parent/child links need to be made, or in case there is a lower quality version of an image that could be deleted. I would love to avoid duplicates or lower quality dupe posts when they are unneeded! 👀

Describe the solution you'd like. I would like some built-in way to check for duplicate files, and even similar files, or at least similar images. This is one of the most notable things on my mind.

A tool I used before trying to move images to shimmie2 was "czkawka" (or "czkawka gui"), which works perfectly. Sadly, for similar image checking it doesn't seem to read the extension-less files that shimmie2 uses to store uploads, but maybe it's a good thing to keep in mind, and maybe it can be used/implemented somehow?
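One possible workaround for the extension-less files, sketched below in Python, is to sniff each file's type from its leading "magic" bytes and symlink it into a scratch directory under a name that carries an extension, so that an external tool can pick it up. This is only an illustration: the function names `sniff_extension` and `link_with_extensions` and both directory arguments are hypothetical, only a few formats are covered, and whether czkawka follows symlinks has not been checked here.

```python
from pathlib import Path
from typing import Optional

# Leading "magic" bytes for a few common image formats (not exhaustive).
MAGIC = [
    (b"\xff\xd8\xff", "jpg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
]


def sniff_extension(header: bytes) -> Optional[str]:
    """Guess an extension from the first bytes of a file, or None if unknown."""
    for magic, ext in MAGIC:
        if header.startswith(magic):
            return ext
    # WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    return None


def link_with_extensions(src_dir: Path, dst_dir: Path) -> int:
    """Symlink recognised images from src_dir into dst_dir as <name>.<ext>.

    Both directories are hypothetical; point src_dir at wherever the
    extension-less files live. Returns the number of links created.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    linked = 0
    for path in src_dir.rglob("*"):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            ext = sniff_extension(fh.read(12))
        if ext is None:
            continue  # not an image type this sketch knows about
        target = dst_dir / f"{path.name}.{ext}"
        if not (target.is_symlink() or target.exists()):
            target.symlink_to(path.resolve())
            linked += 1
    return linked
```

The originals are never modified; the scratch directory can be deleted once the scan is done.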

That's all, thank you! 😄

IntellectualPotato avatar Feb 08 '25 09:02 IntellectualPotato

I actually have made something for this, using OpenAI's CLIP to find similar images. It has:

  • 'reverse image search', to find visually similar images on the site.
  • automated duplicate detection on the upload page (only works with my custom theme atm).
  • a recently added 'descriptive text search', as an alternative to tag-based search, but much less accurate.

You can find the code here: https://github.com/Mjokfox/shimmie2/tree/Fork/ext/reverse_image

It uses Python for CLIP, which runs as a separate program in the background, listening by default on port 10017. It works quite well for finding duplicates, but I have to say it's very inefficient, given that users now basically upload their image twice: once to check whether it's a duplicate, and then again for the actual upload. Annoyingly, I've also found that people don't pay much attention to it, but that might be more of a frontend issue.
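To illustrate the general idea (not necessarily how the linked extension does it): CLIP turns each image into an embedding vector, and embeddings are commonly compared with cosine similarity. The sketch below assumes the embeddings already exist and uses toy 3-dimensional vectors; the names `cosine_similarity` and `find_near_duplicates` and the `0.95` threshold are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def find_near_duplicates(query, index, threshold=0.95):
    """Return (post_id, score) pairs scoring at or above threshold, best first.

    `index` maps post id -> embedding. The threshold is an arbitrary
    illustrative value and would need tuning on real data.
    """
    scored = ((pid, cosine_similarity(query, vec)) for pid, vec in index.items())
    hits = [(pid, score) for pid, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy embeddings standing in for real CLIP vectors.
index = {
    1: [1.0, 0.0, 0.0],
    2: [0.0, 1.0, 0.0],
    3: [0.99, 0.05, 0.0],
}
print(find_near_duplicates([1.0, 0.0, 0.0], index))
```

A linear scan like this is fine for small sites; larger collections usually call for an approximate nearest-neighbour index instead.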

Is this of any interest? I could improve it so it can work with the default theme if so.

If you want to, you can try the reverse image search on my site, and the text-based search as well, of course. To see the upload duplicate detection you would need to make an account first.

Mjokfox avatar Feb 08 '25 18:02 Mjokfox

For reference, Hydrus uses a phash for duplicate detection:

If you are interested, the current version of this system uses a 64-bit phash to represent the image shape and a VPTree to search different files' phashes' relative hamming distance. I expect to extend it in future with multiple phash generation (flips, rotations, and 'interesting' image crops and video frames) and most-common colour comparisons.
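As a rough sketch of the search side of that description (this is not Hydrus's code): treat each 64-bit phash as an integer, measure similarity as the Hamming distance between two hashes, and use a vantage-point tree to find all stored hashes within some distance of a query. Computing the phash itself is left out, and the names `hamming`, `VPNode`, `build` and `search` are invented for the example.

```python
import random


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two integer hashes differ."""
    return bin(a ^ b).count("1")


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # vantage point (a hash)
        self.radius = radius    # median distance from point to its subtree
        self.inside = inside    # subtree of hashes with distance <= radius
        self.outside = outside  # subtree of hashes with distance > radius


def build(points, rng=None):
    """Build a VP-tree from an iterable of integer hashes."""
    rng = rng or random.Random(0)
    points = list(points)
    if not points:
        return None
    vp = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vp, 0, None, None)
    dists = [hamming(vp, p) for p in points]
    radius = sorted(dists)[len(dists) // 2]
    inside = [p for p, d in zip(points, dists) if d <= radius]
    outside = [p for p, d in zip(points, dists) if d > radius]
    return VPNode(vp, radius, build(inside, rng), build(outside, rng))


def search(node, query, max_distance, out=None):
    """Collect (hash, distance) for every hash within max_distance of query."""
    if out is None:
        out = []
    if node is None:
        return out
    d = hamming(query, node.point)
    if d <= max_distance:
        out.append((node.point, d))
    # Triangle inequality lets whole subtrees be skipped:
    # inside points are at least d - radius away from the query,
    # outside points are more than radius - d away.
    if d - max_distance <= node.radius:
        search(node.inside, query, max_distance, out)
    if d + max_distance > node.radius:
        search(node.outside, query, max_distance, out)
    return out


# Example: find hashes within 4 bits of a query among random 64-bit values.
rng = random.Random(1)
hashes = [rng.getrandbits(64) for _ in range(200)]
near = hashes[0] ^ 0b101  # differs from hashes[0] in exactly 2 bits
tree = build(hashes + [near])
matches = search(tree, hashes[0], 4)
```

The maximum distance that counts as "similar" is a tuning choice; small values catch re-encodes and resizes, larger ones start matching unrelated images.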

discomrade avatar Feb 09 '25 21:02 discomrade