example: "simple" dedupe usage

Open chapmanjacobd opened this issue 2 years ago • 3 comments

After indexing 10 million images, there are so many options that I'd just like to pick something relatively sane/conservative and apply that selection rather than manually go through each duplicate.

Is there an easy way to select/nuke all duplicates after indexing with -i.algos 1 -update, preserving only the copy in each group with the highest resolution or compressionRatio?

chapmanjacobd avatar Jul 19 '23 02:07 chapmanjacobd

Automatic deletion of potentially mismatched results (there can always be some, even at low thresholds) hasn't been a consideration yet. When I have a lot of deletions to make, I turn on the difference image (Z) and zip through them.

As for your use case, my first thought was that you could sort the result groups, look through them to make sure your idea is sane, then use -first -nuke to take out the worst/lowest one, and repeat this until none remain.

However, sorting result groups is not implemented; they are always sorted by score. This is simple to add, but for now it means you can't try this.

There is a problem with this idea (besides the potential to delete false matches), which is the metric used to select the "best" duplicate. For example, because of up-scaling, a higher-resolution file might look worse. Or, because of a sharpen filter, a file with lower compression might be worse. Or maybe you don't care and either is fine for the application (e.g. ML training).

I have an experimental "quality score" metric to try to solve this. You have to press "Q" in the browser to compute it, and then it shows in the lower right of the info box. If I could prove this was reasonable on a large set, maybe we could add it as a property to do what you have suggested.

scrubbbbs avatar Jul 19 '23 11:07 scrubbbbs

Okay so to remove exact duplicates this seemed to work:

cbird -dups -select-result -sort-rev resolution -chop -nuke

And for similar images this seemed reasonable:

cbird -p.dht 1 -similar -select-result -sort-rev resolution -chop -nuke
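
To make the selection logic concrete, here is a minimal Python sketch of how I read those two commands: sort each result group by resolution in descending order, drop the first (best) item from the selection, and treat whatever remains as the deletion list. This is not cbird code. The `Image` class, the `pick_deletions` helper, and the definition of resolution as width times height are all assumptions made for illustration, and the reading of `-chop` as "remove the first item of each group from the selection" is my interpretation.

```python
from dataclasses import dataclass


@dataclass
class Image:
    """Hypothetical stand-in for one indexed file (not a cbird type)."""
    path: str
    width: int
    height: int

    @property
    def resolution(self) -> int:
        # Assumption: resolution means pixel count; cbird may define it differently.
        return self.width * self.height


def pick_deletions(groups):
    """Return the paths to delete, keeping the highest-resolution file per group."""
    to_delete = []
    for group in groups:
        # Descending sort; sorted() is stable, so ties keep their original order.
        ranked = sorted(group, key=lambda im: im.resolution, reverse=True)
        # Skip the first (best) item and mark the rest for deletion.
        to_delete.extend(im.path for im in ranked[1:])
    return to_delete


groups = [
    [Image("a_small.jpg", 640, 480), Image("a_large.jpg", 1920, 1080)],
    [Image("b.png", 800, 600), Image("b_copy.png", 800, 600)],
]
deletions = pick_deletions(groups)
print(deletions)  # ['a_small.jpg', 'b_copy.png']
```

Note that on a tie (the second group) the survivor is simply whichever file came first, so resolution alone does not always decide which copy is kept.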

~The default sort, score, is good enough for my use. not sure how much better "quality score" would be... In the images that I looked at quality score always was higher on the left-most copy.~

Thanks !

btw. It would be nice if something like

cbird -p.dht 0.5

would be possible? I'm assuming it is a limitation of the algorithm--but it would be nice to be able to be a bit more granular

chapmanjacobd avatar Jul 22 '23 07:07 chapmanjacobd

Hey, I'm glad you found a solution, thanks for following up.

I think you got lucky on the quality. When using -similar, the first item is the needle/query image. The needle selection is uncontrolled: it's just the first one that appeared when scanning for matches, so at best there is a weak ordering from when it was indexed.

As for the DCT hash, the distance function is integer-valued, so that isn't an option. Granularity could be immediately improved by using a wider hash (currently 64 bits).
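
A small Python sketch of why a fractional threshold has no effect, assuming the distance is a Hamming distance (the number of differing bits) between two 64-bit hashes. That choice of metric is an assumption on my part, since the thread only says the distance is an integer, and `hamming_distance` and `distance_levels` are hypothetical helpers rather than cbird functions.

```python
HASH_BITS = 64  # hash width mentioned in the thread
MASK = (1 << HASH_BITS) - 1


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two hashes (assumed metric)."""
    return bin((a ^ b) & MASK).count("1")


def distance_levels(bits: int) -> int:
    """Number of distinct distance values a hash of this width can produce."""
    return bits + 1  # 0, 1, ..., bits


h1 = 0xF0F0F0F0F0F0F0F0
h2 = h1 ^ 0b101  # flip two bits

d = hamming_distance(h1, h2)
print(d)  # 2

# The distance is a whole number, so a threshold of 0.5 admits exactly
# the same pairs as a threshold of 0: only identical hashes pass.
print((d <= 0.5) == (d <= 0))  # True

# A wider hash gives more distinct distance values, i.e. finer steps.
print(distance_levels(64), distance_levels(256))  # 65 257
```

Under this assumption, each one-bit step on a 64-bit hash is 1/64 of the full range, while a 256-bit hash would give steps of 1/256.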

scrubbbbs avatar Jul 22 '23 15:07 scrubbbbs