
Bulk Remove "Exact Duplicates"

Open KamillaPup opened this issue 3 years ago • 4 comments

Everything about Hydrus is amazing and I love the duplicate finder, but one downside is that it takes a while to go through duplicates when you have several thousand. While it's nice to see how the images differ, I wish there were a way to bulk remove all exact duplicates, move the tags over to one of them, keep a deletion record, and keep the smallest file (since any extra size would just be metadata bloat).

I know you can choose to cycle through just exact duplicates, but it isn't reliable, as it sometimes shows non-exact pairs; maybe I'm doing that part wrong. Regardless, it's not a bulk removal.

KamillaPup avatar Feb 21 '22 06:02 KamillaPup

Until this feature gets added, is there any DIY automation a user who knows programming can apply here, short of contributing a full-blown PR?

Anvoker avatar Oct 25 '23 17:10 Anvoker

https://hydrusnetwork.github.io/hydrus/developer_api.html#managing_file_relationships
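For reference, a rough sketch of what driving those endpoints from Python could look like. The endpoint names, parameter names, response keys, and relationship values below are my reading of that documentation page, so verify them against your client's version before letting this touch real files; the access key is a placeholder.

```python
# Untested sketch against the Client API's "managing file relationships" endpoints.
# Endpoint names, parameters, response keys, and the relationship enum
# (4 = "A is better", as I read the docs) should all be checked against the docs.
import json
import requests

API = "http://127.0.0.1:45869"                        # default Client API address
HEADERS = {"Hydrus-Client-API-Access-Key": "YOUR_ACCESS_KEY"}

def get_pixel_dupe_pairs():
    # pixel_duplicates=0 should mean "pairs must be pixel duplicates".
    r = requests.get(
        f"{API}/manage_file_relationships/get_potential_pairs",
        headers=HEADERS,
        params={"pixel_duplicates": 0, "max_hamming_distance": 0},
    )
    r.raise_for_status()
    return r.json()["potential_duplicate_pairs"]      # list of [hash_a, hash_b]

def file_sizes(hashes):
    r = requests.get(
        f"{API}/get_files/file_metadata",
        headers=HEADERS,
        params={"hashes": json.dumps(hashes)},
    )
    r.raise_for_status()
    return {m["hash"]: m["size"] for m in r.json()["metadata"]}

def keep_smaller(hash_a, hash_b, sizes):
    # Mark the smaller file as the better duplicate, merge tags per the client's
    # duplicate merge options, and send the bigger one to the trash.
    a, b = (hash_a, hash_b) if sizes[hash_a] <= sizes[hash_b] else (hash_b, hash_a)
    payload = {"relationships": [{
        "hash_a": a,
        "hash_b": b,
        "relationship": 4,                            # "A is better"
        "do_default_content_merge": True,
        "delete_b": True,
    }]}
    r = requests.post(
        f"{API}/manage_file_relationships/set_file_relationships",
        headers=HEADERS,
        json=payload,
    )
    r.raise_for_status()

if __name__ == "__main__":
    for hash_a, hash_b in get_pixel_dupe_pairs():
        sizes = file_sizes([hash_a, hash_b])
        keep_smaller(hash_a, hash_b, sizes)
```

Restricting the search to pixel duplicates and letting `do_default_content_merge` apply your existing duplicate merge options keeps this close to the "exact duplicates" case the issue asks about.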

Zweibach avatar Oct 25 '23 21:10 Zweibach

For however little it's worth: in my opinion, the smallest image is not always the best one.

There are some factors to consider besides file size, such as how many PTR users have a certain copy of a file and have thus tagged it; if you always prefer the smallest file, then the tags will all collect at that convergence point, but people who don't have that smallest version of the file miss out on the tags. This is suboptimal both from a UX standpoint and a "don't crap up the PTR" standpoint.

To avoid a wall of text I've collapsed the points behind dropdowns, but tl;dr: metadata is useful a lot of the time, yet just preferring files that have metadata isn't good enough, so we need either a set of god-tier heuristics everyone can agree upon, or something highly customizable, maybe even scriptable.

First let's consider a common case where metadata gets stripped but has value and should be preserved.
Let's say an artist who some Hydrus users follow posts an image on a paywall site. Those users get it from their subscription, tag it, and push the tags to the PTR. Meanwhile, some random schmuck on the internet downloads it, puts it through some meat grinder that strips metadata, then uploads it to a booru. (Hell, maybe the booru is said meat grinder.) Other Hydrus users take the image from the booru instead. This booru version is smaller, yes, but maybe it's lost its ICC profile, embedded thumbnail, or other useful metadata that helps the file render correctly or quickly in various circumstances. Is it superior? I'd say no. I certainly don't filter my dupes that way: I prefer to keep the "original image" whenever possible.
Now let's consider the not-infrequent-enough case of artists reuploading their work.
It gets more complicated when you consider the various kinds of meat grinders images get put through and how people tend to reupload their own images. I have seen hundreds of examples of an artist reposting their old work, but instead of reuploading the exact same file, they open up their old project file (PSD or whatever) and re-export it, and worse, this creates a file that isn't even a pixel dupe of the last one. A helpful way to figure out what happened is to look at the metadata their editor likely added showing when the file was last revised (you can't always rely on mtimes, even when they are extrapolated from parsers, since people repost the same content on different sites). Now, let's say you download this re-exported version from various sources over time: the artist's own page(s), a booru, maybe somewhere else. All of them are pixel dupes of each other, but only one has the original metadata, and some have garbage metadata tacked on by whatever weird meat grinder the booru schmucks put the file through. Again, which file is best? In this case, it's even more certainly the "original" file out of the batch, because that's the one whose metadata helps you figure out how it compares to the file from before the artist re-exported it. (This has happened to me on multiple occasions with multiple artists! Thank goodness Photoshop puts some revision-history metadata in exported files!)
And yet you can't just go with the heuristic of "having metadata is good, not having it is bad," nor "more metadata is good," because there are instances where two files both have metadata, but one's metadata was inserted by some software or host that was part of the meat grinder.
I sometimes like to use boorus to outsource the tagging of my original files, letting dupe filtering take care of merging the tags, and I have seen files with metadata indicating they were at some point uploaded to some weird pirate site or put into some weird image-management software, both of which tack on fields that advertise themselves; on rare occasions, the original files _won't_ have metadata, meaning the version that went through some pirate site would get treated as better if the presence of metadata were preferred over the size difference. This also applies to situations where both have the same metadata except for that tacked-on field. And it goes the other way, too: sometimes the booru version will have most of the metadata but will be missing just a couple of metadata fields for some reason.

There are plenty of other situations where making a decision gets muddy:

  • It can be a coin flip whether the publicly posted copy of an image from a larger paywalled set is better than the copy from behind the paywall.
    Artists will make a CG set and post just 2 or 3 out of, like, 12+ variants on their pixiv, along with some filler image that says "more on fanbox" or whatever. Then they post the full set on their fanbox, including the few images that were made public. The public images and their fanbox counterparts might be pixel dupes, and both may be missing metadata, and yet somehow I find they rarely have the same hash. I don't know why this happens, but at any rate, more people on the internet are gonna get the publicly posted versions, so those might have more useful tags on them.
  • Some sources seem to like to invent metadata fields if they're missing from the original file.
    A very common thing I see is booru uploads of something I know I have the original version of (because I got it from the paywalled site a month earlier) having resolution-unit fields in their metadata that aren't in the original, usually inches. Are these fields useful? If they're made up and not part of the original work, I'd say no, but in general such fields could be useful for print, so you can't just say certain metadata fields are always useless. To complicate things further, in a lot of these cases the original file will have some fields that the reupload doesn't, so neither set is a subset of the other, which makes any heuristic based on subsets or set differences tricky.
  • There are pixel dupes that have the same exact filesize but a different hash.
    I know you can say "these are same-quality duplicates," but who honestly keeps both versions? There's usually metadata indicating which one was made first; this is again one of those "artist posts a CG" things, where they post versions with and without text, but the with-text set includes an image that doesn't happen to have any text on it anyway, and yet they export that image twice, creating two images with different metadata fields that take up the same number of bytes. Maybe this feature would just stop when it runs into such cases, but wouldn't we all rather decide that one is better than the other?
I think the only clear-cut cases I ever see are pixel-dupe JPEG/PNG pairs, PNGs with different compression levels, and JPEGs that are identical except for their encoding.
In these cases, I think most of us would agree that the smaller filesize is, in fact, better.
A PNG that's just a pixel dupe of a JPEG, stored in PNG format for no reason, is nothing but a waste of disk space.
Same with less-compressed PNGs; nobody's CPU is so slow anymore that uncompressed PNGs are preferable.
And as for JPEG encodings, the Progressive encoding is always smaller, as far as I can recall. This usually happens when an artist exports a Baseline-encoded image and posts it to both Twitter and Pixiv; Twitter seems to re-encode with Progressive encoding and somehow not degrade the quality, a rare instance of Twitter doing something good. Since the artist posted the same image in two places at roughly the same time and there's no real authoritative "original" version with more useful metadata or anything like that, we can treat the smaller image as better.
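To make that concrete, the narrow rule I have in mind could look roughly like this: plain Python over two hypothetical metadata records (the field names are invented for illustration), returning a keeper only in the clear-cut cases above and abstaining on everything else.

```python
# Illustrative only: pick a keeper solely in the clear-cut cases described above.
# `a` and `b` are hypothetical dicts with "mime" and "size" fields, and the pair is
# assumed to already be a confirmed pixel-for-pixel duplicate.
def clear_cut_keeper(a, b):
    jpeg, png = "image/jpeg", "image/png"

    # JPEG/PNG pixel-dupe pair: the PNG is pure waste, so keep the JPEG.
    if {a["mime"], b["mime"]} == {jpeg, png}:
        return a if a["mime"] == jpeg else b

    # Same format, same pixels (PNG compression levels, baseline vs. progressive
    # JPEG): keep whichever file is smaller.
    if a["mime"] == b["mime"] and a["size"] != b["size"]:
        return a if a["size"] < b["size"] else b

    # Anything else is muddy: abstain and leave it for the normal dupe filter.
    return None
```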

As I see it, there are a few paths forward if you want this feature:

  1. Build up some complex system of heuristics, potentially with weights users can customize, that automatically manages the decisions... and that system will inevitably produce suboptimal outcomes in many cases, because it's hard to get such a thing right, especially when "right" is subjective (a toy sketch of what this could look like follows this list);
  2. Make it scriptable, kind of like how parsers work, with canned operations to build your own decision tree(s) from. Then again, at that point, why not just wait for some intrepid third party to script this up with the API, as asked about above?
  3. Reduce the scope to only cases where it's so clear-cut and obvious that we can (mostly) all agree that the decisions it'd make are correct, and then handle the rest with normal dupe processing... this is probably the route I see Hydev going down;
  4. Just go with smallest size and deal with all the consequences I talked about here, and more I didn't even think of.
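For a feel of what options 1 and 2 might look like in miniature, here's a toy scoring sketch; every field name, weight, and threshold is invented for illustration, and tuning them well is exactly the hard part.

```python
# Toy sketch of a weighted-heuristics picker (options 1/2 above). All field names,
# weights, and the margin are made up; a real version would need to be customizable.
WEIGHTS = {
    "from_original_source": 5.0,   # e.g. downloaded from the artist's own page
    "has_icc_profile": 1.0,
    "has_editor_metadata": 2.0,    # revision history etc. left by the artist's editor
    "has_junk_metadata": -2.0,     # fields tacked on by some meat grinder
    "kilobytes": -0.01,            # a weak preference for smaller files
}

def score(file_info, weights=WEIGHTS):
    # Sum weight * value over whatever signals we have for this file.
    return sum(w * float(file_info.get(k, 0)) for k, w in weights.items())

def pick(a, b, margin=1.0):
    sa, sb = score(a), score(b)
    if abs(sa - sb) < margin:
        return None                # too close to call: punt to the human filter
    return a if sa > sb else b
```

The margin fallback is the important design choice here: anything the weights can't separate decisively goes back to the normal dupe filter instead of being decided automatically.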

roachcord3 avatar Jun 28 '24 23:06 roachcord3