feat(server): near-duplicate detection

Open mertalev opened this issue 11 months ago • 20 comments

Description

This PR adds a new job to detect duplicate assets and aggregate them with a new duplicateId column. This PR only implements the backend for duplicate detection. It does not expose the results in the UI or take any actions relating to the assets: this is left for future work.

The data model is such that each (duplicateId, assetId) pair uniquely identifies a duplicate asset and each duplicateId can have many associated assets.
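As a rough illustration of that data model (not code from this PR), the sketch below groups rows by duplicateId. The `DuplicateRow` shape and the `groupByDuplicateId` helper are hypothetical names chosen for the example.

```typescript
// Hypothetical row shape: one row per asset that belongs to a duplicate group.
interface DuplicateRow {
  duplicateId: string;
  assetId: string;
}

// Map each duplicateId to the list of asset IDs that share it.
function groupByDuplicateId(rows: DuplicateRow[]): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const row of rows) {
    let members = groups[row.duplicateId];
    if (members === undefined) {
      members = [];
      groups[row.duplicateId] = members;
    }
    // Each (duplicateId, assetId) pair is meant to be unique, so skip repeats.
    if (members.indexOf(row.assetId) === -1) {
      members.push(row.assetId);
    }
  }
  return groups;
}

const exampleGroups = groupByDuplicateId([
  { duplicateId: 'd1', assetId: 'a1' },
  { duplicateId: 'd2', assetId: 'a3' },
  { duplicateId: 'd1', assetId: 'a2' },
]);
console.log(exampleGroups); // d1 holds a1 and a2, d2 holds a3
```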

To do:

  • [x] Handle edge case where multiple duplicateIds exist among the found duplicates
  • [x] Better handling of concurrency
    • Disabled concurrency to avoid race conditions and improve accuracy
  • [x] Confirm correctness of the results
  • [x] Tune default threshold
  • [x] Add migration
  • [x] Add tests
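
One way to picture the first item (several duplicateIds among the found duplicates) is a merge plan that reuses one existing ID and folds the others into it. This is a minimal sketch under that assumption, not the PR's actual implementation; `DuplicateMatch`, `DuplicateMergePlan`, `planDuplicateMerge`, and the `generateId` callback are all hypothetical.

```typescript
// Hypothetical match shape: duplicateId is null when the asset is not grouped yet.
interface DuplicateMatch {
  assetId: string;
  duplicateId: string | null;
}

interface DuplicateMergePlan {
  targetDuplicateId: string; // the ID every involved asset should end up with
  assetIdsToAssign: string[]; // assets that currently have no duplicateId
  duplicateIdsToMerge: string[]; // other groups whose members must be moved to the target
}

function planDuplicateMerge(
  newAssetId: string,
  matches: DuplicateMatch[],
  generateId: () => string,
): DuplicateMergePlan | null {
  if (matches.length === 0) {
    return null; // nothing to group
  }
  // Collect the distinct existing IDs in match order.
  const existingIds: string[] = [];
  for (const match of matches) {
    if (match.duplicateId !== null && existingIds.indexOf(match.duplicateId) === -1) {
      existingIds.push(match.duplicateId);
    }
  }
  // Reuse the first existing ID if there is one; whatever remains must be merged into it.
  const reused = existingIds.shift();
  const targetDuplicateId = reused !== undefined ? reused : generateId();
  const assetIdsToAssign = [newAssetId];
  for (const match of matches) {
    if (match.duplicateId === null) {
      assetIdsToAssign.push(match.assetId);
    }
  }
  return { targetDuplicateId, assetIdsToAssign, duplicateIdsToMerge: existingIds };
}

const examplePlan = planDuplicateMerge(
  'new-asset',
  [
    { assetId: 'a', duplicateId: 'd1' },
    { assetId: 'b', duplicateId: null },
    { assetId: 'c', duplicateId: 'd2' },
  ],
  () => 'generated-id',
);
console.log(examplePlan);
```

Merging by duplicateId rather than by matched asset matters here: a group being folded in may contain assets that were not among the matches, and they need the new ID too.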

Implements #1968

How Has This Been Tested?

Tested by running the new job on all assets through the job panel and inspecting logs to confirm that some assets have duplicates.

Tested that the duplicates displayed in the web view are actually near-duplicates.

Tested that changing the duplicate threshold changes the strictness of the results.

mertalev avatar Mar 23 '24 23:03 mertalev

Deploying immich with Cloudflare Pages

Latest commit: c364bc8
Status: ✅  Deploy successful!
Preview URL: https://34e77b94.immich.pages.dev
Branch Preview URL: https://feat-duplicate-detection.immich.pages.dev

Can I help with anything regarding this PR? I am happy to work on the UI.

alextran1502 avatar Apr 28 '24 00:04 alextran1502

I cleaned it up so the backend part is essentially good to go (might need to adjust the response if you want them to be grouped by duplicates and not just sorted). The UI... has a lot of room for improvement haha. It'd be great if you could help with that 😄

mertalev avatar Apr 28 '24 00:04 mertalev

In terms of UI, would it make sense if photo stacks were automatically created for near-duplicate photos? It's something that the App-Which-Must-Not-Be-Named introduced a while ago and I personally find it very useful.

klejejs avatar Apr 28 '24 21:04 klejejs

The idea right now is to have the duplicates displayed in a dedicated page where there are options to convert them to stacks or deduplicate based on some criteria, but otherwise treat them as separate assets until the user elects to do this. Auto-stacking would be very useful and convenient, though, so a later PR to add this functionality would be nice.

mertalev avatar Apr 28 '24 21:04 mertalev

> Auto-stacking would be very useful and convenient, though, so a later PR to add this functionality would be nice.

Is it possible to have some kind of notification so that the user can interact with stacking? It's nice when images are stacked automatically, but I find that this sometimes occurs erroneously and I'd like to at least know when a stack has been made.

AngelaDMerkel avatar May 02 '24 12:05 AngelaDMerkel

A notification in the web UI would be straightforward. But auto-stacking would be a later addition, so discussion on that is a bit out of scope for this PR.

mertalev avatar May 02 '24 15:05 mertalev

Thanks for working on this! It's awesome to read through the implementation here.

Just wanted to add my results: the 0.02 threshold (barely) didn't detect duplicates in my resized image test case. I added some console logs and set the max distance to 1.0, and saw 0.02084 as the distance 😭🤦‍♂️

Is it worth perhaps working on a multi-algorithm implementation? Experimentation shows pHash excels at resizing. I can look for some time to help add this - please do let me know.
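
For readers unfamiliar with perceptual hashing, here is a minimal sketch of the general idea. It uses an average hash, a much simpler relative of pHash (which additionally applies a DCT), so it is not pHash itself. It assumes the image has already been decoded, converted to grayscale, and downscaled to a small fixed grid by some image library; that step is not shown, and `averageHash` and `hammingDistance` are hypothetical helper names.

```typescript
// Build a bit string: '1' where a pixel is at least as bright as the mean, else '0'.
function averageHash(grayPixels: number[]): string {
  if (grayPixels.length === 0) {
    throw new Error('need at least one pixel');
  }
  let sum = 0;
  for (const pixel of grayPixels) {
    sum += pixel;
  }
  const mean = sum / grayPixels.length;
  let bits = '';
  for (const pixel of grayPixels) {
    bits += pixel >= mean ? '1' : '0';
  }
  return bits;
}

// Count the positions at which two equal-length hashes differ.
function hammingDistance(a: string, b: string): number {
  if (a.length !== b.length) {
    throw new Error('hashes must have the same length');
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a.charAt(i) !== b.charAt(i)) {
      distance++;
    }
  }
  return distance;
}

// Toy 2x2 grids: a resize shifts pixel values slightly but keeps the bright/dark pattern.
const originalHash = averageHash([10, 200, 10, 200]);
const resizedHash = averageHash([12, 198, 11, 205]);
console.log(hammingDistance(originalHash, resizedHash)); // 0
```

A small Hamming distance between two hashes would then mark the pair as likely duplicates, with the cutoff being another tunable threshold.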

"duplicateId": "79b47ba0-28e5-479b-b745-9f2885299077",
immich_microservices     |       "assetId": "c8c4d6e9-8d72-4e55-be22-20b9019ffca6",
immich_microservices     |       "distance": 0.020842433

I was testing resized images here (download to reproduce):

images-dupe-test.zip

async searchDuplicates({
    assetId,
    embedding,
    maxDistance,
    userIds,
  }: AssetDuplicateSearch): Promise<AssetDuplicateResult[]> {
    // Debug override: accept every candidate so the actual distances show up in the logs.
    maxDistance = 1;
    this.logger.warn('searching duplicates', { assetId, maxDistance, userIds });
    const cte = this.assetRepository.createQueryBuilder('asset');
    cte
      .select('search.assetId', 'assetId')
      .addSelect('asset.duplicateId', 'duplicateId')
      .addSelect(`search.embedding <=> :embedding`, 'distance')
      .innerJoin('asset.smartSearch', 'search')
      .where('asset.ownerId IN (:...userIds)')
      .andWhere('asset.id != :assetId')
      .andWhere('asset.isVisible = :isVisible')
      .orderBy('search.embedding <=> :embedding')
      .limit(64)
      .setParameters({ assetId, embedding: asVector(embedding), isVisible: true, userIds });

    const builder = this.assetRepository.manager
      .createQueryBuilder()
      .addCommonTableExpression(cte, 'cte')
      .from('cte', 'res')
      .select('res.*')
      .where('res.distance <= :maxDistance', { maxDistance });

    const results = (await builder.getRawMany()) as AssetDuplicateResult[];

    this.logger.warn('found duplicates', { results });

    return results;
  }

image

PathToLife avatar May 04 '24 07:05 PathToLife

The duplicate threshold will be exposed in the admin settings. I was debating between defaulting to 0.02 or 0.03, so maybe 0.03 is the better default after all.
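
To make the effect of the two candidate defaults concrete, here is a small sketch. It assumes `<=>` in the query above is a cosine-distance operator, as it is in pgvector; `cosineDistance` and `isWithinThreshold` are hypothetical helpers, not functions from this PR.

```typescript
// Cosine distance between two embeddings: 0 means same direction, larger means less similar.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error('embeddings must be non-empty and have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('cosine distance is undefined for a zero vector');
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Same comparison as the `res.distance <= :maxDistance` filter in the query above.
function isWithinThreshold(distance: number, maxDistance: number): boolean {
  return distance <= maxDistance;
}

const reportedDistance = 0.020842433; // distance logged for the resized test image
console.log(isWithinThreshold(reportedDistance, 0.02)); // false
console.log(isWithinThreshold(reportedDistance, 0.03)); // true
```

With the reported distance, the resized image is missed at 0.02 and caught at 0.03.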

mertalev avatar May 04 '24 08:05 mertalev

> In terms of UI, would it make sense if photo stacks were automatically created for near-duplicate photos? It's something that the App-Which-Must-Not-Be-Named introduced a while ago and I personally find it very useful.

This

> Auto-stacking would be very useful and convenient, though, so a later PR to add this functionality would be nice.

> Is it possible to have some kind of notification so that the user can interact with stacking? It's nice when images are stacked automatically, but I find that this sometimes occurs erroneously and I'd like to at least know when a stack has been made.

And this will be separate. Once this PR is implemented, these features can be worked on, so they will basically be built on top of the code in this PR.

NicholasFlamy avatar May 07 '24 12:05 NicholasFlamy

I would suggest taking some inspiration from Samsung Gallery for the UI. When you hit delete duplicates, it selects all but one of each set of duplicates (so if there are 3, it selects 2). I'm pretty sure it goes by date modified or something similar, and if the copies have different resolutions or file sizes, it selects the lower-resolution or smaller file. Then you can hit delete with all of the duplicates selected.

I think implementing something similar wouldn't be too difficult, and it doesn't even have to select for you but the side-by-side view is the most important thing. Having a button to select duplicates which could prefer selecting the lower resolution/filesize would be an added bonus.
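
A minimal sketch of that kind of pre-selection, assuming the rule is "keep the copy with the most pixels, break ties by file size, and select the rest". The `DuplicateCandidate` shape and `selectForDeletion` are hypothetical, and the ranking is an assumption rather than Samsung Gallery's or Immich's actual logic.

```typescript
// Hypothetical candidate shape for one asset in a duplicate group.
interface DuplicateCandidate {
  assetId: string;
  width: number;
  height: number;
  fileSizeInBytes: number;
}

// Return the asset IDs to pre-select for deletion: everything except the best copy.
function selectForDeletion(group: DuplicateCandidate[]): string[] {
  if (group.length < 2) {
    return []; // nothing to deduplicate
  }
  // Sort a copy so the best candidate comes first: more pixels, then larger file.
  const ranked = group.slice().sort((a, b) => {
    const pixelDiff = b.width * b.height - a.width * a.height;
    if (pixelDiff !== 0) {
      return pixelDiff;
    }
    return b.fileSizeInBytes - a.fileSizeInBytes;
  });
  return ranked.slice(1).map((candidate) => candidate.assetId);
}

const preselected = selectForDeletion([
  { assetId: 'full', width: 4000, height: 3000, fileSizeInBytes: 5000000 },
  { assetId: 'small', width: 2000, height: 1500, fileSizeInBytes: 1000000 },
  { assetId: 'recompressed', width: 4000, height: 3000, fileSizeInBytes: 4000000 },
]);
console.log(preselected); // [ 'recompressed', 'small' ]
```

The user would still confirm the selection in the side-by-side view before anything is deleted.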

Screenshot_20240507_110313_Gallery Screenshot_20240507_110337_Gallery Screenshot_20240507_110457_Gallery Screenshot_20240507_110352_Gallery

FYI, the Testing Immich album contains photos I copied over for testing Immich. It is the only album I select in the mobile app, to protect my other photos from bugs etc., so its contents are duplicates.

NicholasFlamy avatar May 07 '24 15:05 NicholasFlamy

Hey there, amazing work on this PR! Just a thought - how about we get rid of those JPEG duplicates when we've got the original HEIC files? What's your take on this?

be1ski avatar May 07 '24 17:05 be1ski

> Hey there, amazing work on this PR! Just a thought - how about we get rid of those JPEG duplicates when we've got the original HEIC files? What's your take on this?

If the files are basically identical then this should pick that up. If you're suggesting that it automatically prefer HEIC, that's for the UI which is coming eventually.

NicholasFlamy avatar May 07 '24 18:05 NicholasFlamy

There will be an option to deduplicate based on resolution, file size, etc. That will get you most of the way there, except in cases where the HEIF is smaller than JPEG purely because it's a more efficient format.

Doing it based on format sounds iffy. You can have a high resolution, high quality JPEG that looks similar to a poor quality HEIF, not to mention that we'd need an arbitrary ranking for which format is better.

We can always expand on this in the future, possibly with a measure of compression artifacts and selecting the image with the least artifacts. But for the first cut, it's better to keep it simple.

mertalev avatar May 07 '24 18:05 mertalev

@mertalev what do you think? Screenshot_20240507_185727_Chrome

I haven't done much but at least you can get out of there.

NicholasFlamy avatar May 07 '24 22:05 NicholasFlamy

Nice! I'll reduce the scope of this PR to just be the backend changes so we can do the UI separately.

mertalev avatar May 08 '24 00:05 mertalev

After removing the UI changes, this PR is ready for review. The current behavior is that the feature is disabled by default and not exposed to the user except through the config file. The only blocker is that a seemingly unrelated E2E test is failing.

mertalev avatar May 08 '24 06:05 mertalev

Another compliment about this functionality: since it's AI-based, it picks up two different pictures taken directly after one another at slightly different angles or distances. So when I take multiple pictures in case one of them is blurry, but later end up with a bunch of extras, this should be the solution.

Long screenshot: Screenshot_20240508_082317_Chrome

Ignore the sidebar, I tapped the button which scrolls down and adds to the screenshot and it did that.

NicholasFlamy avatar May 08 '24 12:05 NicholasFlamy

In addition to hashing, the EXIF spec contains a field for OriginalFileName, which could be used to match duplicates created from an original. A lot of software writes this field, and it would remove the need to determine whether the HEIC or the JPEG (for example) is the original.

AngelaDMerkel avatar May 08 '24 14:05 AngelaDMerkel

> In addition to hashing, the EXIF spec contains a field for OriginalFileName, which could be used to match duplicates created from an original. A lot of software writes this field, and it would remove the need to determine whether the HEIC or the JPEG (for example) is the original.

I think that would be saved for later, for the UI. Alex and I discussed UI development: Alex will develop most of it, but I'll try to start on it this week. We are thinking of a Utilities page that contains the deduplication page. I am taking note of what you said. I'm not sure how much logic will go into the deduplication page, but that seems like a good idea.

NicholasFlamy avatar May 08 '24 14:05 NicholasFlamy