[FR] copy&import: select newest + handle naming conflicts
Is your feature request related to a problem? Please describe.
When importing photos (from an sd-card), it's difficult to select only those images I want to import now.
- in most cases I just want to import the photos from my last shoot (so the newest ones, e.g. from today)
- in general I don't want to import photos I already imported before, but I don't want to miss new photos
Describe the solution you'd like
- the list of images should be grouped by date
- the date-separator should not be midnight, but e.g. 03:00 am, maybe configurable
- it should be easy to select all images of my last shoot with one click
  - checkbox 'select newest pictures'
    - selects the most recent date-group
- the design should make it obvious which images will be imported
  - checkbox = will be imported
  - barely visible = won't be imported
- there should be a preview of the renamed filename
- naming conflicts should be highlighted
- If there are conflicts, clicking 'import' should show a popup with the following options:
  - don't import conflicts
  - overwrite conflicts
  - add a sequence-number
  - manually change naming-pattern
This way it's easy to avoid duplicates: the user can easily select the last shoot, and already imported images are very likely to generate a naming-conflict-warning.
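To make the grouping concrete, here is a minimal sketch (plain Python, not darktable code; the 03:00 cutoff is the assumed default, and file mtimes stand in for the Exif capture times a real implementation would read):

```python
from collections import defaultdict
from datetime import date, datetime, timedelta
from pathlib import Path

CUTOFF_HOURS = 3  # assumed configurable date-separator (03:00 instead of midnight)

def shoot_day(ts: datetime) -> date:
    # A photo taken at 01:30 still belongs to the previous day's shoot.
    return (ts - timedelta(hours=CUTOFF_HOURS)).date()

def newest_group(folder: str) -> list[Path]:
    groups: dict[date, list[Path]] = defaultdict(list)
    for f in Path(folder).iterdir():
        if f.is_file():
            ts = datetime.fromtimestamp(f.stat().st_mtime)
            groups[shoot_day(ts)].append(f)
    if not groups:
        return []
    return sorted(groups[max(groups)])  # the most recent date-group

# e.g. preselect newest_group("/media/sdcard/DCIM") in the import dialog
```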
Additional context
mockup:
Alternatives
An alternative would be a duplicate check (see PR #11016)
the list of images should be grouped by date
The user can already sort out the images by date & time. It is trivial to select the right ones.
- checkbox 'select newest pictures'
What are the criteria for "newest" grouping?
the design should make it obvious which images will be imported
I think you have already suggested inverting the selection. I agree we could improve the visual effect.
naming conflicts should be highlighted
I love the principle of the destination name conflict detection. But what is the cost, performance-wise?
already imported images are very likely to generate a naming-conflict-warning
That works as long as the user doesn't change the naming rules. What if the user does change them?
thanks for the quick feedback!
The user can already sort out the images by date & time. It is trivial to select the right ones.
It's a matter of opinion whether it's trivial; for me it's quite error-prone, which is why I would like to improve it.
What are the criteria for "newest" grouping?
"selects the most recent date-group"
I love the principle of the destination name conflict detection. But what is the cost, performance-wise?
That's true, we'd have to test it. If it doesn't work on-the-fly, there could be a separate button 'verify' before 'rename'. That way we would only check the selected images, not the complete list.
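To sketch what that 'verify' pass over only the selected images could look like (this is not darktable's API; render_name is a hypothetical stand-in for applying the user's naming pattern):

```python
from pathlib import Path
from typing import Callable

def find_conflicts(selected: list[Path], dest: Path,
                   render_name: Callable[[Path], str]) -> dict[Path, str]:
    existing = {p.name for p in dest.iterdir()}  # one directory listing
    conflicts: dict[Path, str] = {}
    seen: set[str] = set()
    for src in selected:
        target = render_name(src)  # apply the user's naming pattern
        if target in existing or target in seen:
            conflicts[src] = target  # highlight these in the preview
        seen.add(target)  # also catches two selected images mapping to one name
    return conflicts
```

Checking names needs no file reads, only a directory listing, which is why it should stay cheap even on slow cards.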
That works as long as the user doesn't change the naming rules. What if the user does change them?
Correct: if the user changes the naming rule there won't be a conflict-warning; that's why I said 'very likely'. In my opinion there won't be a perfect duplicate-prevention, but I think the combination of the two aspects, "only select newest" + "check if a file with the same name already exists", is a transparent and efficient method.
This issue did not get any activity in the past 60 days and will be closed in 365 days if no update occurs. Please check if the master branch has fixed it and report again or close the issue.
in general I don't want to import photos I already imported before, but I don't want to miss new photos
I think the best way to address this is by storing a hash for each image in library.db, and then on import check against those hashes. Shotwell implements this approach and as a result is able to detect duplicate images even if they have been renamed.
One of the limitations of the implementation of #11016 (besides not checking hashes) is that it only detects duplicates of images that are imported after the feature was implemented. This could be addressed by searching the library for photos that are missing hashes (or any other fields required for duplicate detection) on start-up and computing any that are missing. This could take a while, but in most cases it would only have to be done once for a given library.
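As a rough sketch of that Shotwell-style approach (the image_hashes table is a made-up stand-in, not darktable's actual library.db schema):

```python
import hashlib
import sqlite3
from pathlib import Path

db = sqlite3.connect("hashes.db")  # stand-in; NOT darktable's library.db schema
db.execute("CREATE TABLE IF NOT EXISTS image_hashes (hash TEXT PRIMARY KEY)")

def file_hash(path: Path) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MB chunks
            h.update(chunk)
    return h.hexdigest()

def seen_before(path: Path) -> bool:
    h = file_hash(path)
    if db.execute("SELECT 1 FROM image_hashes WHERE hash = ?", (h,)).fetchone():
        return True  # duplicate, even if the file was renamed since import
    db.execute("INSERT INTO image_hashes VALUES (?)", (h,))
    db.commit()
    return False
```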
I love the principle of the destination name conflict detection. But what is the cost, performance-wise?
I would think that a good implementation of this would have an acceptable impact on performance. Checking filenames does not even require opening the files.
The author of #11016 did not implement hashes based on an assumption that the performance cost of hashes would be too high. However, Shotwell implements hash checking as I mentioned above, and it is reasonably performant with imports. An option could be provided to turn off hash checking for users who find the performance impact unacceptable.
Is your feature request related to a problem? Please describe.
When importing photos (from an sd-card), it's difficult to select only those images I want to import now.
- in most cases I just want to import the photos from my last shoot (so the newest ones, e.g. from today)
- in general I don't want to import photos I already imported before, but I don't want to miss new photos
don't store new photos in old directories, i.e.: new photos, new directory.
I would think that a good implementation of this would have an acceptable impact on performance.
That's the main issue, and I don't think this is true. When you import 1800 photos (I've just done that) from a day of shooting for a spectacle, and each image is around 50 MB (so a grand total of 90 GB), there is no way this can be fast, especially if you compute a hash from an external card or camera, which are somewhat slow.
When you import 1800 photos (I've just done that) from a day of shooting for a spectacle, and each image is around 50 MB (so a grand total of 90 GB), there is no way this can be fast, especially if you compute a hash from an external card or camera, which are somewhat slow.
That's fair; I'm usually working with a smaller number of images than that. With Shotwell I've imported 700+ images at once and gotten good performance, but that was copying from an HDD rather than an SD card.
That said, hash checking or any other potentially expensive check can be made optional, so it's available to users who can afford the cost but doesn't impact those who can't.
don't store new photos in old directories, i.e.: new photos, new directory.
Importing into a new directory does nothing to prevent duplicates from being present in the import.
Also, DT allows users to use a custom directory structure; putting every import in its own directory is not the only option (even though it's the default). Users may well have good reason to organize their photos a different way. As long as DT allows users to customize the directory structure, they will reasonably expect that any logic for handling duplicate detection and naming conflicts will account for other directory structures besides the default.
I linked this to 2 other tickets that also aim to improve this module.
Regarding duplication, is it not enough to check for duplicates during the import itself? The bigger the file and the slower the copy process, the more negligible a quick check for the target name inside the folder should be. Does anybody want to guess how much time it takes to check for duplicates? If I import 700 images, that probably takes me 3 minutes. If I add 5 seconds at the start or spend an additional 0.1 seconds per image, will I notice that?
What is the current behavior regarding duplicates on import?
Regarding duplication, is it not enough to check for duplicates during the import itself? The bigger the file and the slower the copy process, the more negligible a quick check for the target name inside the folder should be. Does anybody want to guess how much time it takes to check for duplicates? If I import 700 images, that probably takes me 3 minutes. If I add 5 seconds at the start or spend an additional 0.1 seconds per image, will I notice that?
It would depend on how it's implemented. Computing hashes/checksums on the full files would probably result in a noticeable increase. But if the duplicate detection only looks at a portion of each file, the time could probably be reduced to something you wouldn't notice.
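To illustrate, a sketch of such a partial check; the 512 KB window and the size-mixing are my assumptions, roughly matching the md5sum-on-512000-bytes timing measured below:

```python
import hashlib
from pathlib import Path

PARTIAL_BYTES = 512_000  # assumed window; trades speed against collision risk

def partial_hash(path: Path) -> str:
    with open(path, "rb") as f:
        head = f.read(PARTIAL_BYTES)  # read only the start of the file
    # Mix in the file size so two files sharing a common header still differ.
    return hashlib.md5(head + str(path.stat().st_size).encode()).hexdigest()
```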
I just ran a few tests reading files directly off my camera with various programs to see how long different operations take. The camera currently contains 593 RAW files and 645 JPEG files, 13 GB in total. Here are my results:
- Loading a directory for import in Shotwell (not actually importing): 98 seconds (0.08 s per file)
- md5sum on each RAW file: about 4.5 minutes (0.45 s per file)
- md5sum on the first 512 bytes of each RAW file: 2.23 seconds (0.0038 s per file)
- md5sum on the first 512000 bytes of each RAW file: 14.37 seconds (0.024 s per file)
- Copying all RAW files to a folder on another disk: about 5.5 minutes (0.55 s per file)
Shotwell is doing more than checking for duplicates here; it also generates a thumbnail for each image while loading the directory. But judging from the fact that Shotwell loaded the directory in less time than it took to run md5sum on all the RAW files, Shotwell must be reading only part of each file to accomplish its duplicate check and generate the thumbnails.
From this, it looks like a duplicate detection process that involves computing a complete checksum for each file in its entirety could increase the import time considerably (perhaps doubling or tripling the import time), but a duplicate check based on just a portion of the file's data (such as computing a hash from a few tens of kilobytes sampled from the file) could probably be made quick enough that you wouldn't notice the difference during import.
If the check was done at selection time (i.e. with the import dialog still open and before the import starts), the time required to check for duplicates would probably be noticeable even for a fast implementation when importing from a directory with hundreds or thousands of images, but could perhaps be made tolerable if some sort of progress indication were shown while the checks are run.
What is the current behavior regarding duplicates on import?
It looks like there is a form of duplicate detection in master since the merge of #11016 (which includes DT 4.0.0 and later). But #11016 only checks filename, date, and size, and only detects duplicates of images imported since that PR was merged, and not images imported by older versions of DT. Others might be able to provide a more complete answer, such as whether there has been further development on this since the merge of #11016.
Sorry, is this feature request about detecting naming collisions during import or about detecting duplicates using metadata and/or pixel data?
Checking for naming collisions makes sense during import. But I don't need darktable to check all my file hashes every time, because I do not reimport the same images again and again. There is very low risk for me to create duplicates during my regular usage of darktable. Yes, there could be some old duplicates in my library right now from that one time where I copied a folder by accident, and I would like to run a deduplication task to get rid of them. But I only need that ONCE. Unless I make a really big mistake and import some old backup or clone my whole image directory, I should not need to compare the hashes during every import.
Can this not be put into a module or Lua extension that can be run manually? That could also generate the missing data so that #11016 works on pre-4.0 images. I just fail to see the use case where checking hashes is necessary on a daily basis.
My intention for this FR was NOT to check for duplicates.
My primary goal was to select the correct files on my sd-card to import, which at the moment is absolutely not obvious for me.
The second idea (which maybe should have been a separate FR) was to show a preview of the renamed filename. I thought of naming conflicts mainly because of the renaming-pattern (e.g. 2 photos within 1 second using hhmmss - see my screenshot above).
As a side effect this file-name-check would also find already-imported photos using the same naming-pattern.
But I never thought about checking hashes of all photos on every import.
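To make the hhmmss case concrete, a small sketch of the collision and the 'add a sequence-number' option from the popup (pattern and extension made up):

```python
from datetime import datetime

def unique_name(ts: datetime, taken: set[str]) -> str:
    base = ts.strftime("%H%M%S")  # hhmmss naming pattern
    name, seq = base + ".raw", 1
    while name in taken:
        name = f"{base}_{seq}.raw"  # resolve the conflict with a sequence-number
        seq += 1
    taken.add(name)
    return name

taken: set[str] = set()
shot = datetime(2024, 6, 1, 14, 30, 59)
print(unique_name(shot, taken))  # 143059.raw
print(unique_name(shot, taken))  # 143059_1.raw (second frame in the same second)
```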
Sorry, is this feature request about detecting naming collisions during import or about detecting duplicates using metadata and/or pixel data?
I originally tried to suggest improvements to duplicate detection in #12180, but was told my FR was a dup of this one, so I added my comments here, even though duplicate detection was not the OP's intent. I'd be OK with moving the discussion of hashes to a different FR if people prefer that.
Checking for naming collisions makes sense during import. But I don't need darktable to check all my file hashes every time, because I do not reimport the same images again and again.
The way this happens for me is I leave images on my camera after importing into DT. The reason for that is I don't want to delete them from the camera until the copies imported to DT have been backed up. Occasionally I'll also leave some images on the camera for the purpose of displaying them on the camera's screen later. As a result I often end up importing new images from my camera while some old images are still there.
Can this not be put into a module or Lua extension that can be run manually? That could also generate the missing data so that #11016 works on pre-4.0 images.
That sounds like a viable option to me, though I'm not an expert on what is or isn't possible through modules or Lua extensions.
I originally tried to suggest improvements to duplicate detection in https://github.com/darktable-org/darktable/issues/12180, but was told my FR was a dup of this one, so I added my comments here, even though duplicate detection was not the OP's intent.
Ah ok. Yeah, maybe a dedicated PR might be better, since this is mainly about filenames.
This issue did not get any activity in the past 60 days and will be closed in 365 days if no update occurs. Please check if the master branch has fixed it and report again or close the issue.
This issue was closed because it has been inactive for 300 days since being marked as stale. Please check if the newest release or nightly build has it fixed. Please, create a new issue if the issue is not fixed.
I would love it if this were implemented. Without it I risk reimporting duplicates and then have to delete them manually. I wonder why "already imported picture" detection wasn't done using hashes initially. Too expensive?