[Feature request] Add command line parameter to treat symbolic links to folders like folders
I've noticed that if I give findimagedupes a path like e.g. ~/a/pictures/vacation, then it will work if a is a symbolic link to e.g. /mnt/storage/pictures, but it will do nothing if vacation is a symbolic link to somewhere else. So it would be quite nice to have findimagedupes treat a symbolic link to a folder just like any other folder - maybe toggled via a command line parameter so as not to break someone else's workflow.
- (1) Internally, all paths are canonicalised using realpath. By storing an absolute path, if a fingerprint database is moved to a different directory, the files it points to can still be found.
- (2) Currently, to avoid infinite looping, the program never follows symbolic links to directories.
Implementing the feature you are asking for requires deciding on the desired behaviour for symbolic link handling in general. If I implement a --no-realpath-type option, I think when it is used:
- all paths passed as command arguments must already be absolute (for (1)) - see the sketch after this list
- infinite loops must still be impossible (for (2))
- what other considerations / edge cases are there?
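For the first point, the check itself is simple. A minimal sketch, assuming a hypothetical --no-realpath-type option (none of this code is in findimagedupes yet):

use File::Spec;

# Reject relative arguments when realpath will not be used to absolutise them.
for my $arg (@ARGV) {
    die "--no-realpath-type requires absolute paths, got: $arg\n"
        unless File::Spec->file_name_is_absolute($arg);
}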
GNU find has -L and -H options. Busybox find has no such option but does seem to detect potentially-infinite looping. What other similar functionality exists in other programs that we can examine?
I can see a naive approach to loop detection for findimagedupes, but my feeling is that it could be quite inefficient (in time and storage). I'd have to have a think about how to implement it efficiently.
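For the record, the naive approach I have in mind is to remember the device/inode pair of every directory entered and refuse to descend into one seen before. A rough sketch only (not findimagedupes code; scan_dir and %seen are made up for this example):

use strict;
use warnings;

my %seen;    # keys are "dev:ino" of directories already entered

sub scan_dir {
    my ($dir) = @_;
    my ($dev, $ino) = stat $dir or return;   # stat follows symlinks
    return if $seen{"$dev:$ino"}++;          # seen before: descending again could loop
    opendir(my $dh, $dir) or return;
    while (defined(my $entry = readdir $dh)) {
        next if $entry eq '.' || $entry eq '..';
        my $path = "$dir/$entry";
        if (-d $path) {                      # -d is true for symlinks to directories too
            scan_dir($path);
        } else {
            print "$path\n";                 # candidate file to fingerprint
        }
    }
    closedir $dh;
}

scan_dir($_) for @ARGV;

The %seen hash is the storage cost I am wary of: it grows by one entry for every directory visited.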
I just noticed you have ~ in the path.
(Unquoted) tilde-expansion is done by POSIX shells before findimagedupes sees it.
Can you give me some examples of working/non-working links (source path and link contents, as shown by ls -l for example)?
You may find that appending a trailing / to command-line arguments that are symbolic links to directories fixes your problem.
With / at the end I get an incredibly weird behaviour: Some folders work just as intended, while others mark all files as duplicates with two different paths per file (the one with the symlink and the one with the value of the symlink resolved). I will have to look into how the two kinds of folders differ later.
I use findimagedupes in a small custom Python script to generate all the paths etc., so I tested it with one example, which worked, but running it with all of them broke something. But weirdly enough, not on all files/folders.
Maybe because it still used the old fingerprint files? (I will look into that in the evening)
Note that without the slash, if the final element of a path is a symlink (like your vacation example), it will simply be ignored. You should see warnings if that happens.
My understanding of Perl's realpath is that the result is always "canonical" - no path elements are symlinks and none are "." or "..". If you are seeing output where the path still contains symlinks, something odd is happening.
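For example, a quick shell check (the path here is hypothetical; substitute one of your symlinked files):

perl -MCwd=realpath -le 'print realpath($_) for @ARGV' ~/a/pictures/vacation/example.jpg
# the output should contain no symlink components and no . or .. elements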
Unfortunately it's tricky for me to debug without any concrete examples to look at.
I found the cause of that weird behaviour: one drive was getting rather full, so some subfolders were moved to a new location and then the old structure was recreated via symlinks. So let's say the old path was /mnt/storage1/a; then the new one would be /mnt/storage2/a, with a symlink at the old place pointing to the new one.
Findimagedupes was used at the old place before it was moved and a fingerprint file was created for the folder a at e.g. ~/fingerprints/a. Now if it is used at /mnt/storage1/a/ with the old fingerprint file, it will find each file "twice", something like /mnt/storage1/a/example.jpg and /mnt/storage2/a/example.jpg. They obviously have the same content (since they are the same files), so they are found as duplicates.
The simplest (but sadly slow) solution seems to be to just remove the old fingerprint files and let them be regenerated, since the files that weren't flagged were those that were only added after the folders were moved.
Yes, that makes sense. findimagedupes has no way to know that two different paths are actually the same file. This is certainly a situation where the ability to not use realpath would be useful. I'll keep this issue open as a possible future option to add.
If you know which subpaths are due to such moves, you may be able to filter findimagedupes output with some custom shellscript code (-i / -I) to rewrite and de-dupe it, but this is probably not sensible for automated use.
At some point I'd like to replace the existing tied-hash Berkeley DB fingerprint database format with SQLite, which would be very easy to edit manually. You may be able to update your current fingerprint databases (test on copies! this code overwrites in place) with something like:
VERBOSE=1 perl -MDB_File -le '
  for (@ARGV) {
    tie %db, "DB_File", $_ or die $!;
    for $old (keys %db) {
      $new = $old;
      $new =~ s{/old/sub/path1/}{/new/sub/path1/};
      $new =~ s{/old/sub/path2/}{/new/sub/path2/};
      # ...
      if ($new ne $old) {
        warn "$old --> $new" if $ENV{VERBOSE};
        $db{$new} = delete $db{$old};
      }
    }
    untie %db;
  }
' fpdb1 fpdb2 ...
This code doesn't check if path $new already exists. Add an assertion before the warn if that is a concern.
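For example, a line like this immediately before the warn would abort rather than overwrite an existing entry:

die "$new already exists in database, refusing to overwrite\n" if exists $db{$new};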
Implementation question for you.
Given this hierarchy:
/
├── L1 -> d1/
├── d1/
│   └── d11/
│       └── f
└── d2/
    └── L21 -> ../d1/d11/
which has three different paths to file f:
/L1/d11/f
/d1/d11/f
/d2/L21/f
how should findimagedupes choose which path to store if following symlinks and not using realpath?
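For what it's worth, whichever path ends up stored, the three can still be recognised as the same file without realpath, because they stat to the same device/inode pair:

perl -le 'printf "%d:%d  %s\n", (stat)[0,1], $_ for @ARGV' /L1/d11/f /d1/d11/f /d2/L21/f
# all three lines show the same dev:ino; only the path differs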