Feature: detecting duplicate directories

Open ivanperez-keera opened this issue 8 years ago • 12 comments

An advanced feature I've been missing in these duplicate finding programs (fdupes, jdupes, and I think dupd) is finding duplicate directories.

Often, when backups are desynced or when things are moved around, one would like to know if a whole directory is the same as some other directory elsewhere.

I suspect this might be possible by "just" creating a hash for each directory, from the list of all the hashes of everything inside, sorted alphabetically, and built recursively bottom-up.

Two directories could count as duplicates either if they contain the same file contents regardless of names, or if they contain the same contents under the same file names.
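A minimal sketch of the bottom-up idea described above, written in Python for illustration only (it is not jdupes code). The names `file_digest`, `dir_digest`, and the `include_names` switch are hypothetical; SHA-256 and the choice to skip symlinks entirely are assumptions made to keep the example short and to sidestep cycles.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dir_digest(path, include_names=True):
    """Hash a directory bottom-up from the sorted digests of its entries.

    Assumption: symlinks are skipped, so the walk cannot loop.
    """
    entries = []
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_symlink():
                continue
            if entry.is_dir(follow_symlinks=False):
                kind, digest = "d", dir_digest(entry.path, include_names)
            elif entry.is_file(follow_symlinks=False):
                kind, digest = "f", file_digest(entry.path)
            else:
                continue  # sockets, devices, etc. are ignored
            name = entry.name if include_names else ""
            entries.append((kind, digest, name))
    h = hashlib.sha256()
    # Sorting makes the result independent of directory listing order.
    for kind, digest, name in sorted(entries):
        h.update(f"{kind}:{digest}:{name}\0".encode("utf-8", "surrogateescape"))
    return h.hexdigest()
```

With `include_names=False` two directories match when their contents match under any names; with `include_names=True` the file and subdirectory names must match as well. Either way every file gets read, which is the cost discussed later in the thread.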

I suspect something smart would need to be done to handle cycles and symlinks. Also, discarding empty directories might be useful.

ivanperez-keera avatar Dec 16 '17 19:12 ivanperez-keera

I have a theoretical solution for this already (it's not quite as simple as "hash for each directory" in practice), I just haven't had time to implement it and I forgot to add it to the to-do list, so I'm tacking this on to the 2.1 milestone.

jbruchon avatar Jan 02 '18 03:01 jbruchon

Sounds great. Looking forward to it :)

Out of curiosity, why aren't hashes of sorted filenames/hashes enough?

ivanperez-keera avatar Jan 03 '18 09:01 ivanperez-keera

That might be enough for a directory with identical filenames, but it's not enough for finding a directory with identical files that may or may not differ in name. For example, you have a bunch of pictures called IMG_06xx.JPG from your camera, and you dumped the same pictures on an old XP machine and told it to name them "ivanperez-keera party", so the files are named "ivanperez-keera party 0xx.jpg" but are identical. You merge the XP machine's data set into your current data set and now you have two identical camera dumps with different names. You'd want to catch that the "20090105_IMAGES" and "ivanperez-keera party" directories are identical and probably delete one of them wholesale instead of individually going "keep 1, keep 1, keep 1, keep 1" in interactive delete mode.

I'd like to have a "directory full of identical filenames" feature too, but most people want to find directories full of identical file data, not just names. One problem is that finding identical file data requires hashing all of the files in a directory and its subdirectories, which effectively neuters all the performance boosts that the jdupes algorithm was crafted for. There are some sanity checks that help, though, like checking the cumulative sizes of all files and subdirectories within each directory and only attempting a full directory match where those numbers are equal. Part of the reason I want to hold off until 2.1 is that I'm planning some major re-organization of the code for 2.0, which will make new matching concepts like this a lot easier to implement.
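The cumulative-size sanity check mentioned above could look roughly like the following Python sketch. It is an illustration under assumptions, not the planned jdupes implementation: the function names `dir_totals` and `candidate_groups` are hypothetical, and adding a file count next to the byte total is an extra assumption. No file contents are read at this stage.

```python
import os
from collections import defaultdict


def dir_totals(root):
    """Map each directory under root to (total_bytes, file_count), subtrees included."""
    totals = {}
    # topdown=False visits children before parents, so subtotals are ready.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        size = count = 0
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):  # assumption: symlinks are ignored
                size += os.path.getsize(full)
                count += 1
        for name in dirnames:
            # Symlinked directories are not walked, so they have no entry.
            sub = totals.get(os.path.join(dirpath, name))
            if sub is not None:
                size += sub[0]
                count += sub[1]
        totals[dirpath] = (size, count)
    return totals


def candidate_groups(roots):
    """Group directories whose cumulative size and file count are equal."""
    groups = defaultdict(list)
    for root in roots:
        for path, key in dir_totals(root).items():
            if key[1] > 0:  # skip directories that contain no files at all
                groups[key].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Only directories that land in the same group would need the expensive full comparison. A parent whose sole content is one subdirectory shares that subdirectory's totals, so the later comparison still has to reject such pairs.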

jbruchon avatar Jan 03 '18 14:01 jbruchon

> That might be enough for a directory with identical filenames

No, no, that's what I mean by hash. I meant content hash (some checksum that uniquely identifies the file, regardless of its name). A counter for the number of times the file with a given checksum appears would make sense.
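As a rough Python illustration of the "checksum plus counter" idea (a sketch under assumptions, not anything jdupes provides): the hypothetical `content_counter` builds a multiset of content hashes for a directory tree, and `same_contents` compares two such multisets. SHA-256 is an arbitrary choice, files are read whole for brevity, and symlinks are skipped.

```python
import hashlib
import os
from collections import Counter


def content_counter(root):
    """Count how many times each content hash occurs under root, ignoring names."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # assumption: symlinks are not followed
            with open(full, "rb") as f:
                # Reads the whole file at once; fine for a sketch only.
                counts[hashlib.sha256(f.read()).hexdigest()] += 1
    return counts


def same_contents(dir_a, dir_b):
    """True if both trees hold the same file contents the same number of times."""
    return content_counter(dir_a) == content_counter(dir_b)
```

The counter matters when a directory holds several copies of one file: a plain set of hashes would treat one copy and three copies as equal. Equal hashes are also only strong evidence of equal data, so a byte-for-byte check would be needed to be certain.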

> I'd like to have a "directory full of identical filenames" feature too.

Yep, this additional feature would be useful.

> Part of the reason I want to hold off until 2.1 is that I'm planning some major re-organization of the code [...].

Makes a lot of sense. Thanks for taking the time to explain.

ivanperez-keera avatar Jan 03 '18 20:01 ivanperez-keera

Maybe it goes without saying, but I'd like to see a version of this with the symlink replacement functionality. In that case, I imagine that it should only do the replacement if all filenames are identical in each directory along with all file content being identical. The goal here being to preserve functionality of programs using files in the original or symlinked locations.
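The stricter condition described above (identical names and identical content before a directory is replaced by a symlink) can be sketched in Python as a pre-check. This is only an illustration with hypothetical names (`relative_files`, `safe_to_symlink`); it performs no replacement, and it assumes empty subdirectories can be ignored.

```python
import filecmp
import os


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def safe_to_symlink(original, duplicate):
    """True only if both trees have the same relative file names and bytes."""
    names = relative_files(original)
    if names != relative_files(duplicate):
        return False
    # shallow=False forces a byte-for-byte comparison of each pair.
    return all(
        filecmp.cmp(os.path.join(original, rel),
                    os.path.join(duplicate, rel),
                    shallow=False)
        for rel in names
    )
```

A tool would replace `duplicate` with a link to `original` only when this returns true, so that any program opening a file by its old path still finds the same name and data.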

MegaByte avatar Jul 11 '18 23:07 MegaByte

@MegaByte The general plan in jdupes-2.0 is to generalize the program's behavior such that all internal behavior is abstracted away and decoupled. Every step of the process that is just hard-coded assumptions now is to become changeable. Headings from the 2.0 planning document outlining portions that will have added flexibility and manipulation options include:

  • File list input modes
  • Database of hashes
  • Matching modes (allow inversion where appropriate)
  • Single-file include/exclude filters (should work with file list input too)
  • Multi-file exclude filters (should work with file list input too)
  • Multi-file exclude (no-action) filter modifiers
  • File list sorts
  • Action selection modes
  • Actions

As it stands, the program doesn't allow all possible combinations of file selections and file actions which has always been a major annoyance for me. Why can't we ask for a full match printout AND a summary in one shot? Why can't we hard link matches that meet one set of criteria and symlink matches that meet a different set? Because the program was never written properly for such expansion in the first place. Every feature I add is ultimately an ugly hack. There's plenty of duplicated code hiding in there that isn't easily abstracted to a clean shared function without more work than it's worth. Never mind that one of my long-term plans is to make a core library for use by front-ends (ncurses, GUI, etc.)

So, long story short, your desired feature would be available by default once jdupes-2.0 drops. Think something like this (not going to be the actual syntax):

root@host:/root# jdupes --select-recursive /home/user/dir1 /home/user/dir2 /home/user/dir3 /mnt/device/something --match=dupedirs-only --matchmod=dupedir-norecurse --read-only=/home/user/dir3 --actions=printmatchdirs,symlink,sizechange

[dupedirs-only:match] '/home/user/dir1/something' = '/home/user/dir2/backup/tmp/something'
[dupedirs-only:match] '/home/user/dir1/something' = '/mnt/device/something/really/far/down/i/didnt/know/was/there'
[dupedirs-only:match] '/home/user/dir1/something' = '/home/user/dir3/something'
[dupedirs-only:match] '/home/user/dir2/backup/tmp/something' = '/mnt/device/something/really/far/down/i/didnt/know/was/there'
[dupedirs-only:match] '/home/user/dir2/backup/tmp/something' = '/home/user/dir3/something'
[dupedirs-only:match] '/home/user/dir3/something' = '/mnt/device/something/really/far/down/i/didnt/know/was/there'

[symlink:directory:norecurse] '/home/user/dir1/something' => '/home/user/dir2/backup/tmp/something'
[NOTICE] [symlink:directory:norecurse] not linking (read-only): '/home/user/dir1/something' => '/home/user/dir3/something'
[symlink:directory:norecurse] '/home/user/dir1/something' => '/mnt/device/something/really/far/down/i/didnt/know/was/there'
[NOTICE] [symlink:directory:norecurse] not linking (read-only): '/home/user/dir2/backup/tmp/something' => '/home/user/dir3/something'
[NOTICE] [symlink:directory:norecurse] not linking (changed since checking): '/home/user/dir2/backup/tmp/something' => '/mnt/device/something/really/far/down/i/didnt/know/was/there'
[NOTICE] [symlink:directory:norecurse] not linking (changed since checking): '/home/user/dir3/something' => '/mnt/device/something/really/far/down/i/didnt/know/was/there'

While it is a crude mock-up, I hope it makes my vision clearer in the context you're interested in.

jbruchon avatar Jul 12 '18 00:07 jbruchon

I saw that my last post was a dup of #8

Olen avatar Aug 28 '18 10:08 Olen

Hi @jbruchon . I wonder if you've had a chance to look into this at all, or if perhaps it could be added in one of the near-future releases.

It is a hard feature to get right but, 2 years later, no duplicate-file-finding software addresses this. It would be an amazing feature to have imo.

ivanperez-keera avatar Feb 08 '20 13:02 ivanperez-keera

@ivanperez-keera I have a strategy for doing this in mind. I am in the process of rewriting a portion of the core algorithm that is causing a lot of maintenance issues. Once that's done, I can look at this again.

jbruchon avatar Feb 08 '20 13:02 jbruchon

Great! Thanks for letting me know.

ivanperez-keera avatar Feb 08 '20 13:02 ivanperez-keera

Still not implemented? Any chance it will get implemented eventually?

MatthiasLohr avatar Nov 23 '22 19:11 MatthiasLohr

I can't say if it will. I would find it useful myself, but I'm pretty keen on just rewriting the entire program before working on any additional features that aren't easy to implement. This is not a simple feature to implement by any means.

jbruchon avatar Nov 23 '22 20:11 jbruchon