Check whether files in folder[param1] are present in folder[param2], without also searching for dupes within folder[param2]
The use case is a large "backup" folder and various "have I put this in my backup folder yet?" folders. For example, copying files from a smartphone onto a NAS. I know there are duplicates in the large "backup" folder. Some are deliberate. I don't want to know about them every time I check (something about rubbing my nose in past mistakes).
My solution is to add -F (--compare-first), which skips the comparison of files that appear only in directories after parameter 1: https://github.com/jyukumite/jdupes_mjs/commit/d710f920175f49d59d6a127b4e36c4a48afc1b6c (plus https://github.com/jyukumite/jdupes_mjs/commit/bfdcdc9192c58cd61689e4482803244463acb54a to fix a typo).
(The rather-difficult "caching hashes" feature would also have covered this, but I understand that's hard.)
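For what it's worth, the change boils down to a filter on which candidate pairs get compared at all. Here is a minimal sketch of that idea in C, not the code from the commits above; the `param_index` field is a hypothetical stand-in for however the scanner records which command-line parameter a file was found under:

```c
/* Sketch only: the -F/--compare-first idea, not the actual jdupes code. */

struct file_entry {
    const char *path;
    int param_index;   /* 0 = first directory on the command line (assumed field) */
};

/* With --compare-first enabled, skip any candidate pair in which neither
 * file came from the first directory, so duplicates that live purely in
 * the later directories are never compared or reported. */
static int want_comparison(const struct file_entry *a,
                           const struct file_entry *b,
                           int compare_first)
{
    if (!compare_first)
        return 1;
    return (a->param_index == 0 || b->param_index == 0);
}
```

That matches the stated intent: only pairs involving at least one file under parameter 1 are compared, so duplicates that sit entirely inside the backup folder stay out of the report.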
Original 22m6s
# jdupes -r dir1 dir2 >dupes4
Scanning: 46878 files, 1323 items (in 2 specified)
real 22m6.498s
user 0m52.972s
sys 1m42.872s
# cat dupes4 | wc -l
31208
13m38s with -F
# jdupes -rF dir1 dir2 >dupes3
WARNING: -F/--compare-first will not report duplicates if one file is not in first directory
Scanning: 46878 files, 1323 items (in 2 specified)
real 13m37.892s
user 0m30.584s
sys 0m54.948s
# cat dupes3 | wc -l
25440
17m2s with -Q (just use hashes)
# jdupes -rQ dir1 dir2 >dupes2
BIG FAT WARNING: -Q/--quick MAY BE DANGEROUS! Read the manual!
Scanning: 46878 files, 1323 items (in 2 specified)
real 17m1.530s
user 0m35.652s
sys 1m11.508s
# cat dupes2 | wc -l
31208
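(For context on the BIG FAT WARNING: quick mode accepts a matching hash as a duplicate and skips the byte-for-byte confirmation that normally follows. A rough sketch of that difference in C, not jdupes' actual code:)

```c
/* Sketch only: what -Q/--quick changes, not the actual jdupes code.
 * By this point the candidate files are assumed to have equal sizes
 * and equal hashes; the question is whether that is enough. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Byte-for-byte confirmation; false on any mismatch or open failure. */
static bool same_bytes(const char *path_a, const char *path_b)
{
    unsigned char buf_a[65536], buf_b[65536];
    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    bool same = (fa != NULL && fb != NULL);

    while (same) {
        size_t ra = fread(buf_a, 1, sizeof buf_a, fa);
        size_t rb = fread(buf_b, 1, sizeof buf_b, fb);
        if (ra != rb || memcmp(buf_a, buf_b, ra) != 0)
            same = false;
        else if (ra < sizeof buf_a)
            break;                      /* both files ended together */
    }
    if (fa) fclose(fa);
    if (fb) fclose(fb);
    return same;
}

/* Hashes already match; -Q stops here, the default keeps checking. */
static bool confirm_duplicate(const char *path_a, const char *path_b,
                              bool quick_mode)
{
    if (quick_mode)
        return true;                    /* -Q: trust the hash match */
    return same_bytes(path_a, path_b);  /* default: verify the bytes */
}
```

A hash collision is the danger the warning is about: under -Q, two different files with the same hash would be reported (and possibly acted on) as duplicates.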
I'm intrigued, though I haven't looked into it, as to why -Q comes out slower than -F; the results were repeatable across two runs.
11m45s with -QF
# jdupes -rFQ dir1 dir2 >dupes1
WARNING: -F/--compare-first will not report duplicates if one file is not in first directory
BIG FAT WARNING: -Q/--quick MAY BE DANGEROUS! Read the manual!
Scanning: 46878 files, 1323 items (in 2 specified)
real 11m45.436s
user 0m25.452s
sys 0m49.916s
# cat dupes1 | wc -l
25440
After removing the duplicates from folder[param1], it takes only 16 seconds to confirm that there are no duplicate files left to remove before backing up the rest - a speedup of roughly 35x over the same run without -F:
# ./jdupes -rF dir1 dir2
WARNING: -F/--compare-first will not report duplicates if one file is not in first directory
Scanning: 38421 files, 1323 items (in 2 specified)
real 0m15.643s
user 0m0.124s
sys 0m0.892s
# ./jdupes -r dir1 dir2
Scanning: 38421 files, 1323 items (in 2 specified)
real 9m8.656s
user 0m27.796s
sys 0m45.816s
13941
Bigger examples:
# time ./jdupes -rF dir1 dir2 dir3
WARNING: -F/--compare-first will not report duplicates if one file is not in first directory
Scanning: 1141783 files, 144643 items (in 3 specified)
real 5m8.015s
user 0m3.420s
sys 0m32.152s
# time ./jdupes -rF dir1 dir4
WARNING: -F/--compare-first will not report duplicates if one file is not in first directory
Scanning: 7550184 files, 358281 items (in 2 specified)
real 40m14.956s
user 0m18.488s
sys 6m5.864s
This seems like a combination of things that should be done with rsync and suboptimal data organization. If you want this, you really should consider using dupd instead of this program. It stores hashes in a persistent database across runs and is likely to be far more suited to your needs.
Possibly - I'll look further at dupd, although I saw it as an interactive rather than scriptable solution. This extra option covers my use case quite nicely, given that I'm not in control of the suboptimal (and dynamic) data organization; I was aiming for the best workaround. Thanks. (Just tried dupd - it is a bit RAM-heavy for my use case: https://github.com/jvirkki/dupd/issues/32)
(You can use dupd in a more interactive way which is encouraged by the docs, but you could also just take its output and script around it if you prefer.)
This may eventually be possible as a combination of planned features, but for now I'm closing this.