Check whether files in folder[param1] are present in folder[param2], without also searching for dupes within folder[param2]
The use case is a large "backup" folder and various "have I put this in my backup folder yet?" folders. For example, copying files from a smartphone onto a NAS. I know there are duplicates in the large "backup" folder. Some are deliberate. I don't want to know about them every time I check (something about rubbing my nose in past mistakes).
My solution is to add -F (--compare-first), which skips the comparison of files that appear only in directories after parameter 1: https://github.com/jyukumite/jdupes_mjs/commit/d710f920175f49d59d6a127b4e36c4a48afc1b6c (plus https://github.com/jyukumite/jdupes_mjs/commit/bfdcdc9192c58cd61689e4482803244463acb54a to fix a typo).
(The rather-difficult "caching hashes" feature would also have covered this, but I understand that's hard.)
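For what it's worth, the change boils down to a filter on which candidate pairs get compared at all. Here is a minimal sketch of that idea in C, not the code from the commits above; the `param_index` field is a hypothetical stand-in for however the scanner records which command-line parameter a file was found under:

```c
/* Sketch only: the -F/--compare-first idea, not the actual jdupes code. */

struct file_entry {
    const char *path;
    int param_index;   /* 0 = first directory on the command line (assumed field) */
};

/* With --compare-first enabled, skip any candidate pair in which neither
 * file came from the first directory, so duplicates that live purely in
 * the later directories are never compared or reported. */
static int want_comparison(const struct file_entry *a,
                           const struct file_entry *b,
                           int compare_first)
{
    if (!compare_first)
        return 1;
    return (a->param_index == 0 || b->param_index == 0);
}
```

That matches the stated intent: only pairs involving at least one file under parameter 1 are compared, so duplicates that sit entirely inside the backup folder stay out of the report.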
Original 22m6s
# jdupes -r dir1 dir2 >dupes4
Scanning: 46878 files, 1323 items (in 2 specified)
real 22m6.498s
user 0m52.972s
sys 1m42.872s
# cat dupes4 | wc -l
31208
13m38s with -F
# jdupes -rF dir1 dir2 >dupes3
WARNING: -F/--compare-first will not report duplicates if one file is not in first directory
Scanning: 46878 files, 1323 items (in 2 specified)
real 13m37.892s
user 0m30.584s
sys 0m54.948s
# cat dupes3 | wc -l
25440
17m2s with -Q (just use hashes)
# jdupes -rQ dir1 dir2 >dupes2
BIG FAT WARNING: -Q/--quick MAY BE DANGEROUS! Read the manual!
Scanning: 46878 files, 1323 items (in 2 specified)
real 17m1.530s
user 0m35.652s
sys 1m11.508s
# cat dupes2 | wc -l
31208
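(For context on the BIG FAT WARNING: quick mode accepts a matching hash as a duplicate and skips the byte-for-byte confirmation that normally follows. A rough sketch of that difference in C, not jdupes' actual code:)

```c
/* Sketch only: what -Q/--quick changes, not the actual jdupes code.
 * By this point the candidate files are assumed to have equal sizes
 * and equal hashes; the question is whether that is enough. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Byte-for-byte confirmation; false on any mismatch or open failure. */
static bool same_bytes(const char *path_a, const char *path_b)
{
    unsigned char buf_a[65536], buf_b[65536];
    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    bool same = (fa != NULL && fb != NULL);

    while (same) {
        size_t ra = fread(buf_a, 1, sizeof buf_a, fa);
        size_t rb = fread(buf_b, 1, sizeof buf_b, fb);
        if (ra != rb || memcmp(buf_a, buf_b, ra) != 0)
            same = false;
        else if (ra < sizeof buf_a)
            break;                      /* both files ended together */
    }
    if (fa) fclose(fa);
    if (fb) fclose(fb);
    return same;
}

/* Hashes already match; -Q stops here, the default keeps checking. */
static bool confirm_duplicate(const char *path_a, const char *path_b,
                              bool quick_mode)
{
    if (quick_mode)
        return true;                    /* -Q: trust the hash match */
    return same_bytes(path_a, path_b);  /* default: verify the bytes */
}
```

A hash collision is the danger the warning is about: under -Q, two different files with the same hash would be reported (and possibly acted on) as duplicates.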
I'm intrigued, though I haven't looked into it, as to why -Q comes out slower than -F; the results were repeatable across two runs.
11m45s with -QF
# jdupes -rFQ dir1 dir2 >dupes1
WARNING: -F/--compare-first will not report duplicates if one file is not in first directory
BIG FAT WARNING: -Q/--quick MAY BE DANGEROUS! Read the manual!
Scanning: 46878 files, 1323 items (in 2 specified)
real 11m45.436s
user 0m25.452s
sys 0m49.916s
# cat dupes1 | wc -l
25440
After removing the duplicates from folder[param1], it takes only 16 seconds to confirm that there are no duplicate files left to remove before backing up the rest - a speedup of roughly 35x over the same run without -F:
# ./jdupes -rF dir1 dir2
WARNING: -F/--compare-first will not report duplicates if one file is not in first directory
Scanning: 38421 files, 1323 items (in 2 specified)
real 0m15.643s
user 0m0.124s
sys 0m0.892s
# ./jdupes -r dir1 dir2
Scanning: 38421 files, 1323 items (in 2 specified)
real 9m8.656s
user 0m27.796s
sys 0m45.816s
13941
Bigger examples:
# time ./jdupes -rF dir1 dir2 dir3
WARNING: -F/--compare-first will not report duplicates if one file is not in first directory
Scanning: 1141783 files, 144643 items (in 3 specified)
real 5m8.015s
user 0m3.420s
sys 0m32.152s
# time ./jdupes -rF dir1 dir4
WARNING: -F/--compare-first will not report duplicates if one file is not in first directory
Scanning: 7550184 files, 358281 items (in 2 specified)
real 40m14.956s
user 0m18.488s
sys 6m5.864s
This seems like a combination of things that should be done with rsync and suboptimal data organization. If you want this, you really should consider using dupd instead of this program. It stores hashes in a persistent database across runs and is likely to be far more suited to your needs.
Possibly - I'll look further at dupd, although I saw it as an interactive rather than scriptable solution. This extra option covers my use case quite nicely, given that I'm not in control of the suboptimal (and dynamic) data organization; I was aiming for the best workaround. Thanks. (Just tried dupd - it is a bit RAM-heavy for my use case: https://github.com/jvirkki/dupd/issues/32)
(You can use dupd in a more interactive way which is encouraged by the docs, but you could also just take its output and script around it if you prefer.)
This may eventually be possible as a combination of planned features, but for now I'm closing this.