--hash-unmatched seems to scan the whole dataset, like --hash-uniques
rmlint version
- current rmlint develop (58d29ec1), v2.10.1-281-g58d29ec1
- two patches applied:
- revert of a575d6ef2646c1fac1abaa6df79efecaa88e02d9 to fix #596
- a patch to gui/setup.py to fix #608
dataset
I have a 30-something TB dataset that consists of ~20 TB of uniques and ~11 TB of size twins (files that share their size with at least one other file):
$ du -hs /mnt/data
32T /mnt/data
$ find /mnt/data -type f -printf '%s\n' | sort | uniq -c | awk -c '
    function bscalc(_in) { "bscalc -H " _in | getline _out; return _out; }
    $1 == 1 { nr_uniqs += $1; size_uniqs += $1 * $2; }
    $1 != 1 { nr_twins += $1; size_twins += $1 * $2; }
    END {
        printf "Uniques: total %d size %s\n", nr_uniqs, bscalc(size_uniqs);
        printf "Twins: total %d size %s\n", nr_twins, bscalc(size_twins);
    }'
Uniques: total 202799 size 19.76 TiB
Twins: total 3074218 size 11.78 TiB
actual behavior
Basic rmlint invocation without --hash-unmatched (ignore --without-fiemap, it is only there to speed up preprocessing; the progress bars were also trimmed):
$ rmlint -T df,dd -j --progress -o pretty -c sh:handler=clone --hidden --without-fiemap /mnt/data
Traversing (3276566 usable files / 0 + 0 ignored files / folders)
Preprocessing (reduces files to 3034739 / found 33504 other lint)
Matching (100 dupes of 63 originals; 12058,91 GB to scan in 3067241 files, ETA: 7d 14h 55m 44s)
^C
Control rmlint invocation with --hash-uniques:
$ rmlint -T df,dd -j --progress -o pretty -c sh:handler=clone --hidden --without-fiemap --hash-uniques /mnt/data
Traversing (3276566 usable files / 0 + 0 ignored files / folders)
Preprocessing (reduces files to 3237535 / found 33504 other lint)
Matching (7 dupes of 7 originals; 32301,25 GB to scan in 3270955 files, ETA: 108d 8h 40m 45s)
^C
Now, --hash-unmatched:
$ rmlint -T df,dd -j --progress -o pretty -c sh:handler=clone --hidden --without-fiemap --hash-unmatched /mnt/data
Traversing (3276566 usable files / 0 + 0 ignored files / folders)
Preprocessing (reduces files to 3237535 / found 33504 other lint)
Matching (7 dupes of 7 originals; 32301,25 GB to scan in 3270955 files, ETA: 120d 9h 31m 56s)
^C
expected behavior
Isn't --hash-unmatched supposed to scan only size twins (i.e. 12 TB at most)?
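To make the expectation concrete, here is a minimal sketch of the accounting I have in mind. It is not rmlint code: the function name `bytes_to_scan` and the `include_uniques` parameter are made up for illustration, and the mapping to the command-line options reflects my reading of them, not the actual implementation.

```python
from collections import defaultdict


def bytes_to_scan(sizes, include_uniques):
    """Sum the bytes a scan grouped by file size would have to read.

    sizes: iterable of file sizes in bytes.
    include_uniques: if True, files without a size twin are counted too
    (my reading of --hash-uniques); if False, only files that share
    their size with at least one other file are counted (my reading of
    the default mode and of --hash-unmatched).
    """
    # Count how many files exist for each distinct size.
    groups = defaultdict(int)
    for size in sizes:
        groups[size] += 1

    total = 0
    for size, count in groups.items():
        # A group with a single member has no size twin.
        if count > 1 or include_uniques:
            total += size * count
    return total


# Hypothetical toy dataset: two size groups plus two files with unique sizes.
sizes = [100, 100, 250, 4096, 4096, 4096, 7]
print(bytes_to_scan(sizes, include_uniques=False))  # 12488
print(bytes_to_scan(sizes, include_uniques=True))   # 12745
```

Under this model, the "to scan" figure for --hash-unmatched should match the first number (size twins only), not the second, which is what the output above shows.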
I can make --hash-unmatched do what it says on the tin with this code, but it feels hacky:
https://github.com/sahib/rmlint/blob/675089dee9453134d2347ef00222f5f6d1f30979/lib/shredder.c#L839-L842
I wonder if there is something else subtly wrong in the code.
It appears that when --hash-unmatched is used in an unmodified rmlint, this condition is responsible for hashing all the single-file groups:
https://github.com/sahib/rmlint/blob/675089dee9453134d2347ef00222f5f6d1f30979/lib/shredder.c#L855-L859
Could someone please explain what exactly is being done here, and what the idea behind this special case is?
Disregard the comment above (the suggested fix is wrong); see the proper analysis in the linked PR.