dwarfs
dwarfs copied to clipboard
Choice of hash for duplicate detection
It looks like dwarfs formerly used SHA1 and now uses XXH3-128 to detect duplicate files. Unfortunately, it looks like there is no confirmation that files are actually duplicates before discarding one of them. There are files in the wild with different contents but identical SHA1 hashes, and likely these are easy to create with XXH3-128 too. For these files, dwarfs would generate an image that does not extract to the original files.
This is somewhat of an out-there problem, but it would be nice for those of us who are paranoid to have an option to switch to a cryptographically secure hash such as SHA512-256, or disable duplicate detection completely. Disabling it would also be useful for situations where it is known there are no duplicates.
That's a fair point. I'll definitely consider adding a confirmation step (i.e. re-check all duplicates using SHA512-256). I don't know if there's any benefit to disabling duplicate detection given the speed of XXH3, but I'll look into it.
Duplicate detection is great for large numbers of small files, but I have several interesting use cases involving small numbers of large files and it is a big waste there.
For example I had a folder of 300x 2GiB files which were 99% similar. Dwarfs got this down to 5GiB in 2 hours with an appropriate back-search size, but the first hour was wasted doing I/O to hash the files I already knew were not duplicated. I also used path ordering to remove that overhead. I even have seen really good results running a single file through Dwarfs for VM images. That was on a mechanical RAID array, but even with today's fancy 2GiB/s NVMe devices, that's a decent amount of time wasted.
Perhaps it would be worth it to add logic to only hash an input file if there is another file of the same size? Trivially, files of different sizes have different hashes. That could save some I/O in the small file case too I think.
Sized based check is very useful — most dup finders like jdupes do that. For the recheck I suggest trying a faster cryptographic hash like the BLAKE2/3 family: good does not always equal slow.
I think I've addressed the suggestions from this issue in the wip branch:
-
The hash function can now be freely chosen using the
--file-hashoption:'./mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=xxh3-64' ran 1.00 ± 0.06 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=xxh3-128' 1.08 ± 0.05 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=md5' 1.08 ± 0.05 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha1' 1.11 ± 0.09 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=blake2b512' 1.17 ± 0.05 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha224' 1.19 ± 0.04 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha512-224' 1.19 ± 0.09 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=blake2s256' 1.19 ± 0.04 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha512-256' 1.19 ± 0.08 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha512' 1.20 ± 0.05 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=md5-sha1' 1.21 ± 0.07 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha384' 1.22 ± 0.06 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha256' 1.27 ± 0.07 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha3-224' 1.28 ± 0.08 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha3-256' 1.35 ± 0.06 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=shake256' 1.37 ± 0.06 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha3-384' 1.40 ± 0.44 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=shake128' 1.54 ± 0.05 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sm3' 1.66 ± 0.10 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha3-512' -
The deduplication algorithm now only hashes a file if there is another file of the same size. Files with unique sizes won't be hashed at all.
-
Deduplication can be completely disabled by passing
--file-hash=none.
I'd appreciate if you could try the new code with your use cases and see if there's an improvement.
Please check out the v0.7.0-RC1 release candidate: https://github.com/mhx/dwarfs/releases/tag/v0.7.0-RC1
Thanks for the continued work. I tried this and it works great.
A couple documentation nits which would be good to fix too:
- Clarify in the manpage that the default is still XXH3-128.
- Put in the manpage that the list of available hashes (appears to be?) dynamically pulled from OpenSSL and can be checked with
mkdwarfs -h - The "Internal Operation" section should be updated for the new behavior
Thanks for testing and for the feedback! I'll make the changes as suggested.
Done, I'm closing this.