dwarfs icon indicating copy to clipboard operation
dwarfs copied to clipboard

Choice of hash for duplicate detection

Open tpwrules opened this issue 3 years ago • 3 comments

It looks like dwarfs formerly used SHA1 and now uses XXH3-128 to detect duplicate files. Unfortunately, it looks like there is no confirmation that files are actually duplicates before discarding one of them. There are files in the wild with different contents but identical SHA1 hashes, and likely these are easy to create with XXH3-128 too. For these files, dwarfs would generate an image that does not extract to the original files.

This is somewhat of an out-there problem, but it would be nice for those of us who are paranoid to have an option to switch to a cryptographically secure hash such as SHA512-256, or disable duplicate detection completely. Disabling it would also be useful for situations where it is known there are no duplicates.

tpwrules avatar Jul 10 '22 01:07 tpwrules

That's a fair point. I'll definitely consider adding a confirmation step (i.e. re-check all duplicates using SHA512-256). I don't know if there's any benefit to disabling duplicate detection given the speed of XXH3, but I'll look into it.

mhx avatar Jul 11 '22 13:07 mhx

Duplicate detection is great for large numbers of small files, but I have several interesting use cases involving small numbers of large files and it is a big waste there.

For example I had a folder of 300x 2GiB files which were 99% similar. Dwarfs got this down to 5GiB in 2 hours with an appropriate back-search size, but the first hour was wasted doing I/O to hash the files I already knew were not duplicated. I also used path ordering to remove that overhead. I even have seen really good results running a single file through Dwarfs for VM images. That was on a mechanical RAID array, but even with today's fancy 2GiB/s NVMe devices, that's a decent amount of time wasted.

Perhaps it would be worth it to add logic to only hash an input file if there is another file of the same size? Trivially, files of different sizes have different hashes. That could save some I/O in the small file case too I think.

tpwrules avatar Jul 12 '22 00:07 tpwrules

Sized based check is very useful — most dup finders like jdupes do that. For the recheck I suggest trying a faster cryptographic hash like the BLAKE2/3 family: good does not always equal slow.

Artoria2e5 avatar Jul 28 '22 02:07 Artoria2e5

I think I've addressed the suggestions from this issue in the wip branch:

  • The hash function can now be freely chosen using the --file-hash option:

    './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=xxh3-64' ran
    1.00 ± 0.06 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=xxh3-128'
    1.08 ± 0.05 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=md5'
    1.08 ± 0.05 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha1'
    1.11 ± 0.09 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=blake2b512'
    1.17 ± 0.05 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha224'
    1.19 ± 0.04 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha512-224'
    1.19 ± 0.09 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=blake2s256'
    1.19 ± 0.04 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha512-256'
    1.19 ± 0.08 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha512'
    1.20 ± 0.05 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=md5-sha1'
    1.21 ± 0.07 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha384'
    1.22 ± 0.06 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha256'
    1.27 ± 0.07 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha3-224'
    1.28 ± 0.08 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha3-256'
    1.35 ± 0.06 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=shake256'
    1.37 ± 0.06 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha3-384'
    1.40 ± 0.44 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=shake128'
    1.54 ± 0.05 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sm3'
    1.66 ± 0.10 times faster than './mkdwarfs -i install-small -o /dev/null --force -l1 --file-hash=sha3-512'
    
  • The deduplication algorithm now only hashes a file if there is another file of the same size. Files with unique sizes won't be hashed at all.

  • Deduplication can be completely disabled by passing --file-hash=none.

I'd appreciate if you could try the new code with your use cases and see if there's an improvement.

mhx avatar Oct 28 '22 16:10 mhx

Please check out the v0.7.0-RC1 release candidate: https://github.com/mhx/dwarfs/releases/tag/v0.7.0-RC1

mhx avatar Nov 08 '22 13:11 mhx

Thanks for the continued work. I tried this and it works great.

A couple documentation nits which would be good to fix too:

  • Clarify in the manpage that the default is still XXH3-128.
  • Put in the manpage that the list of available hashes (appears to be?) dynamically pulled from OpenSSL and can be checked with mkdwarfs -h
  • The "Internal Operation" section should be updated for the new behavior

tpwrules avatar Nov 08 '22 16:11 tpwrules

Thanks for testing and for the feedback! I'll make the changes as suggested.

mhx avatar Nov 08 '22 20:11 mhx

Done, I'm closing this.

mhx avatar Nov 08 '22 20:11 mhx