
What does it mean, from a user's perspective, to use non-cryptographic hash functions?

stephane-archer opened this issue 2 years ago · 3 comments

I personally use md5sum a lot to find duplicate files. Is xxHash suitable for this application? Apart from secrets, is there anything xxHash should not be used for?

stephane-archer · Jul 26 '22 14:07

xxhash is a good choice for this application. Detecting duplicate files can be done safely. If the checksum is used only as a first-pass detector, and is paired with a full binary comparison of the content to confirm that the files really are identical, then there is no risk.
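For illustration, here is a minimal sketch of that detector-plus-verification pattern in Python, assuming the third-party `xxhash` package (any recent version with XXH3 support); the directory layout and function names are just placeholders:

```python
import filecmp
from collections import defaultdict
from pathlib import Path

import xxhash  # third-party package: pip install xxhash


def xxh128_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through XXH3-128 and return the hex digest."""
    h = xxhash.xxh3_128()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: Path):
    """Group files by checksum, then confirm each candidate pair byte-by-byte."""
    by_hash = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_hash[xxh128_of(path)].append(path)

    for paths in by_hash.values():
        if len(paths) < 2:
            continue
        reference = paths[0]
        for other in paths[1:]:
            # The checksum only narrows the candidate set; the binary
            # comparison is what removes any risk of a false positive.
            if filecmp.cmp(reference, other, shallow=False):
                yield reference, other


if __name__ == "__main__":
    for a, b in find_duplicates(Path(".")):
        print(f"duplicate: {a} == {b}")
```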

The problems start when one wants to use the checksum value alone as the final verdict for duplicate-file identification. By the pigeonhole principle, there necessarily exist different files that produce the same checksum. However, by the laws of probability, such an accidental event can be made astronomically unlikely. Typically, 128-bit values, such as those produced by md5sum or xxh128sum, do reach that territory: accidental collisions are so unlikely that a checksum match is as good as finding two identical files, and no additional check is necessary.
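As a rough way to put numbers on "astronomically unlikely": the birthday bound estimates the probability of at least one accidental collision among n uniformly distributed 128-bit values as roughly n × (n − 1) / 2^129. A quick back-of-the-envelope check (plain Python, nothing beyond that formula):

```python
# Approximate birthday-bound probability of at least one accidental
# collision among n uniformly distributed b-bit hash values:
#   p ≈ n * (n - 1) / 2**(b + 1)
def collision_probability(n: int, bits: int = 128) -> float:
    return n * (n - 1) / 2 ** (bits + 1)


# Even a billion files leaves the odds vanishingly small.
print(collision_probability(10**9))   # ~1.5e-21
print(collision_probability(10**12))  # ~1.5e-15
```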

Now, this only works for accidental collisions. That's where the "cryptographic" label comes into play. Cryptographic hash functions are supposed to make it impossible to find two files generating the same checksum value, even when trying very hard. But for non-cryptographic ones, it's possible (though difficult) for a malicious actor to manufacture two files which generate the same checksum value. On this account, note that even md5sum is already broken: it is already possible to generate two files with the same MD5 checksum value. These days, due to the large increase in computing and memory capabilities, 128-bit is no longer enough to reach that protection level anyway.

This difference only matters when one is not in control of the content of the files. For example, when files come from an external server, there is a risk that a malicious actor uses this vector to generate a collision. The effect of such a collision varies greatly depending on the implementation. If the checksum collision is followed by an actual binary comparison, then the collision will only trigger additional binary comparisons, and no loss of data will occur. On the other hand, if one of the two incorrectly labelled "duplicated" files is blindly erased, then there is a risk of data loss.

Cyan4973 · Jul 29 '22 10:07

Thank you for your excellent explanation. In my case I'm not worried about an attacker, and comparing the two files byte by byte is too costly (slow network). Is it safe to assume that xxh128sum and md5sum have the same chance of accidental collisions, or do cryptographic hash functions have better dispersion? Is the tradeoff only speed vs. security?

stephane-archer · Aug 06 '22 11:08

Is it safe to assume that xxh128sum and md5sum have the same chance of accidental collisions, or do cryptographic hash functions have better dispersion?

Essentially yes. Through (complex) measurements, it can even be argued that xxh128sum's dispersion is actually better than md5sum's. But the differences are unlikely to matter for your application.
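As a toy illustration of what dispersion means in practice (not the kind of rigorous measurement referred to above, which is the job of dedicated test suites), one can hash random inputs, truncate the digests to 32 bits, and compare the observed collision count with the birthday-bound expectation. This sketch assumes the third-party `xxhash` Python package:

```python
import os
from collections import Counter

import xxhash  # third-party: pip install xxhash

# Hash one million random inputs, keep only the low 32 bits of each
# XXH3-128 digest, and count how many inputs landed on an already-seen
# value. A well-dispersed hash should track the birthday-bound estimate.
N = 1_000_000
counts = Counter(
    xxhash.xxh3_128(os.urandom(16)).intdigest() & 0xFFFFFFFF
    for _ in range(N)
)
observed = N - len(counts)
expected = N * (N - 1) / 2 / 2**32  # ≈ 116 for one million inputs
print(f"observed 32-bit collisions: {observed}, expected ≈ {expected:.0f}")
```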

Is the tradeoff only speed/security?

Yes, though in this case, md5sum can't even brag about being "secure" anymore. So it doesn't provide any upside. For a "secure" hash, I would move to 256-bit and select a modern algorithm such as blake2/3.
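For what it's worth, Python's standard library already ships BLAKE2, so a 256-bit digest is a one-liner; BLAKE3 would require a third-party package. Shown here only to illustrate the suggestion:

```python
import hashlib

data = b"example content"

# 256-bit BLAKE2b digest, straight from the standard library.
digest = hashlib.blake2b(data, digest_size=32).hexdigest()
print(digest)
```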

Cyan4973 · Aug 06 '22 14:08