xxHash
From a user's perspective, what does it mean to use non-cryptographic hash functions?
I personally use md5sum a lot to find duplicate files. Is it suitable for this application? Apart from secrets, is there anything xxhash should not be used for?
xxhash is a good choice for this application.
Detecting duplicate files can be done safely. If the checksum is used merely as a detector, paired with a full binary verification of the content to check that files are actually identical, then there is no risk.
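For illustration, here is a minimal sketch of that detector step in Python, assuming the third-party xxhash package (pip install xxhash); the function names are mine, not part of any particular tool:

```python
import xxhash  # third-party binding: pip install xxhash

def xxh128_of(path, chunk_size=1 << 20):
    """Stream a file through XXH128 and return its hex digest."""
    h = xxhash.xxh128()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def group_by_digest(paths):
    """Group files by digest; each group is a set of duplicate *candidates*."""
    groups = {}
    for path in paths:
        groups.setdefault(xxh128_of(path), []).append(path)
    # Only groups with 2+ files need the follow-up binary verification.
    return {d: ps for d, ps in groups.items() if len(ps) > 1}
```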
The problems start when one wants to use the checksum value alone as the final verdict for duplicate file identification. By the pigeonhole principle, there necessarily exist different files that produce the same checksum. However, by the laws of probability, the chance of such an accidental event can be made astronomically low. Typically, 128-bit values, such as those from md5sum or xxh128sum, do reach that territory: accidental collisions are so unlikely that a matching checksum is as good as finding 2 identical files, no additional check necessary.
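To put a number on "astronomically low", here is a back-of-the-envelope birthday-bound estimate (my own illustration, with an assumed file count, not a figure from this thread):

```python
# Birthday-bound approximation: P(collision) ≈ n*(n-1) / 2^(b+1)
# for n hashed files and a b-bit digest, valid while the result is << 1.
n = 10**9          # one billion files (assumed for illustration)
b = 128            # digest size in bits for md5 / XXH128
p = n * (n - 1) / 2**(b + 1)
print(f"{p:.2e}")  # ≈ 1.47e-21: effectively never happens by accident
```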
Now, this only works for accidental collisions.
That's where the cryptographic label comes into play.
Cryptographic hash functions are supposed to make it impossible to find 2 files generating the same checksum value, even when trying very hard.
But for non-cryptographic ones, it's possible (though difficult) for a malicious actor to manufacture 2 files which generate the same checksum value.
On this account, note that even md5sum is already broken: it is already possible to generate 2 files with the same md5 checksum value. These days, due to the large increase in computing and memory capabilities, 128 bits are no longer enough to reach that level of protection.
This difference only matters when one is not in control of the files' content. For example, when files come from an external server, there is a risk that a malicious actor uses this vector to generate a collision. Now, the effect of this collision varies greatly depending on the implementation. If the checksum collision is followed by an actual binary comparison, then the collision will only trigger additional binary comparisons, and no loss of data will occur. On the other hand, if one of the 2 incorrectly labelled "duplicate" files is blindly erased, then there is a risk of data loss.
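To make that distinction concrete, a hedged sketch of the safe pattern: never erase on a digest match alone, confirm with a byte-for-byte comparison first (filecmp is from the Python standard library; the helper name is hypothetical):

```python
import filecmp
import os

def remove_if_duplicate(keep, candidate):
    """Erase `candidate` only after a full binary comparison confirms equality.

    A digest collision alone merely triggers this extra comparison;
    it can never cause data loss.
    """
    if filecmp.cmp(keep, candidate, shallow=False):
        os.remove(candidate)
        return True
    return False  # digests collided but contents differ: keep both files
```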
Thank you for your excellent explanation. In my case, I'm not worried about an attacker, and comparing the two files byte by byte is too costly (slow network).
Is it safe to assume that xxh128sum and md5sum have the same chance of accidental collisions, or do cryptographic hash functions have better dispersion? Is the tradeoff only speed/security?
Is it safe to assume that xxh128sum and md5sum have the same chance of accidental collisions or do cryptographic hash functions have better dispersion?
Essentially yes. Through (complex) measurements, it can even be argued that xxh128sum's dispersion is actually better than md5sum's. But the differences are unlikely to matter for your application.
Is the tradeoff only speed/security?
Yes, though in this case, md5sum can't even brag about being "secure" anymore, so it doesn't provide any upside.
For a "secure" hash, I would move to 256 bits and select a modern algorithm such as blake2/3.
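For instance, Python's standard hashlib already ships BLAKE2, so a 256-bit digest is one call away (BLAKE3 would need the third-party blake3 package); a small sketch:

```python
import hashlib

data = b"hello world"
# BLAKE2b truncated to a 256-bit (32-byte) digest, from the standard library.
digest = hashlib.blake2b(data, digest_size=32).hexdigest()
print(digest)
```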