hash_algos() docs should clarify which algos are cryptographic
Triggered by https://news-web.php.net/php.internals/124613. Thanks, @IMSoP!
hash_hmac() has a corresponding changelog entry:
https://github.com/php/doc-en/blob/feab22a6798fbb9137f9bbdb2b94ae0182cb950e/reference/hash/functions/hash-hmac.xml#L100
I think it's a good idea to also state that in the hash_algos() docs.
Maybe it is sufficient to clarify that hash_hmac_algos() lists these.
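To illustrate the relationship between the two lists, here is a minimal sketch that compares hash_algos() with hash_hmac_algos(). It assumes PHP 7.2 or later (where hash_hmac_algos() exists); the variable names are illustrative only, and the exact lists depend on the PHP build.

```php
<?php
// All algorithms accepted by hash().
$all = hash_algos();

// Algorithms accepted by hash_hmac(), i.e. the ones PHP treats as cryptographic.
$hmac = hash_hmac_algos();

// Whatever is left over is rejected by hash_hmac() as non-cryptographic.
$nonCrypto = array_values(array_diff($all, $hmac));

printf("%d algorithms in total, %d usable with HMAC\n", count($all), count($hmac));
echo "Not usable with HMAC: ", implode(', ', $nonCrypto), "\n";
```

Note that this only separates "cryptographic" from "non-cryptographic"; it says nothing about whether a listed algorithm is still considered safe.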
This would only solve a tiny portion of the problem I was pointing out.
- First, it assumes that the user knows what a "cryptographic hash" is, and when they should use one over the opposite (a "non-cryptographic hash"?)
- Second, it still leaves them with a list of 44 algorithms to choose from, and no guidance whatsoever
What's really needed is:
- An explanation of different hashing use cases, and terms like "cryptographic hash"
- An explanation of when to use hash(), hash_hmac(), or password_hash()
- A list or table with the available algorithms, giving more than just their names
- Guidance on which algorithms to avoid (here's where you can talk about the weaknesses of MD5 and SHA1!)
- Some kind of recommendation of what algorithm users should pick for common use cases, if they're not constrained by compatibility
hash_algos() docs should clarify which algos are cryptographic
I’m not sure it’s actually useful information; at least, it’s largely insufficient. For instance, md4 is “cryptographic”, but you shouldn’t use it for anything cryptography-related unless someone holds a gun to your head.
A common theme in the user-contributed notes for hash() was performance benchmarks, so it's probably worth adding some discussion of that (including why you may not even want the fastest algo). Also, if we're going to have a table of algo information in the documentation, the expected/maximum output size of each would be a good data point to add.
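As a rough sketch of how the output-size column of such a table could be generated rather than maintained by hand, the following hashes an empty string with every available algorithm and measures the raw digest length. The $bits array is a hypothetical name used only for this illustration.

```php
<?php
// Map each algorithm name to its digest size in bits.
$bits = [];
foreach (hash_algos() as $algo) {
    // The third argument requests raw binary output, so strlen() counts bytes.
    $bits[$algo] = strlen(hash($algo, '', true)) * 8;
}

// Sort by digest size, smallest first, keeping the algorithm names as keys.
asort($bits);

foreach ($bits as $algo => $size) {
    printf("%-14s %4d bits\n", $algo, $size);
}
```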
hash() and hash_hmac() should definitely have a common paragraph about their possible use in password situations with reference to password_hash().
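A minimal sketch of the distinction such a paragraph could draw, assuming a typical setup: hash() for plain digests, hash_hmac() for keyed message authentication, and password_hash() with password_verify() for storing passwords. The sample password, message, and variable names are made up for the example.

```php
<?php
$password = 'correct horse battery staple';

// Plain digest: fast and unsalted, so not suitable for storing passwords.
$digest = hash('sha256', $password);

// Password storage: salted and deliberately slow; the algorithm and cost
// are encoded in the returned string.
$stored = password_hash($password, PASSWORD_DEFAULT);
$loginOk  = password_verify($password, $stored);
$loginBad = password_verify('wrong guess', $stored);

// Message authentication: a keyed hash, compared in constant time.
$key = random_bytes(32);
$tag = hash_hmac('sha256', 'some message', $key);
$tagOk = hash_equals($tag, hash_hmac('sha256', 'some message', $key));

var_dump($loginOk, $loginBad, $tagOk);
```

Because password_hash() generates a fresh salt on every call, two calls with the same password return different strings, which is why verification has to go through password_verify() rather than a string comparison.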
(I deleted a bunch of the notes on hash(), there were quite a few that were just benchmarks from 5-10 years ago.)
I just want to be sure of something here: is the goal of this documentation to talk about the PHP functions and how they work, or is the goal to teach developers about how to implement their own version of cryptography?
@damianwadley What do you mean by "implement their own version"? I don't think anyone's expecting users to come up with new, novel, hashing algorithms.
What I am hoping for is some description beyond a name for the 60 different algorithms currently supported by hash(), with some explanation of why a user might want to use them, or why they should avoid them.
While I agree that the current documentation is somewhat insufficient, I wouldn't go too much into the details; perhaps we can find some good article(s) to link to, instead.
- An explanation of different hashing use cases, and terms like "cryptographic hash"
A short explanation might be in order, but certainly not a thorough treatment like on https://en.wikipedia.org/wiki/Hash_function or https://en.wikipedia.org/wiki/Cryptographic_hash_function.
- An explanation of when to use hash(), hash_hmac(), or password_hash()
ACK
- A list or table with the available algorithms, giving more than just their names
Hmm, maybe some rough categorization might be in order, but detailed explanation about every single algorithm seems out of scope of the PHP manual. Besides, it's already not easy to keep the simple list up to date.
- Guidance on which algorithms to avoid (here's where you can talk about the weaknesses of MD5 and SHA1!)
That's difficult. Depending on the use case, MD5 and SHA1 might still be fine (and sometimes just necessary for interoperability with already existing hashes). See https://en.wikipedia.org/wiki/Cryptographic_hash_function#Properties for details.
- Some kind of recommendation of what algorithm users should pick for common use cases, if they're not constrained by compatibility
That's difficult, again. Maybe we could attempt some rough categorization of the available algorithms.
A common theme in the user-contributed notes for hash() was performance benchmarks, so it's probably worth adding some discussion of that (including why you may not even want the fastest algo).
A rough explanation of the performance might make sense, but these benchmarks are pretty useless, in my opinion. After all, some of the algorithms may be implemented with SIMD instructions (with a fallback if these instructions are not available), a few might even have hardware support (e.g. https://github.com/php/php-src/pull/4108), and the implementations may change over time.
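Since the numbers depend on the CPU and the PHP build, anyone who cares about speed would have to measure on their own machine. A minimal sketch of such a measurement, assuming PHP 7.3 or later for hrtime(); the candidate list, payload size, and round count are arbitrary choices for illustration, not recommendations.

```php
<?php
// 1 MiB of random input, hashed repeatedly with each candidate algorithm.
$data = random_bytes(1 << 20);
$candidates = ['crc32b', 'md5', 'sha256'];
$rounds = 20;

$timings = [];
foreach ($candidates as $algo) {
    $start = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 0; $i < $rounds; $i++) {
        hash($algo, $data);
    }
    // Average time per round, converted to milliseconds.
    $timings[$algo] = (hrtime(true) - $start) / $rounds / 1e6;
}

foreach ($timings as $algo => $ms) {
    printf("%-8s %.3f ms per MiB\n", $algo, $ms);
}
```

The results are only meaningful for the machine and PHP version they were produced on.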
Hmm, maybe some rough categorization might be in order, but detailed explanation about every single algorithm seems out of scope of the PHP manual. Besides, it's already not easy to keep the simple list up to date.
I didn't say "detailed explanation", I said "some description beyond a name". The context being that multiple people are claiming that users should be using the hash() function, and choosing the right algorithm; and they don't seem keen on simply adding a function for sha256(), or whatever the "best" algorithm is. So I'm assuming there is more to say about the strengths and weaknesses of different algorithms, in which case we need to present that to users.
Maybe there are some algorithms that can just be labelled "rarely used, included for compatibility with other systems", but right now we don't even have that.
I'm not an expert on hash functions, so take the following with a huge grain of salt (and please correct me, if I'm wrong). As I see it, there are roughly three categories of hash functions:
- checksum algorithms (like crc32, adler32): These can be used to calculate checksums, for instance, to check for transmission errors. For this reason they are supposed to be very fast (and likely simple to implement). They might also be used if you need an integer value (since they require only a couple of bytes), e.g. for a very simple hash table implementation.
- other non-cryptographic algorithms (like fnv, murmur): These can be used to calculate hash values for hash tables (if you ever need to implement one yourself). They should be fast, but still yield a good distribution over arbitrary string inputs.
- cryptographic algorithms (like md5, sha*, blake*): These are supposed to yield hash values which are representative of their inputs, but are neither guessable (i.e. robust against "pre-image" attacks, i.e. you can't guess the input from the hash values), nor prone to collisions (i.e. two distinct inputs yield distinct hash values; that's basically the same as being representative). Their performance is of secondary concern. Of course, unguessability and collision resistance also depend on the entropy of their result (i.e. the number of relevant bits). E.g. a hash function which would have one bit of entropy would only be able to distinguish two "categories" of inputs, and would as such be severely prone to collisions (although it would be almost impossible to guess the input from the hash value). This is the reason that there are different variants of several of the hash algorithms: choose the necessary entropy as suitable. Some of the early cryptographic hash algorithms (such as md4, md5, sha1) have been proven to be prone to collision attacks, and as such may better be avoided, unless you use them in a way where this doesn't matter much (e.g. for caching, if you ensure that collisions won't be a problem).
So "usually" this boils down to:
- compatibility: use whatever algorithm is required
- cryptographic purposes: use sha2 or sha3
- checksums: use crc32* or adler32
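The three cases above might look like this in code. This is only a sketch: the payload is a made-up example, and the concrete variants (crc32b, sha256, sha3-256, md5) are one plausible reading of the categories, not an official recommendation.

```php
<?php
$payload = 'The quick brown fox jumps over the lazy dog';

// Checksum: short and fast, detects accidental corruption only.
$checksum = hash('crc32b', $payload);

// Cryptographic purposes: a member of the SHA-2 or SHA-3 family.
$sha2 = hash('sha256', $payload);
$sha3 = hash('sha3-256', $payload);

// Compatibility: whatever the other system requires, e.g. MD5 for legacy data.
$legacy = hash('md5', $payload);

printf("crc32b:   %s\nsha256:   %s\nsha3-256: %s\nmd5:      %s\n",
    $checksum, $sha2, $sha3, $legacy);
```

The crc32b variant is the one that matches PHP's built-in crc32() function, which is one reason to prefer it when interoperability with that function matters.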
@cmb69, I thought that was a good starting point for beefing up the introduction to the documentation for the hash extension! PR is just a draft, feel free to suggest changes and additions and maybe we can address some of the other areas that @IMSoP identified.
Quick note to not forget about it: maybe link to https://csrc.nist.gov/projects/hash-functions (see https://news-web.php.net/php.internals/124678).