smhasher icon indicating copy to clipboard operation
smhasher copied to clipboard

Bit pattern issue

Open polyminer1 opened this issue 2 years ago • 1 comments

https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash3.cpp#L255 At this line you initialize a 64 bit hash from a 32 bit seed. Effectively forcing the high 32bit to 0 for every call. The mixing will be more patterned down the line. The input seed should be 64bit to ensure a good mix overall.

polyminer1 avatar Sep 07 '23 20:09 polyminer1

Effectively forcing the high 32bit to 0 for every call

Nope. Firstly it has not really an "initialization" (if one carefully follows the mixing algorithm inside the blocks). Secondly it is not the case here, since 2 x 64 bits buffers get the seed in the low 32bits in: https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash3.cpp#L261-L262 But they get complexly mixed hereafter during hashing and finalization phase, so almost all 32bits of seed significantly distributed over the whole hash.

The input seed should be 64bit to ensure a good mix overall.

Then it will be not MM3 anymore.

Additionally note:

  1. MM3 is a non-cryptographic hash function, so its main goals are totally different and role of seed is secondary (see below for one of the purposes).
  2. Also don't confuse the seed with salt, which should be also mixed with the input using some cryptographically strong methods like HMAC and co.
  3. Therefore the seed is rather to select the specific computed value of hash function, for instance to distribute different types of objects with theoretically equal block representation across some hash table. (if two hashes use different seeds, they are very likely to compute distinct hash values for any also the same given input).
  4. Unsigned 32-bit seed provides 4 billions "unique" values (if hash function shows good distribution and few collisions), what is fully enough for purposes like described above typization (since 4 billions types are fully enough) etc.

And last but not least, you can have the "seed" as long as you need it - just prefix or suffix the input with some buffer (64 bits or larger) and calculate the hash for whole buffer (with increased length), or calculate it using chain mode (repeated calculation for data in parts), starting or ending with a buffer containing the seed of your arbitrary length.

sebres avatar Sep 07 '23 23:09 sebres