xxHash
xxHash copied to clipboard
New PerlinNoiseAV test in SMHasher
Sorry to write this directly as I'm an "interested person" having my own hash functions, but recently the claim "xxhash passes all SMHasher tests" became imprecise: I've invented a new Perlin Noise test that finds "holes" in several hashes, not just xxhash. Maybe you can fix this as xxhash is too popular to miss that fact. I'm not expecting to gain anything from my post; my komihash
hash function is relatively new, so needs a lot more time to gain traction in this very competitive "hash function" field.
Maybe you can also add komihash
to your tests, it's probably not as fast as xxhash3, though, but useful as a comparison to the "state-of-the-art" hash functions. (wyhash
comparison would be great as well)
https://github.com/avaneev/komihash https://github.com/rurban/smhasher
Could you point at the Perlin Noise test you refer to ?
The only one I could find in rurban's smhasher
is : https://github.com/rurban/smhasher/commit/64b7d10ea92701e707ecba5f6b0bacbdb3d98b8a
I'm not sure but this issue may suggest the following new result of 'PerlinNoiseAV' test:
https://github.com/rurban/smhasher/commit/9ca488454eda4dae9ac795147696135e5cdf7dbe
In these new results, 'Perlin Noise'
test is traditional perlin noise test.
https://github.com/rurban/smhasher/blob/9ca488454eda4dae9ac795147696135e5cdf7dbe/doc/xxh128.txt#L816
But 'PerlinNoise'
test contains new PerlinNoiseAV (AV variant
) test.
https://github.com/rurban/smhasher/blob/9ca488454eda4dae9ac795147696135e5cdf7dbe/doc/xxh128.txt#L859
I dug into this and found a few things:
- To answer your previous question, the problem is in PerlinNoiseAV: https://github.com/rurban/smhasher/blob/master/KeysetTest.h#L358
Testing collisions (128-bit) - Expected 0.0, actual 24116 (1711423593602593105674976100352.00x) (24116)
Testing collisions (high 64-bit) - Expected 0.0, actual 24116 (92776458911.34x) (24116) !!!!!
Testing collisions (high 32-bit) - Expected 1116.2, actual 25260 (22.63x) (24144) !!!!!
Testing collisions (high 25-37 bits) - Worst is 37 bits: 24162/34 (692.56x) !!!!!
Testing collisions (low 64-bit) - Expected 0.0, actual 24116 (92776458911.34x) (24116) !!!!!
Testing collisions (low 32-bit) - Expected 1116.2, actual 25187 (22.57x) (24071) !!!!!
Testing collisions (low 25-37 bits) - Worst is 37 bits: 24153/34 (692.30x) !!!!!
- That keyset consists entirely of messages with length of 16 through 38 bytes, consisting almost entirely of zeroes, with one nonzero 32-bit value at some location in the message.
- The collisions with xxh128 happen entirely on 16-byte messages, so the problem is in XXH3_len_9to16_128b.
- Visually, it appears that the collisions happen almost entirely when both the seed and the key are values with very few (2-4) bits set. For instance, here's one of the hash collisions, printed as bitsets:
hash=11111111111101000100000010101111001000111110101100110000001100001000010010111000001100101010110000111110000101100001111101110011
seed=00000000000000000000000100010000 key=00000001000000010000000000010000
seed=00000000000000000000000100010001 key=00000001000000010000000000010001
seed=00000001000000000000000100010000 key=00000000000000010000000000010000
seed=00000001000000000000000100010001 key=00000000000000010000000000010001
- A simple fix that doesn't seem to impact performance (at least on my my machine) is to improve the seed:
seed = XXH3_mul128_fold64(seed, PRIME64_3);
With this line added at the beginning of XXH3_len_9to16_128b, PerlinNoiseAV passes:
Testing AV variant, 128 count with 4 spacing, 4-12:
Testing collisions (128-bit) - Expected 0.0, actual 0 (0.00x)
Testing collisions (high 64-bit) - Expected 0.0, actual 0 (0.00x)
Testing collisions (high 32-bit) - Expected 1116.2, actual 1137 (1.02x) (21)
Testing collisions (high 25-37 bits) - Worst is 37 bits: 48/34 (1.38x)
Testing collisions (low 64-bit) - Expected 0.0, actual 0 (0.00x)
Testing collisions (low 32-bit) - Expected 1116.2, actual 1109 (0.99x) (-7)
Testing collisions (low 25-37 bits) - Worst is 35 bits: 140/139 (1.00x)
- In case any other length-specific functions struggle with seeds that don't have very many bits set, you can lift that
XXH3_mul128_fold64(seed, PRIME64_3)
all way up to XXH128 without (apparently) impacting performance on small messages or performance on SMHasher (apart from fixing this bug, of course). And, of course, callers can simply do this on their side without changing XXH128 at all.
Yes, a reasonable fix (wyhash did that as well recently) is to add an additional "input seed" shuffling round.