aHash Consider runtime detecting AES on x86 or using CRC

They recently defined microarchitecture levels for x86-64 (Wikipedia) but apparently decided against including AES in any of them. So it seems like applications may in the future start targeting these levels. Especially on Windows 11, which requires fairly recent CPUs, you can absolutely start using x86-64-v3. But as this does not include AES, ahash is still going to use the fallback algorithm. So with these levels being defined the way they are, I don't see ahash ever realistically using AES in an application that isn't just meant for self consumption (where you can use target-cpu=native). So it probably makes sense to look into runtime detection of the feature or potentially introducing a fallback to CRC, which is included in v2. Supposedly CRC is actually quite suitable for hashing: https://twitter.com/pcwalton/status/1460782519733788673

Nov 17 '21 10:11 CryZe

I am planning a refactoring of the way the the hasher is instantiated which I intended for use with specialization. But I suppose it could also be used for runtime feature detection. Is there a specific API that can be used to test for AES?

I had previously looked into CRC but dismissed it because it required 64bit and SSE4.2 which virtually guarantees that AES is available, and while low latency, it only operates on 32bits which limits throughput somewhat.

Nov 27 '21 20:11 tkaitchuck

Maybe you can detect AES with multiversion crate? Docs imply something like this might work:

#[target("x86_64]+aes")]
fn yourfunc() {}

At some level (maybe fairly low down, but somewhere) it prevents inlining, though, it sounds like.

Dec 09 '21 12:12 itamarst

I had previously looked into CRC but dismissed it because it required 64bit and SSE4.2 which virtually guarantees that AES is available

There seem to be a lot of exceptions in the line-up of Intel processors: https://en.wikipedia.org/wiki/AES_instruction_set#Intel I have an old Core-i7 Nehalem processor with CRC support, but not AES.

Feb 18 '22 04:02 ssvb

Yes. Nehalem is the unlucky one in the middle. Core, and Penryn have neither but Westmere and Sandy Bridge have both. So adding support for that is only a 1 year window of CPU manufacturing.

The bigger problem is that it's not a great mixer. I don't know how to use it to outperform folded multiply which it what it would use in the absence of AES instruction now.

Oct 31 '22 23:10 tkaitchuck

Westmere and Sandy Bridge have both

Budget variants of Sandy Bridge (labelled as "Core i3", "Pentium" or "Celeron") didn't have AES support. And Pentium G3420 (also without AES support) from the Haswell family is still being manufactured even today. Based on what I heard, cheap low end processors may sell in larger volumes than their expensive high end siblings, so I'm not sure what is the realistic percentage of processors in actual use in the hands of real users that don't have AES support right now.

Nov 01 '22 00:11 ssvb

The bigger problem is that it's not a great mixer. I don't know how to use it to outperform folded multiply which it what it would use in the absence of AES instruction now.

I'm a little bit concerned about the folded multiply, because it ruins everything if the random secret key happens to be zero. Some non-zero secret keys may result in a poor hashing quality too (so it's not just a 1 in 2^64 chance). I mentioned this earlier in another ticket too.

Nov 01 '22 01:11 ssvb

I'm a little bit concerned about the folded multiply, because it ruins everything if the random secret key happens to be zero.

Sortof. It's used in a more complicated way. Two blocks 64bit are read in, xored with two keys and then the resulting values are folded multiplied together. Then the result is combined with the internal state. So if for any block of the input the plaintext is the same as the secret key then the adjacent block's value will no effect on the output. This means that if you can guess either the first or second 64 bits of the key you can produce an unlimited number of hash collisions because the adjacent 64 bits are effectively ignored. This is not likely to cause a problem for random large strings, and is too improbable to brute force. However if there were some other vulnerability, it could be used to turn it into an attack.

A 'poor quality' update you allude to in the case of a non-zero value is not a concern. To see why, suppose for example that by chance all but one bit in one of the numbers was zero. Then the output would be the other input rotated to start at the position of the one set bit. So even with only one bit set every bit in the output can be affected. This resulting value is combined with the state in a way such that even inputs with a fixed differential cannot be canceled later. (IE: A small difference in one block can't be undone with a small difference in a later block.) Once all the blocks have been combined the finalizer should mix the state well.

If there is a better option, I would be interested in trying it. Looking at _mm_crc32_u64, the best option would be to do something like:

let buffer: [u32;2] = self.buffer.convert();
let buffer[0] = crc(buffer[0], new_data + key1);
let buffer[1] = crc(buffer[1], new_data + key1);
self.buffer = buffer.convert();

The problems with this are:

Getting a 64 bit state requires a lot of awkward manipulation (not shown above) because the instruction takes a 64 bit value but only uses 32 bits of it.
The quality is terrible. - So additional mixing beyond this is needed
Even this version is more than a factor of 2x slower the current fallback.
Due to the nature of CRC reordering of blocks causes collisions. (IE: Hash(A, B, C, D) == Hash(C, D, A, B)) - This means the keys used to mask the input need to change between each block in a non-predictable way.

Is there a better way to go about this? From my perspective the quality issues could be fixed but getting the performance on par is probably impossible because it costs the same as a multiply but only works with 32bits of state.

Nov 02 '22 02:11 tkaitchuck

For reference if folded multiply is not available it now uses this: https://github.com/tkaitchuck/aHash/blob/master/src/operations.rs#L20 Which is 2 multiplies and 5 other single cycle instructions several of which can use instruction level parallelism.

So at the very least any CPU specific version, will need to out perform this most generic version.

Nov 02 '22 03:11 tkaitchuck

Is there a specific API that can be used to test for AES?

is is_x86_feature_detected not good enough?

At the very least changing the documentation to urge users to do runtime detection would be good

Nov 18 '22 22:11 TheIronBorn

aHash aHash copied to clipboard

Consider runtime detecting AES on x86 or using CRC

aHash
aHash copied to clipboard