VAES support
Vectorized AES can process more than one block at a time, greatly improving throughput, but it doesn't appear to be used by the aes crate yet, which is unfortunate.
Note that AES-NI can already process more than one block at a time by leveraging instruction-level parallelism (ILP).
We have separate benchmarks for the serial aes*_block methods vs the parallel aes*_blocks methods, where you can see the performance difference.
That said, it would probably be good to add VAES support for microarchitectures where it provides performance benefits beyond what's possible with ILP.
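The ILP point can be illustrated with a toy sketch. To be clear, nothing below is real AES or the aes crate's API: `toy_round`, `encrypt_one`, and `encrypt_blocks8` are made-up placeholder names. The shape of the loop is what matters: several independent blocks form independent dependency chains that a superscalar CPU can overlap, which is why a `*_blocks`-style API can beat one-block-at-a-time processing even without VAES.

```rust
// Toy sketch, NOT real AES: `toy_round` is a made-up placeholder round
// function and the function names are hypothetical. The point is the loop
// structure: 8 independent blocks give the CPU 8 independent dependency
// chains to overlap (ILP), even with plain per-block AES-NI instructions.
fn toy_round(block: u64, round_key: u64) -> u64 {
    (block ^ round_key).rotate_left(13).wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

fn encrypt_one(mut block: u64, round_keys: &[u64]) -> u64 {
    for &k in round_keys {
        block = toy_round(block, k);
    }
    block
}

fn encrypt_blocks8(blocks: &mut [u64; 8], round_keys: &[u64]) {
    for &k in round_keys {
        // All 8 updates in this inner loop are independent of each other,
        // which is what exposes instruction-level parallelism.
        for b in blocks.iter_mut() {
            *b = toy_round(*b, k);
        }
    }
}

fn main() {
    let keys = [7u64, 11, 13, 17];
    let mut blocks = [1u64, 2, 3, 4, 5, 6, 7, 8];
    let expected: Vec<u64> = blocks.iter().map(|&b| encrypt_one(b, &keys)).collect();
    encrypt_blocks8(&mut blocks, &keys);
    assert_eq!(blocks.to_vec(), expected);
    println!("ok");
}
```

The batched form computes exactly the same result as the serial form; only the scheduling opportunity for the hardware differs.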
Absolutely, I am aware of that. I am already using the *_blocks methods, and there is indeed a difference.
Those methods are the reason I looked inside, wondering whether the implementation is also VAES-capable on top of the instruction-level parallelism.
Unfortunately, I currently do not have access to a machine with AVX-512, so I cannot work on this myself.
If someone works on this, it could also be worthwhile to add VPCLMULQDQ support to the polyval crate.
It shouldn't be difficult to get a server VM with AVX-512 support; I can help with that if you'd like.
@newpavlov polyval is/was written in a way that LLVM will already use VPCLMULQDQ when the target supports it: https://github.com/RustCrypto/universal-hashes/pull/44
...though perhaps we could be explicit about it.
I also have several servers with AVX-512 support that I can test on.
@tarcieri I think it should be a relatively straightforward implementation? You would need to experiment with the number of 512-bit blocks processed in parallel, because the VAES instructions may have a different latency/throughput ratio and the compiler can use 32 registers instead of 16. I think the optimal number will be bigger than the 8 blocks used in the AES-NI backend. IIUC, VAES does not have instructions for key expansion, so you can adapt the existing AES-NI-based code for that.
The only annoying part should be plumbing in the autodetection module, but we can leave that for later, since VAES support will be gated either way (the relevant intrinsics are Nightly-only right now).
I may draft a PR this weekend or during the next week, if you do not get to it before then.
polyval is/was written in a way that LLVM will already use VPCLMULQDQ when the target supports it:
I thought about using the parallelism (VPCLMULQDQ can process four 128-bit blocks at once), but I forgot that GHASH/POLYVAL is inherently sequential, so we cannot utilize it while processing a single message...
@newpavlov I opened a separate issue for VPCLMULQDQ here, I think that should be (potentially) fairly easy: https://github.com/RustCrypto/universal-hashes/issues/184
POLYVAL/GHASH can be broken down into a parallelizable portion and a sequential portion... there's an accumulation of the output that is inherently sequential, but multiplication of the inputs can be performed in parallel.
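A minimal sketch of that split, with plain integers mod a toy prime standing in for the real GF(2^128) field arithmetic (the function names and the modulus are stand-ins, not anything from the polyval crate; the algebraic identity is the same in either ring): the Horner-style update `y = (y + x_i) * h` unrolls into a sum of independent products, and only the final accumulation is sequential.

```rust
// Sketch of the parallel/sequential split in POLYVAL/GHASH. The Horner form
//     y = (y + x_i) * h
// unrolls to
//     y = x_1*h^n + x_2*h^(n-1) + ... + x_n*h,
// so the per-block multiplications are independent; only the final sum is
// sequential. Plain integers mod a toy prime stand in for GF(2^128) here.
const P: u128 = 1_000_000_007; // toy modulus, NOT the GHASH/POLYVAL field

fn mul(a: u128, b: u128) -> u128 {
    a * b % P
}

// Sequential (Horner) form: one multiply per block, each depending on the last.
fn horner(blocks: &[u128], h: u128) -> u128 {
    blocks.iter().fold(0, |y, &x| mul((y + x) % P, h))
}

// Parallel form: precompute h^1..h^n once, then every x_i * h^(n-i+1)
// product is independent -- the kind of work wide VPCLMULQDQ lanes can
// chew through -- and only the final sum is a sequential reduction.
fn parallel(blocks: &[u128], h: u128) -> u128 {
    let n = blocks.len();
    let mut pow = vec![1u128; n + 1];
    for i in 1..=n {
        pow[i] = mul(pow[i - 1], h);
    }
    blocks
        .iter()
        .enumerate()
        .map(|(i, &x)| mul(x, pow[n - i])) // i is 0-based, so this is h^(n-i+1) for x_(i+1)
        .fold(0, |acc, t| (acc + t) % P)
}

fn main() {
    let blocks = [3u128, 1, 4, 1, 5, 9, 2, 6];
    let h = 123_456_789;
    assert_eq!(horner(&blocks, h), parallel(&blocks, h));
    println!("ok");
}
```

Both forms compute the same value; the parallel form just trades a precomputed table of powers of h for independence between the multiplies.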
Funnily enough, a few years later I'm back at this issue after seeing XMM registers instead of ZMM registers when feeding 8 blocks at a time to encrypt_blocks/decrypt_blocks.
If there is a need for access to the hardware, I can create an AVX-512-capable VM on my workstation (AMD Zen 4).
At least on Intel CPUs, the throughput is 1 CPI for the 512-bit (4-block) encryption/decryption instructions, which is very appealing to me: that should be 2x faster than doing 1 block at a time at 0.5 CPI.
There was a PR to add VAES support, but the author decided to close it when it was almost ready to merge, for whatever reason: #396
With AVX-512 support's FCP having just been completed, it's probably time to revisit that: https://github.com/rust-lang/rust/issues/111137#issuecomment-2911088418
There was a PR to add VAES support, but the author decided to close it when it was almost ready to merge, for whatever reason: #396
The reason was that it sat there for 6 months with no proper review, despite my making all the requested changes.
You also never actually said you would merge it, or laid out what would be required to get it to a merge-ready state, so from my point of view "almost ready" wasn't a thing.
You and @newpavlov had way more than enough opportunity to respond and follow through if you actually valued the contribution.
@silvanshade I'm sorry we didn't get the PR reviewed in a timely enough manner for you. This is a large project with many moving pieces run by volunteers in their spare time, which leads to relatively slow development cycles. The latest set of breaking releases have taken years. Getting a PR through our review process requires some patience.
That said, the PR had received considerable review at the time you closed it. It was waiting for one final pass prior to merging, and had you not closed it, it probably would've been merged by now.
Here is another opportunity to merge it: https://github.com/RustCrypto/block-ciphers/pull/482
@silvanshade thank you!
We merged #482, and #491 is migrating the implementation to intrinsics.