SVE AES on aarch64
aarch64 has wider AES instructions (the SVE2 AES extension), analogous to VAES on x86-64. GitHub Actions runners such as ubuntu-24.04-arm support them, which should help with testing:
```
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 4
  On-line CPU(s) list:  0-3
Vendor ID:              ARM
  Model name:           Neoverse-N2
    Model:              0
    Thread(s) per core: 1
    Core(s) per socket: 4
    Socket(s):          1
    Stepping:           r0p0
    BogoMIPS:           2000.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm uscat ilrcpc flagm ssbs sb paca pacg dcpodp sve2 sveaes svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16
```
The Rust standard library doesn't have intrinsics for these yet, though:
- https://github.com/rust-lang/rust-project-goals/issues/270
- https://github.com/rust-lang/stdarch/pull/1509
- https://github.com/rust-lang/rust/pull/118917
AFAIK SVE is a "vector" extension similar to the V extension in RISC-V. Unfortunately, there are unresolved issues with properly supporting such extensions in higher-level programming languages like C/C++/Rust, because vector sizes vary at runtime (it involves introducing weird "runtime sized" types with a plethora of associated issues). In our trait APIs we also hard-code the assumption that the number of blocks processed in parallel is known at compile time for every supported backend, which is simply not true for SVE. The best we could do in the near term is to hard-code several of the most common sizes, i.e. we would have several SVE backends, each with a different number of blocks processed in parallel.
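For illustration, here is a minimal sketch of that assumption (the trait and type names are hypothetical simplifications, not the actual `cipher` crate API): each backend bakes its parallel block count into a compile-time constant, so a vector-length-agnostic SVE backend has to commit to a fixed width up front.

```rust
/// Simplified model of how backends currently declare parallelism:
/// the count is a compile-time constant (the real traits express it
/// via typenum, but the idea is the same).
trait ParBlocksBackend {
    /// Number of 16-byte AES blocks processed per call.
    const PAR_BLOCKS: usize;
    fn proc_par_blocks(&mut self, blocks: &mut [[u8; 16]]);
}

/// A hypothetical fixed-width SVE backend must commit to one width,
/// e.g. 256-bit vectors => 2 AES blocks per vector register.
struct SveAes256Backend;

impl ParBlocksBackend for SveAes256Backend {
    const PAR_BLOCKS: usize = 2;
    fn proc_par_blocks(&mut self, blocks: &mut [[u8; 16]]) {
        debug_assert_eq!(blocks.len(), Self::PAR_BLOCKS);
        // The SVE2 AES rounds would go here; there are no stable Rust
        // intrinsics for them yet.
        unimplemented!()
    }
}
```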
I think that is fair; from my understanding, hardware implementations typically support some maximum width natively, so having something like 256/512/1024 will cover a lot of ground.
SVE support in Rust is progressing nicely, so hopefully in the not-so-distant future there will be a nicer way to handle this.
There was a PR: https://github.com/RustCrypto/block-ciphers/pull/403/
> Unfortunately, there are unresolved issues with properly supporting such extensions in higher-level programming languages like C/C++/Rust, because vector sizes vary at runtime (it involves introducing weird "runtime sized" types with a plethora of associated issues).
Using the C intrinsics for VLA programming is somewhat awkward, but based on the work I did in previous PRs, I don't think that's really a problem for implementing SVE (or RVV) backends if there's interest.
> In our trait APIs we also hard-code the assumption that the number of blocks processed in parallel is known at compile time for every supported backend, which is simply not true for SVE.
I do not think it would take much to change the cipher API to adapt better to VLA. The block chunking mechanism could be changed to query the backend for its current optimal size characteristics, rather than sending fixed-length chunks.
Both SVE and RVV have mechanisms to query the host for this information on demand.
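For example, on AArch64 the current vector length can be read at runtime with the `cntb` instruction. A minimal sketch using inline assembly (the function names are illustrative; stable Rust intrinsics for this don't exist yet):

```rust
/// Query the runtime SVE vector length in bytes.
/// CNTB with the default ALL pattern returns the vector length in bytes.
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "sve")]
unsafe fn sve_vector_len_bytes() -> u64 {
    let vl: u64;
    core::arch::asm!("cntb {0}", out(reg) vl, options(nomem, nostack));
    vl
}

/// Illustrative helper: how many 16-byte AES blocks fit in one SVE vector.
#[cfg(target_arch = "aarch64")]
fn aes_blocks_per_vector() -> Option<usize> {
    if std::arch::is_aarch64_feature_detected!("sve") {
        // SAFETY: the `sve` feature was just detected at runtime.
        let vl = unsafe { sve_vector_len_bytes() } as usize;
        Some(vl / 16)
    } else {
        None
    }
}
```

On the Neoverse-N2 runner above (128-bit SVE2 vectors) this reports 1 block per vector; a 256-bit implementation would report 2. RVV exposes the same information via `vsetvl`/`vlenb`.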
In fact, I think going in that direction might open up other opportunities for improvement on non-VLA architectures, given the growing number of situations where CPU topology and the related feature set are dynamic, e.g. asymmetric core types (performance vs. efficiency cores) or different caching strategies (depending on whether the preferred CCD has 3D V-Cache).
The backends are already queried dynamically for SIMD feature detection anyway, so in a way it's a natural extension of that.
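To make that concrete, one rough shape such a query could take (purely illustrative, not a proposed design for the `cipher` traits):

```rust
/// Illustrative only: a backend that reports its preferred degree of
/// parallelism at runtime instead of encoding it in the type system.
trait DynParBlocksBackend {
    /// Preferred number of blocks per call for the current hardware;
    /// an SVE or RVV backend could derive this from the runtime vector
    /// length, while a fixed-width backend just returns a constant.
    fn preferred_par_blocks(&self) -> usize;

    /// Process a chunk whose length the caller picks based on
    /// `preferred_par_blocks`.
    fn proc_blocks(&mut self, blocks: &mut [[u8; 16]]);
}
```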
> The best we could do in the near term is to hard-code several of the most common sizes, i.e. we would have several SVE backends, each with a different number of blocks processed in parallel.
This would probably be good enough for most uses, given that in practice the available vector lengths are limited by actual hardware implementations, which tend not to vary widely. Probably 4 or 5 fixed-size implementations would cover almost everything available.
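For instance, the runtime vector length could select one of a handful of pre-built fixed-width backends along these lines (hypothetical helper, just to illustrate the coverage):

```rust
/// Map the runtime SVE vector length (in bytes) to the parallelism of
/// one of a few pre-built fixed-width backends.
fn par_blocks_for_vector_len(vl_bytes: usize) -> usize {
    match vl_bytes / 16 {
        n if n >= 16 => 16, // 2048-bit vectors
        n if n >= 8 => 8,   // 1024-bit
        n if n >= 4 => 4,   // 512-bit
        n if n >= 2 => 2,   // 256-bit
        _ => 1,             // 128-bit, effectively NEON-sized
    }
}
```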
> I do not think it would take much to change the cipher API to adapt better to VLA. The block chunking mechanism could be changed to query the backend for its current optimal size characteristics, rather than sending fixed-length chunks.
I thought about it, but I was unable to come up with a satisfactory design with current Rust capabilities. Note that the design should enable efficient composition of crates (e.g. aes, ctr, ghash). Either way, we should discuss potential designs in the traits repo. We may include such a redesign in a future breaking release, but I don't think we should do it in the upcoming one.
Yeah, it sounds like an interesting idea, but this release series has been so long in the tooth that I'm a little afraid of doing any sort of major redesign right now.
Sounds like a great thing to explore with the next set of breaking releases of the traits crates.
I agree that it would make sense to wait until after the next release to consider a hypothetical trait redesign.