chacha20: add an AVX-512 backend
This adds an AVX-512 backend to chacha20. Inputs of 1 KiB and larger get roughly 1.5x to 1.9x the AVX-2 throughput, at the cost of a ~5-20% performance loss on very short (16-byte) inputs. See benchmarks below.
It is largely based on the AVX-2 backend, but with a bit of tuning to get better performance on medium-length inputs.
I spent some time tuning the PAR_BLOCKS parameter and found that a value of 16 (compared to 4 for AVX-2) produced the highest throughput for large input sizes. This gives the most instruction-level parallelism (ILP) achievable without register spilling, thanks to the larger register file in AVX-512.
I added special tail handling to improve performance on inputs shorter than 1024 bytes.
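To make the sizes concrete: a ChaCha block is 64 bytes, so with PAR_BLOCKS = 16 one full-width pass covers 16 × 64 = 1024 bytes, and anything shorter falls to the tail path. Below is a minimal sketch of that split. The helper `split_input` and the constants `BLOCK_SIZE` and `WIDE_BYTES` are hypothetical names used for illustration only; the actual backend may divide the tail differently.

```rust
/// ChaCha block size in bytes (fixed by the cipher).
const BLOCK_SIZE: usize = 64;
/// Blocks per full-width pass, the value chosen for AVX-512 in this PR.
const PAR_BLOCKS: usize = 16;
/// Bytes covered by one full-width pass: 16 * 64 = 1024.
const WIDE_BYTES: usize = PAR_BLOCKS * BLOCK_SIZE;

/// Hypothetical helper (not part of the crate): split an input length into
/// the number of full 16-block passes and the number of blocks left over
/// for the tail path.
fn split_input(len: usize) -> (usize, usize) {
    let wide_passes = len / WIDE_BYTES;
    let tail_bytes = len % WIDE_BYTES;
    // Round up: a partial final block still needs a whole keystream block.
    let tail_blocks = tail_bytes.div_ceil(BLOCK_SIZE);
    (wide_passes, tail_blocks)
}

fn main() {
    // The four input sizes used by the benchmarks in this PR.
    for len in [16, 256, 1024, 16 * 1024] {
        let (wide, tail) = split_input(len);
        println!("{len:>5} bytes: {wide} full passes, {tail} tail blocks");
    }
}
```

Under this sketch, the 16-byte and 256-byte benchmark inputs are handled entirely by the tail path, while the 1 KiB and 16 KiB inputs consist only of full passes.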
The performance loss on short inputs seems to be due to LLVM making different inlining decisions into the benchmark loop. I'm not sure if this matters much outside a microbenchmark context.
Benchmarks
On a Ryzen 7950X (Zen 4):
| benchmark | AVX-2 throughput | AVX-512 throughput | speedup |
|---|---|---|---|
| chacha20_bench1_16b | 666 MB/s | 640 MB/s | 0.96x |
| chacha20_bench2_256b | 3011 MB/s | 3240 MB/s | 1.08x |
| chacha20_bench3_1kib | 3390 MB/s | 6243 MB/s | 1.84x |
| chacha20_bench4_16kib | 3488 MB/s | 6603 MB/s | 1.89x |
| chacha12_bench1_16b | 941 MB/s | 800 MB/s | 0.85x |
| chacha12_bench2_256b | 4491 MB/s | 4830 MB/s | 1.08x |
| chacha12_bench3_1kib | 5446 MB/s | 9142 MB/s | 1.68x |
| chacha12_bench4_16kib | 5746 MB/s | 10076 MB/s | 1.75x |
| chacha8_bench1_16b | 1066 MB/s | 1000 MB/s | 0.94x |
| chacha8_bench2_256b | 6243 MB/s | 6564 MB/s | 1.05x |
| chacha8_bench3_1kib | 7937 MB/s | 12190 MB/s | 1.54x |
| chacha8_bench4_16kib | 8458 MB/s | 13664 MB/s | 1.62x |
On a Xeon Gold 6530 (Emerald Rapids):
| benchmark | AVX-2 throughput | AVX-512 throughput | speedup |
|---|---|---|---|
| chacha20_bench1_16b | 333 MB/s | 280 MB/s | 0.84x |
| chacha20_bench2_256b | 1430 MB/s | 1802 MB/s | 1.26x |
| chacha20_bench3_1kib | 1587 MB/s | 2723 MB/s | 1.72x |
| chacha20_bench4_16kib | 1645 MB/s | 2925 MB/s | 1.78x |
| chacha12_bench1_16b | 444 MB/s | 355 MB/s | 0.80x |
| chacha12_bench2_256b | 2206 MB/s | 2694 MB/s | 1.22x |
| chacha12_bench3_1kib | 2566 MB/s | 3864 MB/s | 1.51x |
| chacha12_bench4_16kib | 2728 MB/s | 4573 MB/s | 1.68x |
| chacha8_bench1_16b | 484 MB/s | 421 MB/s | 0.87x |
| chacha8_bench2_256b | 2694 MB/s | 3555 MB/s | 1.32x |
| chacha8_bench3_1kib | 3543 MB/s | 5505 MB/s | 1.55x |
| chacha8_bench4_16kib | 3868 MB/s | 6425 MB/s | 1.66x |
This also requires raising the MSRV to 1.89, the first release with stable AVX-512 intrinsics support.
~~Just realized the tests don't actually check the AVX-512 version since the test vectors are too short, so marking as draft while I add more tests.~~
Added tests for this now.
> Also, requires updating MSRV to 1.89 for stable AVX-512 intrinsics support.
I would prefer we make a stable release for rand_core v0.10 before merging this.
Yeah, I don't think this is something we should enable right away, and it would be good to have an initial release with a 1.85 MSRV.
Maybe it could be gated by a cfg, similar to how the AVX-512 functionality is gated in the aes crate?
Sounds good. I've updated the PR to gate the implementation behind a chacha20_avx512 cfg, and added a CI job that tests AVX-512 (copied from aes's VAES-512 config).
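For readers unfamiliar with cfg gating, here is a rough sketch of the general pattern, assuming only the chacha20_avx512 cfg name mentioned above. The function `backend_name` is hypothetical and stands in for the real backend selection, which presumably also depends on the target architecture and on CPU feature detection.

```rust
/// Compiled only when the build passes `--cfg chacha20_avx512`.
/// (Hypothetical stand-in for the real AVX-512 backend entry point.)
#[cfg(chacha20_avx512)]
fn backend_name() -> &'static str {
    "avx512"
}

/// Default path: without the cfg, the AVX-512 code is not compiled at all,
/// so it cannot raise the toolchain requirement of a default build.
#[cfg(not(chacha20_avx512))]
fn backend_name() -> &'static str {
    "avx2"
}

fn main() {
    println!("selected backend: {}", backend_name());
}
```

A default build of this sketch selects the "avx2" path. Building with `RUSTFLAGS="--cfg chacha20_avx512"` switches to the gated path. Recent toolchains warn about cfg names they do not know, so a real crate would normally also declare the name to the compiler's check-cfg mechanism.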