This adds an AVX-512 backend to chacha20. It gives major speedups (roughly 1.5-1.9x) on inputs of 1 KiB and larger, at the cost of a ~5-20% performance loss on very short (16-byte) inputs. See benchmarks below.

It is largely based on the AVX-2 backend, but with a bit of tuning to get better performance on medium-length inputs.

I spent some time tuning the PAR_BLOCKS parameter and found that a value of 16 (compared to 4 for AVX-2) produced the highest throughput for large input sizes. This achieves the highest ILP without spilling registers to the stack, thanks to the larger register file in AVX-512 (32 vector registers versus 16 for AVX-2).

I added special tail handling to get better performance on sizes less than 1024 bytes.
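To make the size threshold concrete, here is a minimal sketch of how an input length could split into full wide batches and a tail. It assumes a batch is PAR_BLOCKS (16) ChaCha blocks of 64 bytes each, i.e. 1024 bytes, which would explain why inputs under 1024 bytes never fill a batch. `Plan` and `plan` are hypothetical names for illustration only, not identifiers from this PR, and the actual tail path may be organized differently.

```rust
/// ChaCha block size in bytes.
const BLOCK_SIZE: usize = 64;
/// Blocks processed per wide batch; 16 is the value reported above.
const PAR_BLOCKS: usize = 16;

/// Hypothetical description of how a buffer splits into work units.
#[derive(Debug, PartialEq)]
struct Plan {
    /// Number of complete PAR_BLOCKS-wide batches (1024 bytes each).
    full_batches: usize,
    /// Complete 64-byte blocks left over after the full batches.
    tail_blocks: usize,
    /// Bytes left over after the last complete block.
    tail_bytes: usize,
}

/// Split `len` bytes into full batches, whole tail blocks, and a partial block.
fn plan(len: usize) -> Plan {
    let batch_bytes = PAR_BLOCKS * BLOCK_SIZE;
    let rest = len % batch_bytes;
    Plan {
        full_batches: len / batch_bytes,
        tail_blocks: rest / BLOCK_SIZE,
        tail_bytes: rest % BLOCK_SIZE,
    }
}
```

Under these assumptions, the 16-byte and 256-byte benchmark sizes produce zero full batches and are handled entirely by the tail path, while the 1 KiB case is exactly one full batch.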

The performance loss on short inputs seems to be due to LLVM making different inlining decisions into the benchmark loop. I'm not sure if this matters much outside a microbenchmark context.

Benchmarks

On a Ryzen 7950X (Zen 4):

benchmark	AVX-2 throughput	AVX-512 throughput	speedup
chacha20_bench1_16b	666 MB/s	640 MB/s	0.96x
chacha20_bench2_256b	3011 MB/s	3240 MB/s	1.08x
chacha20_bench3_1kib	3390 MB/s	6243 MB/s	1.84x
chacha20_bench4_16kib	3488 MB/s	6603 MB/s	1.89x
chacha12_bench1_16b	941 MB/s	800 MB/s	0.85x
chacha12_bench2_256b	4491 MB/s	4830 MB/s	1.08x
chacha12_bench3_1kib	5446 MB/s	9142 MB/s	1.68x
chacha12_bench4_16kib	5746 MB/s	10076 MB/s	1.75x
chacha8_bench1_16b	1066 MB/s	1000 MB/s	0.94x
chacha8_bench2_256b	6243 MB/s	6564 MB/s	1.05x
chacha8_bench3_1kib	7937 MB/s	12190 MB/s	1.54x
chacha8_bench4_16kib	8458 MB/s	13664 MB/s	1.62x

On a Xeon Gold 6530 (Emerald Rapids):

benchmark	AVX-2 throughput	AVX-512 throughput	speedup
chacha20_bench1_16b	333 MB/s	280 MB/s	0.84x
chacha20_bench2_256b	1430 MB/s	1802 MB/s	1.26x
chacha20_bench3_1kib	1587 MB/s	2723 MB/s	1.72x
chacha20_bench4_16kib	1645 MB/s	2925 MB/s	1.78x
chacha12_bench1_16b	444 MB/s	355 MB/s	0.80x
chacha12_bench2_256b	2206 MB/s	2694 MB/s	1.22x
chacha12_bench3_1kib	2566 MB/s	3864 MB/s	1.51x
chacha12_bench4_16kib	2728 MB/s	4573 MB/s	1.68x
chacha8_bench1_16b	484 MB/s	421 MB/s	0.87x
chacha8_bench2_256b	2694 MB/s	3555 MB/s	1.32x
chacha8_bench3_1kib	3543 MB/s	5505 MB/s	1.55x
chacha8_bench4_16kib	3868 MB/s	6425 MB/s	1.66x

Nov 01 '25 23:11 caelunshun

Also, this requires updating the MSRV to 1.89 for stable AVX-512 intrinsics support.

Nov 01 '25 23:11 caelunshun

~~Just realized the tests don't actually check the AVX-512 version since the test vectors are too short, so marking as draft while I add more tests.~~

Added tests for this now.

Nov 02 '25 19:11 caelunshun

> Also, requires updating MSRV to 1.89 for stable AVX-512 intrinsics support.

I would prefer we make a stable release for rand_core v0.10 before merging this.

Nov 03 '25 08:11 dhardy

Yeah, I don't think this is something we should enable right away and it would be good to have an initial release with a 1.85 MSRV.

Maybe it could be gated by a cfg, similar to how the AVX-512 functionality is gated in the aes crate?

Nov 03 '25 13:11 tarcieri

Sounds good. I've updated the PR to gate the implementation behind a chacha20_avx512 cfg, and added a CI test for AVX-512 (copied from aes's VAES-512 config).
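For readers unfamiliar with this kind of gating, here is a rough sketch of the general pattern: a custom cfg (typically supplied by the builder, e.g. through `RUSTFLAGS='--cfg chacha20_avx512'`) compiles the wide path in, and a runtime check decides whether to use it. The `Backend` enum and `select_backend` function are hypothetical names, and checking only `avx512f` is an assumption; the features the PR actually detects are not stated in this thread.

```rust
/// Hypothetical backend tag; not an identifier from the chacha20 crate.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Backend {
    Avx512,
    Fallback,
}

/// Pick a backend. The AVX-512 branch only exists in the binary when the
/// crate is built with `--cfg chacha20_avx512` on an x86 target.
fn select_backend() -> Backend {
    #[cfg(all(chacha20_avx512, any(target_arch = "x86", target_arch = "x86_64")))]
    {
        // Runtime CPU check; `avx512f` alone is an assumption in this sketch.
        if std::arch::is_x86_feature_detected!("avx512f") {
            return Backend::Avx512;
        }
    }
    Backend::Fallback
}
```

Without the cfg, the gated block is removed at compile time and `select_backend()` always returns `Backend::Fallback`, so a default build never touches the AVX-512 intrinsics and can keep the lower MSRV.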

Nov 03 '25 15:11 caelunshun

chacha20: add an AVX-512 backend
