AES-CTR: 0.9.0-rc.2 is slower than 0.8.4 on AVX2-only CPUs
I built a benchmark tool that measures AES-CTR throughput on an 8 KiB buffer using various versions of the `aes` crate.
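For reference, here is a minimal sketch of the kind of measurement loop the tool performs (this is not the tool's actual code; the key, IV, buffer contents, iteration count, and the `Ctr128BE` choice are placeholders, and the real tool is in the repository linked below):

```rust
use std::time::Instant;

use aes::Aes128;
use cipher::{KeyIvInit, StreamCipher};

// AES-128 in CTR mode with a 128-bit big-endian counter.
type Aes128Ctr = ctr::Ctr128BE<Aes128>;

fn main() {
    let key = [0x42u8; 16]; // placeholder key
    let iv = [0x24u8; 16]; // placeholder IV
    let mut buf = vec![0u8; 8 * 1024]; // 8 KiB buffer, as in the benchmark
    let iters: u64 = 100_000;

    let mut cipher = Aes128Ctr::new(&key.into(), &iv.into());
    let start = Instant::now();
    for _ in 0..iters {
        cipher.apply_keystream(&mut buf);
    }
    let secs = start.elapsed().as_secs_f64();
    let mib = (iters * buf.len() as u64) as f64 / (1024.0 * 1024.0);
    println!("throughput: {:.2} MiB/s", mib / secs);
}
```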
I noticed a significant slowdown between 0.8.4 and 0.9.0-rc.2. I believe it is related to inlining in autodetect.rs and to the switch from 8 to 9 blocks per run. I drafted a patch here that restores 8 blocks per run and reverts the wrappers in autodetect.rs to the versions used in 0.8.4. The VAES code is still there, i.e. the fix is not a breaking change.
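To illustrate what I mean by wrappers that facilitate inlining, here is a schematic of the dispatch shape (this is not the crate's actual autodetect.rs code; `Cipher`, `hw_encrypt`, and `soft_encrypt` are invented stand-ins, and only the `cpufeatures` calls are the real API):

```rust
// Schematic only: not the aes crate's actual autodetect.rs.
cpufeatures::new!(cpuid_aes, "aes");

pub struct Cipher {
    token: cpuid_aes::InitToken,
}

impl Cipher {
    pub fn new() -> Self {
        // The CPUID check runs once; the token caches the result.
        let (token, _present) = cpuid_aes::init_get();
        Self { token }
    }

    // The wrapper at issue: if this call is not inlined, every block
    // (or batch of blocks) pays an outlined call plus register spills.
    #[inline]
    pub fn encrypt_block(&self, block: &mut [u8; 16]) {
        if self.token.get() {
            hw_encrypt(block) // AES-NI path (elided)
        } else {
            soft_encrypt(block) // portable fallback (elided)
        }
    }
}

#[inline]
fn hw_encrypt(block: &mut [u8; 16]) {
    block[0] ^= 1; // stand-in for the intrinsics
}

#[inline]
fn soft_encrypt(block: &mut [u8; 16]) {
    block[0] ^= 1; // stand-in for the fixsliced implementation
}
```

The cached feature check itself costs almost nothing; the performance hinges on whether the compiler can inline the wrapper into the hot loop.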
Below are performance numbers on two machines.
One machine (Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz):
| Version | Avg (MiB/s) | Median (MiB/s) | Min (MiB/s) | Max (MiB/s) |
|---|---|---|---|---|
| 0.8.4 | 3373.41 | 3478.85 | 2683.36 | 3677.39 |
| 0.9.0-rc.2 | 2338.59 | 2393.62 | 2066.86 | 2459.26 |
| fix | 3598.64 | 3713.41 | 2730.50 | 3864.82 |
Another machine (AMD EPYC-Milan Processor):
| Version | Avg (MiB/s) | Median (MiB/s) | Min (MiB/s) | Max (MiB/s) |
|---|---|---|---|---|
| 0.8.4 | 7637.36 | 8301.11 | 3398.54 | 8330.20 |
| 0.9.0-rc.2 | 4451.80 | 4979.17 | 2435.00 | 4986.76 |
| fix | 7601.95 | 8267.81 | 3375.63 | 8278.00 |
All binaries were built with `cargo build --release`.
To reproduce, build and run the following commits of my benchmark tool:
- 0.8.4 https://github.com/starius/rust-aes-ctr-bench/commit/f680618f6020dc7f7314d17ed2d42b5a04d8b3e1
- 0.9.0-rc.2 https://github.com/starius/rust-aes-ctr-bench/commit/2ef76fc7e05829742670d3cdfd29724da899699a
- fix https://github.com/starius/rust-aes-ctr-bench/commit/1dc16db49336120a662212fed37447f4b57016fe
The only difference between them is the versions of the `aes` and `ctr` crates used.
I'm attaching the flamegraph generated for version 0.9.0-rc.2. It shows that 24% of the time is spent in `<cipher::stream::wrapper::StreamCipherCoreWrapper<T> as cipher::stream::StreamCipher>::try_apply_keystream_inout`.
> I think we should probably go back to 8 blocks
Going back to 8 blocks was the first thing I tried. It brought some speedup, but most of the gains came from changing the wrappers in autodetect.rs to facilitate inlining.
One more thing: until I added the following to my Cargo.toml, even my patch didn't help:
```toml
[profile.release]
codegen-units = 1
lto = "thin"
```
Without these settings, the build produced a binary in which the calls in the wrappers were not inlined (I checked with objdump), and it was still slow.
Could you take a look at the inlining, please? It seems to be the key to the performance here.
If you're getting speedups from `codegen-units = 1`, there's a good chance there are missed inlining opportunities.
It's something we've investigated elsewhere.
> I'm attaching the flamegraph generated for version 0.9.0-rc.2.
The flamegraph shows that `encrypt_par` did not get inlined. This causes the generated CTR counter and keystream blocks to be spilled to the stack instead of staying in XMM registers. Could you check whether adding `#[inline]` or `#[inline(always)]` to it has a significant effect?
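For example, here is a mock of where the attribute would go (the real `encrypt_par` is internal to the `aes` crate and has a different signature; this is illustration only):

```rust
// Mock only: stands in for the aes crate's internal parallel-block helper.
#[inline(always)] // goal: keep all 8 counter/keystream blocks in XMM registers at the call site
fn encrypt_par(blocks: &mut [[u8; 16]; 8]) {
    for block in blocks.iter_mut() {
        block[0] ^= 1; // stand-in for the AES rounds
    }
}
```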