AES-CTR: 0.9.0-rc.2 is slower than 0.8.4 on AVX2-only CPUs
I built a benchmark tool that measures AES-CTR throughput on an 8 KiB buffer using various versions of the `aes` crate.
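For reference, here is a minimal sketch of the kind of measurement loop the tool performs (this is not the tool's actual code; the key, IV, buffer contents, iteration count, and the `Ctr128BE` choice are placeholders, and the real tool is in the repository linked below):

```rust
use std::time::Instant;

use aes::Aes128;
use cipher::{KeyIvInit, StreamCipher};

// AES-128 in CTR mode with a 128-bit big-endian counter.
type Aes128Ctr = ctr::Ctr128BE<Aes128>;

fn main() {
    let key = [0x42u8; 16]; // placeholder key
    let iv = [0x24u8; 16]; // placeholder IV
    let mut buf = vec![0u8; 8 * 1024]; // 8 KiB buffer, as in the benchmark
    let iters: u64 = 100_000;

    let mut cipher = Aes128Ctr::new(&key.into(), &iv.into());
    let start = Instant::now();
    for _ in 0..iters {
        cipher.apply_keystream(&mut buf);
    }
    let secs = start.elapsed().as_secs_f64();
    let mib = (iters * buf.len() as u64) as f64 / (1024.0 * 1024.0);
    println!("throughput: {:.2} MiB/s", mib / secs);
}
```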
I noticed a significant slowdown between 0.8.4 and 0.9.0-rc.2. I believe it is related to inlining in autodetect.rs and to the switch from 8 to 9 blocks per run. I drafted a patch here that restores 8 blocks per run and reverts the wrappers in autodetect.rs to the versions used in 0.8.4. The VAES code is still there, i.e. the fix is not a breaking change.
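To illustrate what I mean by wrappers that facilitate inlining, here is a schematic of the dispatch shape (this is not the crate's actual autodetect.rs code; `Cipher`, `hw_encrypt`, and `soft_encrypt` are invented stand-ins, and only the `cpufeatures` calls are the real API):

```rust
// Schematic only: not the aes crate's actual autodetect.rs.
cpufeatures::new!(cpuid_aes, "aes");

pub struct Cipher {
    token: cpuid_aes::InitToken,
}

impl Cipher {
    pub fn new() -> Self {
        // The CPUID check runs once; the token caches the result.
        let (token, _present) = cpuid_aes::init_get();
        Self { token }
    }

    // The wrapper at issue: if this call is not inlined, every block
    // (or batch of blocks) pays an outlined call plus register spills.
    #[inline]
    pub fn encrypt_block(&self, block: &mut [u8; 16]) {
        if self.token.get() {
            hw_encrypt(block) // AES-NI path (elided)
        } else {
            soft_encrypt(block) // portable fallback (elided)
        }
    }
}

#[inline]
fn hw_encrypt(block: &mut [u8; 16]) {
    block[0] ^= 1; // stand-in for the intrinsics
}

#[inline]
fn soft_encrypt(block: &mut [u8; 16]) {
    block[0] ^= 1; // stand-in for the fixsliced implementation
}
```

The cached feature check itself costs almost nothing; the performance hinges on whether the compiler can inline the wrapper into the hot loop.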
Below are performance numbers on two machines.
One machine (Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz):
| Version | Avg (MiB/s) | Median (MiB/s) | Min (MiB/s) | Max (MiB/s) |
|---|---|---|---|---|
| 0.8.4 | 3373.41 | 3478.85 | 2683.36 | 3677.39 |
| 0.9.0-rc.2 | 2338.59 | 2393.62 | 2066.86 | 2459.26 |
| fix | 3598.64 | 3713.41 | 2730.50 | 3864.82 |
Another machine (AMD EPYC-Milan Processor):
| Version | Avg (MiB/s) | Median (MiB/s) | Min (MiB/s) | Max (MiB/s) |
|---|---|---|---|---|
| 0.8.4 | 7637.36 | 8301.11 | 3398.54 | 8330.20 |
| 0.9.0-rc.2 | 4451.80 | 4979.17 | 2435.00 | 4986.76 |
| fix | 7601.95 | 8267.81 | 3375.63 | 8278.00 |
All binaries were built with `cargo build --release`.
To reproduce, build and run the following commits of my benchmark tool:
- 0.8.4 https://github.com/starius/rust-aes-ctr-bench/commit/f680618f6020dc7f7314d17ed2d42b5a04d8b3e1
- 0.9.0-rc.2 https://github.com/starius/rust-aes-ctr-bench/commit/2ef76fc7e05829742670d3cdfd29724da899699a
- fix https://github.com/starius/rust-aes-ctr-bench/commit/1dc16db49336120a662212fed37447f4b57016fe
The only difference between them is the versions of the `aes` and `ctr` crates used.
I'm attaching the flamegraph generated for version 0.9.0-rc.2. It shows that 24% of the time is spent in `<cipher::stream::wrapper::StreamCipherCoreWrapper<T> as cipher::stream::StreamCipher>::try_apply_keystream_inout`.
> I think we should probably go back to 8 blocks
Going back to 8 blocks was the first thing I tried. It brought some speedup, but most of the gains came from changing the wrappers in autodetect.rs to facilitate inlining.
One more thing: until I added the following to my Cargo.toml, even my patch didn't help:
```toml
[profile.release]
codegen-units = 1
lto = "thin"
```
Without these settings, the build produced a binary in which the calls in the wrappers were not inlined (I checked with objdump), and it was still slow.
Could you take a look at the inlining, please? It seems to be the key to the performance here.
If you're getting speedups from `codegen-units = 1`, there's a good chance there are missed inlining opportunities.
It's something we've investigated elsewhere.
> I'm attaching the flamegraph generated for version 0.9.0-rc.2.
The flamegraph shows that `encrypt_par` did not get inlined. This causes the generated CTR counter and keystream blocks to be spilled to the stack instead of staying in XMM registers. Could you check whether adding `#[inline]` or `#[inline(always)]` to it has a significant effect?
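For example, here is a mock of where the attribute would go (the real `encrypt_par` is internal to the `aes` crate and has a different signature; this is illustration only):

```rust
// Mock only: stands in for the aes crate's internal parallel-block helper.
#[inline(always)] // goal: keep all 8 counter/keystream blocks in XMM registers at the call site
fn encrypt_par(blocks: &mut [[u8; 16]; 8]) {
    for block in blocks.iter_mut() {
        block[0] ^= 1; // stand-in for the AES rounds
    }
}
```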