Rage is 38% slower at encrypting than Go implementation
EDIT: Current performance
Just tried to encrypt a random 2 GB file: 5.37s with Rust vs 1.07s with Go.
Go is not great in general, and that includes performance, so we could probably do better with rage!
Note: I plan on adding `core::arch` and `packed_simd` optimizations to the `chacha20` and `poly1305` crates soon.
Yeah, most of the performance difference is that `rage`'s cryptographic dependencies are essentially pure-Rust at this point, while `age` is using the Go standard library, which includes assembly for basically everything. See also #38, which was necessary because Go's `scrypt` is around 64x faster due to having SHA-2 assembly.
The other place to look for optimisation is my implementation of STREAM. Currently, encrypting each chunk involves an allocation because I am not using the `Aead::encrypt_in_place` API. We could instead allocate a ciphertext-sized buffer inside `StreamWriter` and then track how much plaintext we are writing into it.
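A minimal sketch of that buffered approach, assuming the in-place API from recent `chacha20poly1305`/`aead` crate versions; the `StreamWriter` shown here is illustrative, not rage's actual type:

```rust
use chacha20poly1305::{
    aead::{AeadInPlace, KeyInit},
    ChaCha20Poly1305, Key, Nonce,
};

const CHUNK_SIZE: usize = 64 * 1024;
const TAG_SIZE: usize = 16;

struct StreamWriter {
    aead: ChaCha20Poly1305,
    // Reused for every chunk: plaintext accumulates here and is then sealed
    // in place, so no per-chunk allocation is needed.
    buffer: Vec<u8>,
}

impl StreamWriter {
    fn new(key: &Key) -> Self {
        StreamWriter {
            aead: ChaCha20Poly1305::new(key),
            // Room for one chunk of plaintext plus the 16-byte Poly1305 tag,
            // so sealing never reallocates.
            buffer: Vec::with_capacity(CHUNK_SIZE + TAG_SIZE),
        }
    }

    /// Buffers plaintext for the current chunk.
    fn write(&mut self, plaintext: &[u8]) {
        debug_assert!(self.buffer.len() + plaintext.len() <= CHUNK_SIZE);
        self.buffer.extend_from_slice(plaintext);
    }

    /// Seals the buffered plaintext in place (the tag is appended to the
    /// same `Vec`) and returns this chunk's ciphertext. The caller clears
    /// `buffer` before starting the next chunk.
    fn seal_chunk(&mut self, nonce: &Nonce) -> &[u8] {
        self.aead
            .encrypt_in_place(nonce, b"", &mut self.buffer)
            .expect("sealing failed");
        &self.buffer
    }
}
```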
Yeah, STREAM seems to be the slowest part here. Multicore optimizations should also speed things up massively. See the gist for a tiny and very performant example of Rust threads.
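The gist itself isn't preserved here; as a tiny stand-in, here's what fanning a computation out across cores with std threads can look like (`std::thread::scope` requires Rust 1.63+):

```rust
use std::thread;

fn main() {
    let data: Vec<u64> = (0..1_000_000).collect();
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    let chunk_len = data.len() / workers + 1;

    // Scoped threads may borrow `data` directly, with no Arc or cloning.
    let total: u64 = thread::scope(|s| {
        data.chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<u64>()))
            // Collect first so every thread starts before we block on join.
            .collect::<Vec<_>>()
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .sum()
    });

    assert_eq!(total, 999_999 * 1_000_000 / 2);
}
```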
I ran a brief test on my laptop, of the form:
```
time head -c 2147483648 /dev/urandom | cargo run --release -- -r age1somerecipient >/dev/null
```
Switching to `Aead::encrypt_in_place` (instead of letting it allocate a new ~64 kiB `Vec` for every chunk) does not speed up encryption at all (before and after are both around 15.5 seconds to encrypt 2 GiB on my laptop). I'll hunt for other possible hotspots in my code, but I expect that the necessary performance work is on the upstream crates.
Oh heh, looks like my benchmarks were being limited by the speed of `/dev/urandom`: I tested `age` and measured the same 15.5 seconds. Switching to `/dev/zero` I get:
```
$ time head -c 2147483648 /dev/zero | go run ./cmd/age -r age1somerecipient >/dev/null

real    0m2.536s
user    0m2.596s
sys     0m1.242s

$ time head -c 2147483648 /dev/zero | cargo run --release -- -r age1somerecipient >/dev/null
    Finished release [optimized] target(s) in 0.10s
     Running `target/release/rage -r age1somerecipient`

real    0m9.313s
user    0m9.404s
sys     0m1.456s

$ # Apply patch
$ time head -c 2147483648 /dev/zero | cargo run --release -- -r age1somerecipient >/dev/null
    Finished release [optimized] target(s) in 0.09s
     Running `target/release/rage -r age1somerecipient`

real    0m9.200s
user    0m9.230s
sys     0m1.511s
```
Still no difference using `Aead::encrypt_in_place` (the minor delta between unpatched and patched was within the system noise), but I see `rage` being around 4x slower than `age`.
All right, squeaky wheel gets the grease. The `chacha20` crate was previously running at ~3.5 cpb on my laptop with the SSE2 backend. I rewrote the buffering logic and added a new AVX2 backend which can compute two ChaCha20 blocks in parallel. I've got it down to ~1.4 cpb now.
Will double check I didn't break anything and cut a new release soon, then bump the `chacha20poly1305` crate.
`chacha20poly1305` v0.3.1 is out with the AVX2 backend, so all you should need to do is `cargo update` and then build with the following `$RUSTFLAGS`:

```
RUSTFLAGS="-Ctarget-feature=+avx2"
```
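As an aside (an assumption about workflow, not something from this thread): instead of exporting `RUSTFLAGS` on every invocation, the flag can be set persistently in a project-level `.cargo/config.toml`:

```toml
# Equivalent to RUSTFLAGS="-Ctarget-feature=+avx2" for builds in this project.
[build]
rustflags = ["-C", "target-feature=+avx2"]
```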
Benchmarking the full AEAD construction shows encryption is ~60% faster, and decryption is unchanged.
Note that there's still some low-hanging fruit, like a SIMD implementation of Poly1305, and pipelining the execution of ChaCha20 and Poly1305 so they can execute in parallel.
I adapted the `chacha20` benchmark to rage, and get between 9.8 and 10.3 cycles per byte on current master. Before vs after `cargo update`:
```
stream/encrypt/131072   time:   [1328643.4735 cycles 1362445.9702 cycles 1406940.9323 cycles]
                        thrpt:  [10.7341 cpb 10.3946 cpb 10.1367 cpb]
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  6 (6.00%) high mild
  3 (3.00%) high severe
---
stream/encrypt/131072   time:   [1113173.3014 cycles 1131564.0177 cycles 1152718.4918 cycles]
                        thrpt:  [8.7945 cpb 8.6331 cpb 8.4928 cpb]
                 change:
                        time:   [-19.391% -17.546% -15.723%] (p = 0.00 < 0.05)
                        thrpt:  [+18.656% +21.279% +24.055%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  2 (2.00%) low mild
  7 (7.00%) high mild
  5 (5.00%) high severe
```
Before `cargo update` vs after with `RUSTFLAGS="-Ctarget-feature=+avx2"`:
```
stream/encrypt/131072   time:   [1281326.5182 cycles 1287772.9972 cycles 1296258.5135 cycles]
                        thrpt:  [9.8897 cpb 9.8249 cpb 9.7757 cpb]
---
stream/encrypt/131072   time:   [739731.5694 cycles 743151.4588 cycles 747047.6420 cycles]
                        thrpt:  [5.6995 cpb 5.6698 cpb 5.6437 cpb]
                 change:
                        time:   [-42.466% -42.006% -41.531%] (p = 0.00 < 0.05)
                        thrpt:  [+71.030% +72.433% +73.811%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe
```
I've opened #58 with the benchmark and the dependency update.
More measurements of the improvement on my desktop (i7-8700K CPU @ 3.70GHz).
Before `cargo update` (e78c6a24) vs after `cargo update` (eee96f4c2f):
```
Benchmarking stream/encrypt/131072: Collecting 100 samples in estimated 6.6372 s (20k iter
stream/encrypt/131072   time:   [1212026.6085 cycles 1214296.5898 cycles 1217916.5950 cycles]
                        thrpt:  [9.2920 cpb 9.2643 cpb 9.2470 cpb]
Found 15 outliers among 100 measurements (15.00%)
  8 (8.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
---
Benchmarking stream/encrypt/131072: Collecting 100 samples in estimated 5.4919 s (20k iter
stream/encrypt/131072   time:   [1002757.7861 cycles 1002970.1976 cycles 1003166.3307 cycles]
                        thrpt:  [7.6536 cpb 7.6521 cpb 7.6504 cpb]
                 change:
                        time:   [-17.660% -17.306% -16.972%] (p = 0.00 < 0.05)
                        thrpt:  [+20.441% +20.928% +21.448%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) low severe
  1 (1.00%) low mild
```
Before `cargo update` vs after `cargo update` with `RUSTFLAGS="-Ctarget-feature=+avx2"`:
```
Benchmarking stream/encrypt/131072: Collecting 100 samples in estimated 6.6394 s (20k iter
stream/encrypt/131072   time:   [1212129.4345 cycles 1212365.0293 cycles 1212570.7408 cycles]
                        thrpt:  [9.2512 cpb 9.2496 cpb 9.2478 cpb]
                 change:
                        time:   [-0.7644% -0.3612% +0.0306%] (p = 0.05 > 0.05)
                        thrpt:  [-0.0306% +0.3625% +0.7702%]
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) low severe
  5 (5.00%) low mild
---
Benchmarking stream/encrypt/131072: Collecting 100 samples in estimated 5.7746 s (30k iter
stream/encrypt/131072   time:   [702772.3597 cycles 702891.9797 cycles 703037.8149 cycles]
                        thrpt:  [5.3638 cpb 5.3626 cpb 5.3617 cpb]
                 change:
                        time:   [-42.097% -41.931% -41.730%] (p = 0.00 < 0.05)
                        thrpt:  [+71.615% +72.209% +72.703%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  8 (8.00%) low severe
  5 (5.00%) low mild
  1 (1.00%) high mild
```
And current master without vs with AVX2:
```
Benchmarking stream/encrypt/131072: Collecting 100 samples in estimated 5.4703 s (20k iter
stream/encrypt/131072   time:   [998447.7639 cycles 998734.2589 cycles 999051.0488 cycles]
                        thrpt:  [7.6222 cpb 7.6197 cpb 7.6176 cpb]
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) low severe
  3 (3.00%) low mild
---
Benchmarking stream/encrypt/131072: Collecting 100 samples in estimated 5.7943 s (30k iter
stream/encrypt/131072   time:   [705379.2776 cycles 705501.0598 cycles 705635.0365 cycles]
                        thrpt:  [5.3836 cpb 5.3825 cpb 5.3816 cpb]
                 change:
                        time:   [-29.475% -29.247% -28.993%] (p = 0.00 < 0.05)
                        thrpt:  [+40.830% +41.337% +41.795%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild
```
And `age` vs `rage` (current master of each) on my desktop:
```
$ head -c 2147483648 /dev/zero | time tmp/age -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
0.96user 0.48system 0:01.45elapsed 99%CPU (0avgtext+0avgdata 2840maxresident)k
0inputs+0outputs (0major+763minor)pagefaults 0swaps
$ head -c 2147483648 /dev/zero | time tmp/age -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
1.09user 0.39system 0:01.47elapsed 100%CPU (0avgtext+0avgdata 2840maxresident)k
0inputs+0outputs (0major+763minor)pagefaults 0swaps
$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
4.57user 0.62system 0:05.19elapsed 100%CPU (0avgtext+0avgdata 1852maxresident)k
0inputs+0outputs (0major+503minor)pagefaults 0swaps
$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
4.50user 0.68system 0:05.22elapsed 99%CPU (0avgtext+0avgdata 1856maxresident)k
0inputs+0outputs (0major+505minor)pagefaults 0swaps
$ RUSTFLAGS="-Ctarget-feature=+avx2" cargo build --release
$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
3.12user 0.75system 0:03.88elapsed 99%CPU (0avgtext+0avgdata 1852maxresident)k
0inputs+0outputs (0major+504minor)pagefaults 0swaps
$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
2.98user 0.89system 0:03.88elapsed 99%CPU (0avgtext+0avgdata 1852maxresident)k
0inputs+0outputs (0major+504minor)pagefaults 0swaps
```
So the current status is that `rage` is 3.57x slower than `age`, and `rage` compiled with AVX2 is 2.66x slower than `age`.
Would be great to understand what exactly slows us down at this point. Not sure what the best way to profile traces in Rust is.
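One option (an assumption; the profiling later in this thread was done with `pprof`) is the `cargo-flamegraph` tool, which wraps Linux `perf`:

```
$ cargo install flamegraph
$ head -c 2147483648 /dev/zero | cargo flamegraph -- -r age1somerecipient >/dev/null
$ # inspect the generated flamegraph.svg
```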
> `rage` compiled with AVX2 is 2.66x slower than `age`.
Is there any reason not to compile with AVX2? I think almost every x86 CPU nowadays supports it.
The three main things are:

- The Poly1305 implementation isn't SIMD. See all of the discussion here about that.
- The `chacha20poly1305` crate is 2-pass instead of 1-pass (see the sketch after this list). ~~I can open an issue for that~~ if anyone wants to try to convert it to 1-pass, it should be fairly easy (edit: opened https://github.com/RustCrypto/AEADs/issues/74).
- The `chacha20` crate is still slower than it could be, even with AVX2. See the benchmarks versus the `c2-chacha` crate here (`c2-chacha` is ~45% faster).
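To make the 2-pass point concrete, here is a simplified sketch of the difference. The `encrypt_block` and `mac_block` closures are hypothetical stand-ins for ChaCha20 and Poly1305, not the `chacha20poly1305` crate's actual internals; the point is that the 2-pass version streams the whole buffer through memory twice, while the 1-pass version touches each block once while it's still hot in cache.

```rust
const BLOCK: usize = 64;

fn seal_two_pass(
    buf: &mut [u8],
    encrypt_block: &mut impl FnMut(&mut [u8]),
    mac_block: &mut impl FnMut(&[u8]),
) {
    for block in buf.chunks_mut(BLOCK) {
        encrypt_block(block); // pass 1: encrypt the whole buffer
    }
    for block in buf.chunks(BLOCK) {
        mac_block(block); // pass 2: re-read all the ciphertext to authenticate it
    }
}

fn seal_one_pass(
    buf: &mut [u8],
    encrypt_block: &mut impl FnMut(&mut [u8]),
    mac_block: &mut impl FnMut(&[u8]),
) {
    for block in buf.chunks_mut(BLOCK) {
        encrypt_block(block);
        mac_block(block); // authenticate each block while it's still in cache
    }
}
```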
Re: the third item, the `c2-chacha` crate impls the `stream-cipher` API. With a small API change to the `chacha20poly1305` crate, I could make the underlying ChaCha implementation generic so you could swap in its implementation.
> Is there any reason not to compile with AVX2? I think almost every x86 CPU nowadays supports it.
Nope! Looking at the December 2019 Steam hardware survey, 77.05% of the surveyed Windows machines (which made up 96.86% of the survey, so I'm not looking at the macOS or Linux figures) support AVX2. Given that gamers tend towards newer hardware, this is most likely an upper bound on support (by how much, IDK). See also this Rust discussion thread.
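For binaries that can't assume AVX2 at compile time, runtime feature detection is the usual alternative. A minimal sketch: `is_x86_feature_detected!` is a real std macro, while the two backend functions are hypothetical stand-ins for real SIMD implementations.

```rust
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn encrypt_chunk_avx2(chunk: &mut [u8]) {
    // ... AVX2 implementation would go here ...
    let _ = chunk;
}

fn encrypt_chunk_portable(chunk: &mut [u8]) {
    // ... portable fallback implementation ...
    let _ = chunk;
}

fn encrypt_chunk(chunk: &mut [u8]) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: we just checked that the CPU supports AVX2.
            return unsafe { encrypt_chunk_avx2(chunk) };
        }
    }
    encrypt_chunk_portable(chunk)
}
```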
@tarcieri I thought a nonce-misuse-resistant construction cannot be 1-pass? Specifically, SIV. Am I wrong?
@paulmillr that’s true (for encryption, decryption in a SIV mode can still be 1-pass), but we’re talking about ChaCha20Poly1305 here...
If anyone would like to try wiring it up, `chacha20poly1305` v0.4 now has a generic `ChaChaPoly1305` type which should theoretically be usable with the `ChaCha20` implementation in the `c2-chacha` crate.
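For anyone who wants to try, the wiring could look roughly like the sketch below. This is hypothetical: the `c2_chacha::ChaCha20` path and the claim that it satisfies the `stream-cipher` trait bounds that `ChaChaPoly1305` requires are assumptions, not verified against the released crates.

```rust
// Assumed imports; not verified against chacha20poly1305 v0.4 / c2-chacha.
use c2_chacha::ChaCha20;
use chacha20poly1305::ChaChaPoly1305;

// An AEAD instance backed by c2-chacha's faster ChaCha20 implementation.
type C2ChaCha20Poly1305 = ChaChaPoly1305<ChaCha20>;
```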
Benchmarks showed its AVX2 backend was about 40% faster than the `chacha20` crate. I've been meaning to investigate why, and to see if there's something suboptimal in the `chacha20` crate (whose implementation is significantly simpler than what's in `c2-chacha` + `ppv-lite86`).
Ooh, thanks! I'll try that today :smiley:
Also note that the `chacha20` dependency in `chacha20poly1305` is now optional, if `c2-chacha` ends up working out.
What about parallelism / multicore usage? Anything we could do here?
STREAM is "embarrassingly parallel", so pick any parallelization strategy you want.
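As a sketch of what that could look like (a possible design, not rage's current code): each STREAM chunk is sealed with its own counter-derived nonce, so chunks can be encrypted on separate threads as long as the output order is preserved. With the `rayon` crate:

```rust
use rayon::prelude::*;

const CHUNK_SIZE: usize = 64 * 1024;

/// Encrypts each chunk in parallel. `seal` is a hypothetical stand-in for
/// the per-chunk AEAD call; the chunk index feeds the STREAM nonce counter.
fn encrypt_parallel(
    plaintext: &[u8],
    seal: impl Fn(u64, &[u8]) -> Vec<u8> + Sync,
) -> Vec<Vec<u8>> {
    plaintext
        .par_chunks(CHUNK_SIZE)
        .enumerate()
        .map(|(index, chunk)| seal(index as u64, chunk))
        // An indexed parallel iterator collects in the original chunk order.
        .collect()
}
```

(A real implementation would also need to mark the final chunk, per the STREAM construction.)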
Current master of each (measured on my laptop, a ThinkPad P1 with a Xeon E-2176M):
```
$ head -c 2147483648 /dev/zero | time tmp/age -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
0.98user 0.88system 0:01.85elapsed 100%CPU (0avgtext+0avgdata 8276maxresident)k
0inputs+0outputs (0major+181minor)pagefaults 0swaps
$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
3.28user 1.01system 0:04.29elapsed 99%CPU (0avgtext+0avgdata 4180maxresident)k
0inputs+0outputs (0major+192minor)pagefaults 0swaps
$ RUSTFLAGS="-Ctarget-feature=+avx2" cargo build --release
$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
3.34user 0.89system 0:04.23elapsed 99%CPU (0avgtext+0avgdata 3896maxresident)k
0inputs+0outputs (0major+188minor)pagefaults 0swaps
```
`rage` is 2.32x slower than `age`, and `rage` compiled with AVX2 is 2.29x slower than `age`. These are basically the same now due to `c2-chacha`, but there's some small overhead that is improved with explicit AVX2 compilation. Not enough for me to worry about, though.
I've used `pprof` to generate a flame graph for `rage` running as part of the above command (without the explicit AVX2 flag).
Reading the 2 GiB input from `/dev/zero` is around 17.6% of the execution time, and 23.9% is time spent inside the `c2-chacha` crate.
The largest time sink is clearly the `poly1305` crate, which does not yet have an AVX2 implementation and accounts for 53.1% of overall execution. I'm going to work on RustCrypto/universal-hashes#46 this weekend to try to address this.
I've managed to speed up `poly1305` by refactoring it 😄
Same `age` as last time (dunno why my laptop is feeling faster today):
```
$ head -c 2147483648 /dev/zero | time tmp/age -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
1.15user 0.54system 0:01.69elapsed 100%CPU (0avgtext+0avgdata 10264maxresident)k
0inputs+0outputs (0major+180minor)pagefaults 0swaps
```
Current master of `rage` + current master of `poly1305` (equivalent to the published crate):
```
$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
3.37user 0.76system 0:04.13elapsed 99%CPU (0avgtext+0avgdata 41656maxresident)k
0inputs+56outputs (0major+8804minor)pagefaults 0swaps
```
Current master of `rage` + RustCrypto/universal-hashes#48:
```
$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
2.95user 0.63system 0:03.58elapsed 100%CPU (0avgtext+0avgdata 41636maxresident)k
0inputs+56outputs (0major+8802minor)pagefaults 0swaps
```
Flame graph (highlighted sections are the `poly1305` crate, taking up 41.7% of execution time):
(Note that the flame graphs are probabilistic; running the test repeatedly, I see `poly1305` taking anywhere from 41.7% up to 49% of execution time.)
rust go brrrrrr
Re-ran the numbers on my desktop now that we've finally pulled in the `poly1305` performance improvements:
https://github.com/FiloSottile/age/commit/31500bfa2f6a36d2958483fc54d6e3cc74154cbc compiled with Go 1.13 (aka what my CI system generates for interoperability testing):
```
$ head -c 2147483648 /dev/zero | time tmp/age -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
0.79user 0.67system 0:01.45elapsed 100%CPU (0avgtext+0avgdata 8288maxresident)k
0inputs+0outputs (0major+181minor)pagefaults 0swaps
```
`rage` 70cbf9a8bcaf08ecb95d1007e22d682faf6ff222 compiled with Rust 1.45.0 (the MSRV):
```
$ cargo clean
$ cargo build --release
$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
2.34user 0.64system 0:02.99elapsed 99%CPU (0avgtext+0avgdata 4992maxresident)k
0inputs+0outputs (0major+272minor)pagefaults 0swaps
$ cargo clean
$ RUSTFLAGS="-Ctarget-feature=+avx2" cargo build --release
$ head -c 2147483648 /dev/zero | time target/release/rage -r age1fl45as7lv56lzg3tv76v0nkew0rukgl706gycrkmqq6ju86rzgdssjs7yt >/dev/null
1.43user 0.67system 0:02.11elapsed 99%CPU (0avgtext+0avgdata 5348maxresident)k
0inputs+0outputs (0major+278minor)pagefaults 0swaps
```
By those numbers, `rage` is 2.06x slower than `age`, and `rage` compiled with AVX2 is 1.46x slower than `age`.
@str4d a few options for additional improvements:

- `asm` implementations of ChaCha20 and/or Poly1305
- Pipelining ChaCha20 and Poly1305 via XMM registers
I assume STREAM is still not parallel? I'd focus on that instead of using low-level, dangerous `asm` code.