AEADs
`chacha20poly1305`: Implement one-pass encryption and decryption
Builds on https://github.com/RustCrypto/AEADs/pull/415.
Currently this still makes things slower. I've benchmarked per-commit, as migrating to `universal-hash 0.5`
on its own makes things faster (I presume because we now have a single runtime feature check for the entire message).
Upgrade to `universal-hash 0.5` and latest `poly1305`:
chacha20poly1305/encrypt/1024
time: [5260.8729 cycles 5262.1619 cycles 5263.5785 cycles]
thrpt: [5.1402 cpb 5.1388 cpb 5.1376 cpb]
change:
time: [-11.713% -11.620% -11.539%] (p = 0.00 < 0.05)
thrpt: [+13.044% +13.147% +13.267%]
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) low mild
2 (2.00%) high mild
4 (4.00%) high severe
chacha20poly1305/decrypt/1024
time: [5007.6309 cycles 5009.2505 cycles 5011.0551 cycles]
thrpt: [4.8936 cpb 4.8918 cpb 4.8903 cpb]
change:
time: [+3.3672% +3.4391% +3.5197%] (p = 0.00 < 0.05)
thrpt: [-3.4000% -3.3247% -3.2575%]
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
2 (2.00%) high mild
4 (4.00%) high severe
chacha20poly1305/encrypt/2048
time: [6629.7546 cycles 6631.6433 cycles 6633.7002 cycles]
thrpt: [3.2391 cpb 3.2381 cpb 3.2372 cpb]
change:
time: [-10.277% -10.201% -10.131%] (p = 0.00 < 0.05)
thrpt: [+11.273% +11.360% +11.454%]
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
2 (2.00%) low mild
5 (5.00%) high mild
2 (2.00%) high severe
chacha20poly1305/decrypt/2048
time: [5200.1415 cycles 5201.3475 cycles 5202.6369 cycles]
thrpt: [2.5404 cpb 2.5397 cpb 2.5391 cpb]
change:
time: [-1.8221% -1.7513% -1.6664%] (p = 0.00 < 0.05)
thrpt: [+1.6946% +1.7825% +1.8559%]
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
1 (1.00%) low mild
3 (3.00%) high mild
5 (5.00%) high severe
chacha20poly1305/encrypt/4096
time: [8549.8252 cycles 8551.9567 cycles 8554.4275 cycles]
thrpt: [2.0885 cpb 2.0879 cpb 2.0874 cpb]
change:
time: [-22.983% -22.956% -22.930%] (p = 0.00 < 0.05)
thrpt: [+29.753% +29.797% +29.842%]
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) low mild
1 (1.00%) high mild
2 (2.00%) high severe
chacha20poly1305/decrypt/4096
time: [5001.7972 cycles 5004.1684 cycles 5007.4453 cycles]
thrpt: [1.2225 cpb 1.2217 cpb 1.2211 cpb]
change:
time: [-24.395% -24.356% -24.316%] (p = 0.00 < 0.05)
thrpt: [+32.128% +32.198% +32.267%]
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) high mild
4 (4.00%) high severe
chacha20poly1305/encrypt/8192
time: [19351.4972 cycles 19354.6604 cycles 19357.8118 cycles]
thrpt: [2.3630 cpb 2.3626 cpb 2.3622 cpb]
change:
time: [-17.125% -17.063% -17.007%] (p = 0.00 < 0.05)
thrpt: [+20.492% +20.573% +20.664%]
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
chacha20poly1305/decrypt/8192
time: [11706.8281 cycles 11709.9539 cycles 11713.9894 cycles]
thrpt: [1.4299 cpb 1.4294 cpb 1.4291 cpb]
change:
time: [-22.190% -22.131% -22.064%] (p = 0.00 < 0.05)
thrpt: [+28.310% +28.420% +28.518%]
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
1 (1.00%) low severe
4 (4.00%) low mild
4 (4.00%) high severe
chacha20poly1305/encrypt/16384
time: [23738.0720 cycles 23740.9454 cycles 23744.8765 cycles]
thrpt: [1.4493 cpb 1.4490 cpb 1.4489 cpb]
change:
time: [-44.422% -44.391% -44.358%] (p = 0.00 < 0.05)
thrpt: [+79.721% +79.827% +79.927%]
Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
5 (5.00%) high mild
8 (8.00%) high severe
chacha20poly1305/decrypt/16384
time: [6387.6400 cycles 6389.3812 cycles 6391.6555 cycles]
thrpt: [0.3901 cpb 0.3900 cpb 0.3899 cpb]
change:
time: [-76.379% -76.369% -76.355%] (p = 0.00 < 0.05)
thrpt: [+322.93% +323.17% +323.36%]
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
1 (1.00%) low severe
8 (8.00%) high severe
Using one-pass encryption (relative to `universal-hash 0.5`):
chacha20poly1305/encrypt/1024
time: [8605.9437 cycles 8608.5776 cycles 8611.5313 cycles]
thrpt: [8.4097 cpb 8.4068 cpb 8.4042 cpb]
change:
time: [+63.506% +63.635% +63.754%] (p = 0.00 < 0.05)
thrpt: [-38.933% -38.888% -38.840%]
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
2 (2.00%) high mild
6 (6.00%) high severe
chacha20poly1305/decrypt/1024
time: [10972.4717 cycles 10976.5257 cycles 10982.2696 cycles]
thrpt: [10.7249 cpb 10.7193 cpb 10.7153 cpb]
change:
time: [+118.89% +119.07% +119.25%] (p = 0.00 < 0.05)
thrpt: [-54.390% -54.352% -54.315%]
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
5 (5.00%) high mild
5 (5.00%) high severe
chacha20poly1305/encrypt/2048
time: [14190.3986 cycles 14195.1670 cycles 14200.3840 cycles]
thrpt: [6.9338 cpb 6.9312 cpb 6.9289 cpb]
change:
time: [+114.05% +114.18% +114.32%] (p = 0.00 < 0.05)
thrpt: [-53.340% -53.311% -53.282%]
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
5 (5.00%) high mild
3 (3.00%) high severe
chacha20poly1305/decrypt/2048
time: [16990.9948 cycles 17000.9798 cycles 17011.2346 cycles]
thrpt: [8.3063 cpb 8.3013 cpb 8.2964 cpb]
change:
time: [+226.30% +226.56% +226.78%] (p = 0.00 < 0.05)
thrpt: [-69.399% -69.378% -69.353%]
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
chacha20poly1305/encrypt/4096
time: [24100.7668 cycles 24114.2547 cycles 24130.6399 cycles]
thrpt: [5.8913 cpb 5.8873 cpb 5.8840 cpb]
change:
time: [+181.86% +181.99% +182.12%] (p = 0.00 < 0.05)
thrpt: [-64.554% -64.538% -64.521%]
Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
2 (2.00%) low severe
3 (3.00%) low mild
8 (8.00%) high severe
chacha20poly1305/decrypt/4096
time: [28341.4014 cycles 28359.3332 cycles 28380.0748 cycles]
thrpt: [6.9287 cpb 6.9237 cpb 6.9193 cpb]
change:
time: [+466.13% +466.49% +466.83%] (p = 0.00 < 0.05)
thrpt: [-82.358% -82.347% -82.336%]
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
chacha20poly1305/encrypt/8192
time: [41099.8173 cycles 41126.4633 cycles 41165.2353 cycles]
thrpt: [5.0251 cpb 5.0203 cpb 5.0171 cpb]
change:
time: [+112.45% +112.77% +113.23%] (p = 0.00 < 0.05)
thrpt: [-53.102% -53.001% -52.929%]
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
3 (3.00%) high mild
8 (8.00%) high severe
chacha20poly1305/decrypt/8192
time: [51776.6583 cycles 51815.2619 cycles 51881.6442 cycles]
thrpt: [6.3332 cpb 6.3251 cpb 6.3204 cpb]
change:
time: [+341.97% +342.29% +342.60%] (p = 0.00 < 0.05)
thrpt: [-77.406% -77.390% -77.374%]
Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) high mild
4 (4.00%) high severe
chacha20poly1305/encrypt/16384
time: [82441.3891 cycles 82444.0295 cycles 82447.2164 cycles]
thrpt: [5.0322 cpb 5.0320 cpb 5.0318 cpb]
change:
time: [+246.92% +247.07% +247.21%] (p = 0.00 < 0.05)
thrpt: [-71.199% -71.188% -71.175%]
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) high mild
5 (5.00%) high severe
chacha20poly1305/decrypt/16384
time: [100814.9950 cycles 100820.8676 cycles 100828.7900 cycles]
thrpt: [6.1541 cpb 6.1536 cpb 6.1533 cpb]
change:
time: [+1476.9% +1477.7% +1478.4%] (p = 0.00 < 0.05)
thrpt: [-93.664% -93.662% -93.659%]
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
1 (1.00%) high mild
10 (10.00%) high severe
I think it might be worth experimenting with adding some (unsafe) `x86`/`x86_64` platform-specific APIs to `chacha20` and `poly1305` which operate directly on SIMD registers, eschewing trait-based abstractions, and seeing what the performance difference is versus the current implementation with `universal-hash` v0.5.
It's not surprising that you get degraded performance with such core loops. Not only do you use the stream cipher with built-in buffering, which is not needed here, but each call to `apply_keystream` and `update_padded` also checks target features at runtime.
You should use the block-level ChaCha type and the code should look roughly like this (it's a very rough draft to demonstrate the idea):
let mut cipher: ChaChaCore<U10> = ...;
let mut mac: Poly1305 = ...;
// For simplicity, I will use closure syntax here instead of
// explicit implementations of the closure traits
cipher.process_with_backend(|cipher_backend| {
    mac.update_with_backend(|mac_backend| {
        // process AAD with mac_backend
        ...
        // least common multiple of the number of bytes which
        // can be processed by each backend in parallel
        let lcm_block_size = lcm(
            cipher_backend::BlockSize::USIZE * cipher_backend::ParBlocksSize::USIZE,
            mac_backend::BlockSize::USIZE * mac_backend::ParBlocksSize::USIZE,
        );
        let mut iter = buffer.chunks_exact_mut(lcm_block_size);
        for lcm_block in &mut iter {
            // We do not have the `cast` helper function right now;
            // it will probably be worth adding it later to one of
            // our traits or utility crates.
            let cipher_blocks: &mut [ChaChaParBlocks] = cast_mut(lcm_block);
            let mut par_ks_blocks: ChaChaParBlocks = Default::default();
            for par_blocks in cipher_blocks {
                cipher_backend.gen_par_ks_blocks(&mut par_ks_blocks);
                xor(par_blocks, &par_ks_blocks);
            }
            let mac_blocks: &[Poly1305ParBlocks] = cast(lcm_block);
            for par_blocks in mac_blocks {
                mac_backend.proc_par_blocks(par_blocks);
            }
        }
        let tail = iter.into_remainder();
        // process tail bytes
        ...
    })
});
The idea is that the compiler, after inlining the callbacks, will generate the core loop for each combination of supported backends. For hypothetical backends which do not support parallel processing (i.e. `ParBlocksSize` is equal to `U1` for both `cipher_backend` and `mac_backend`), the core part of the closures will collapse to code looking roughly like this (of course, it's not literal code, just a demonstration of the idea):
let mut iter = buffer.chunks_exact_mut(64);
for block in &mut iter {
    cipher.apply_ks_block(&mut block);
    mac.update_block(&block[0..16]);
    mac.update_block(&block[16..32]);
    mac.update_block(&block[32..48]);
    mac.update_block(&block[48..64]);
}
This allows the compiler to perform all the necessary optimizations for improved performance, such as keeping data in registers without spilling it onto the stack.
I think we can start with the non-interleaved version, and I will try to add an interleaved version for encryption later. As for decryption, I don't think we should perform decryption before the auth tag has been fully verified, since in some corner cases that may lead to unintended exposure of plaintext. But I guess it will depend on how much performance benefit the interleaving could give us.
Aha, this nested backend approach makes much more sense! I look forward to it being documented in the traits, but in the meantime I'll give this a go.
Ah, I now see why I didn't stumble upon this approach myself. The `StreamCipher` trait is not bound on `StreamCipherCore`, where `process_with_backend` is defined; in fact, `StreamCipher` is documented as "Synchronous stream cipher core trait", implying it is the root. The relationship between these two traits should definitely be documented, and if it makes sense for a `StreamCipher: StreamCipherCore` bound to exist, that should be added (or the reason why that doesn't exist should be documented).
Yeah, the "core" part in the docs of `StreamCipher` is confusing and should be fixed.
> if it makes sense for a `StreamCipher: StreamCipherCore` bound to exist, that should be added (or the reason why that doesn't exist should be documented).
No, it does not make sense. `StreamCipher` is a byte-level trait, while `StreamCipherCore` is a block-level one (like all other `*Core` traits). You "convert" a type which implements `StreamCipherCore` into a type which implements `StreamCipher` by wrapping it with `StreamCipherCoreWrapper`. This wrapper (similarly to `CoreWrapper` from the `digest` crate) handles the buffering which is necessary for byte-level APIs.
Codecov Report
Base: 81.54% // Head: 80.47% // Decreases project coverage by -1.06% :warning:
Coverage data is based on head (ac660d5) compared to base (2751627). Patch coverage: 72.89% of modified lines in pull request are covered.
Additional details and impacted files
@@ Coverage Diff @@
## master #414 +/- ##
==========================================
- Coverage 81.54% 80.47% -1.07%
==========================================
Files 32 32
Lines 1284 1439 +155
==========================================
+ Hits 1047 1158 +111
- Misses 237 281 +44
Impacted Files | Coverage Δ
---|---
chacha20poly1305/src/lib.rs | 76.92% <ø> (ø)
chacha20poly1305/tests/lib.rs | 100.00% <ø> (ø)
chacha20poly1305/src/cipher.rs | 74.86% <72.89%> (-15.76%) :arrow_down:
I've switched to using the nested closures approach, and it's still significantly slower. The next step is to look at the assembly differences between #415 and this PR, to figure out what's going on. Some of it might be the misalignment due to the AD, but I don't currently buy that as the entire cause.
@str4d #445 has landed, if you want to take a crack at using `blocks_needed_until_alignment`.
Status update: I've tried several times over the last 5 months to get this to work, and even with `blocks_needed_until_alignment` the details are horrendous. I'll try and get my in-progress state back into this PR so we can discuss it further.
I will reiterate my previous suggestion from upthread that we should try to get things properly pipelined on `x86`/`x86_64` without attempting any sort of trait-based abstraction, and then work backwards towards what the design of the trait-based abstraction should be, as clearly we don't quite yet have the right design or even the requirements nailed down properly.
I think I also have an attempt at that somewhere locally (that I started before learning of @newpavlov's trait APIs). I'll see if I can dig it up.
Rebased on latest `master`, and made a few tweaks to the impl (de-duplicating handling of complete tail blocks). Next I will force-push with my changes to use `blocks_needed_until_alignment`.
Force-pushed to use `UhfBackend::blocks_needed_to_align`. I also added a test vector (generated before this PR) for a message that is larger than the SIMD fast path (so it exercises all parts of the new code).
I left a very large comment in the code documenting the alignment problem. Something I have tried in the past, and am considering trying again, is to pull apart one of the ChaCha20Poly1305 assembly implementations to figure out how they deal with the fundamental alignment issue caused by the associated data.
@str4d now that inline ASM is stable, we can also add an `asm` feature that uses it.