AEADs
`chacha20poly1305`: Implement one-pass encryption and decryption
Builds on https://github.com/RustCrypto/AEADs/pull/415.
Currently this still makes things slower. I've benchmarked per-commit, as migrating to `universal-hash 0.5`
on its own makes things faster (I presume because we now have a single runtime feature check for the entire message).
Upgrade to `universal-hash 0.5` and latest `poly1305`:
chacha20poly1305/encrypt/1024
time: [5260.8729 cycles 5262.1619 cycles 5263.5785 cycles]
thrpt: [5.1402 cpb 5.1388 cpb 5.1376 cpb]
change:
time: [-11.713% -11.620% -11.539%] (p = 0.00 < 0.05)
thrpt: [+13.044% +13.147% +13.267%]
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) low mild
2 (2.00%) high mild
4 (4.00%) high severe
chacha20poly1305/decrypt/1024
time: [5007.6309 cycles 5009.2505 cycles 5011.0551 cycles]
thrpt: [4.8936 cpb 4.8918 cpb 4.8903 cpb]
change:
time: [+3.3672% +3.4391% +3.5197%] (p = 0.00 < 0.05)
thrpt: [-3.4000% -3.3247% -3.2575%]
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
2 (2.00%) high mild
4 (4.00%) high severe
chacha20poly1305/encrypt/2048
time: [6629.7546 cycles 6631.6433 cycles 6633.7002 cycles]
thrpt: [3.2391 cpb 3.2381 cpb 3.2372 cpb]
change:
time: [-10.277% -10.201% -10.131%] (p = 0.00 < 0.05)
thrpt: [+11.273% +11.360% +11.454%]
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
2 (2.00%) low mild
5 (5.00%) high mild
2 (2.00%) high severe
chacha20poly1305/decrypt/2048
time: [5200.1415 cycles 5201.3475 cycles 5202.6369 cycles]
thrpt: [2.5404 cpb 2.5397 cpb 2.5391 cpb]
change:
time: [-1.8221% -1.7513% -1.6664%] (p = 0.00 < 0.05)
thrpt: [+1.6946% +1.7825% +1.8559%]
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
1 (1.00%) low mild
3 (3.00%) high mild
5 (5.00%) high severe
chacha20poly1305/encrypt/4096
time: [8549.8252 cycles 8551.9567 cycles 8554.4275 cycles]
thrpt: [2.0885 cpb 2.0879 cpb 2.0874 cpb]
change:
time: [-22.983% -22.956% -22.930%] (p = 0.00 < 0.05)
thrpt: [+29.753% +29.797% +29.842%]
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) low mild
1 (1.00%) high mild
2 (2.00%) high severe
chacha20poly1305/decrypt/4096
time: [5001.7972 cycles 5004.1684 cycles 5007.4453 cycles]
thrpt: [1.2225 cpb 1.2217 cpb 1.2211 cpb]
change:
time: [-24.395% -24.356% -24.316%] (p = 0.00 < 0.05)
thrpt: [+32.128% +32.198% +32.267%]
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) high mild
4 (4.00%) high severe
chacha20poly1305/encrypt/8192
time: [19351.4972 cycles 19354.6604 cycles 19357.8118 cycles]
thrpt: [2.3630 cpb 2.3626 cpb 2.3622 cpb]
change:
time: [-17.125% -17.063% -17.007%] (p = 0.00 < 0.05)
thrpt: [+20.492% +20.573% +20.664%]
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
chacha20poly1305/decrypt/8192
time: [11706.8281 cycles 11709.9539 cycles 11713.9894 cycles]
thrpt: [1.4299 cpb 1.4294 cpb 1.4291 cpb]
change:
time: [-22.190% -22.131% -22.064%] (p = 0.00 < 0.05)
thrpt: [+28.310% +28.420% +28.518%]
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
1 (1.00%) low severe
4 (4.00%) low mild
4 (4.00%) high severe
chacha20poly1305/encrypt/16384
time: [23738.0720 cycles 23740.9454 cycles 23744.8765 cycles]
thrpt: [1.4493 cpb 1.4490 cpb 1.4489 cpb]
change:
time: [-44.422% -44.391% -44.358%] (p = 0.00 < 0.05)
thrpt: [+79.721% +79.827% +79.927%]
Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
5 (5.00%) high mild
8 (8.00%) high severe
chacha20poly1305/decrypt/16384
time: [6387.6400 cycles 6389.3812 cycles 6391.6555 cycles]
thrpt: [0.3901 cpb 0.3900 cpb 0.3899 cpb]
change:
time: [-76.379% -76.369% -76.355%] (p = 0.00 < 0.05)
thrpt: [+322.93% +323.17% +323.36%]
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
1 (1.00%) low severe
8 (8.00%) high severe
Using one-pass encryption (relative to `universal-hash 0.5`):
chacha20poly1305/encrypt/1024
time: [8605.9437 cycles 8608.5776 cycles 8611.5313 cycles]
thrpt: [8.4097 cpb 8.4068 cpb 8.4042 cpb]
change:
time: [+63.506% +63.635% +63.754%] (p = 0.00 < 0.05)
thrpt: [-38.933% -38.888% -38.840%]
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
2 (2.00%) high mild
6 (6.00%) high severe
chacha20poly1305/decrypt/1024
time: [10972.4717 cycles 10976.5257 cycles 10982.2696 cycles]
thrpt: [10.7249 cpb 10.7193 cpb 10.7153 cpb]
change:
time: [+118.89% +119.07% +119.25%] (p = 0.00 < 0.05)
thrpt: [-54.390% -54.352% -54.315%]
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
5 (5.00%) high mild
5 (5.00%) high severe
chacha20poly1305/encrypt/2048
time: [14190.3986 cycles 14195.1670 cycles 14200.3840 cycles]
thrpt: [6.9338 cpb 6.9312 cpb 6.9289 cpb]
change:
time: [+114.05% +114.18% +114.32%] (p = 0.00 < 0.05)
thrpt: [-53.340% -53.311% -53.282%]
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
5 (5.00%) high mild
3 (3.00%) high severe
chacha20poly1305/decrypt/2048
time: [16990.9948 cycles 17000.9798 cycles 17011.2346 cycles]
thrpt: [8.3063 cpb 8.3013 cpb 8.2964 cpb]
change:
time: [+226.30% +226.56% +226.78%] (p = 0.00 < 0.05)
thrpt: [-69.399% -69.378% -69.353%]
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
chacha20poly1305/encrypt/4096
time: [24100.7668 cycles 24114.2547 cycles 24130.6399 cycles]
thrpt: [5.8913 cpb 5.8873 cpb 5.8840 cpb]
change:
time: [+181.86% +181.99% +182.12%] (p = 0.00 < 0.05)
thrpt: [-64.554% -64.538% -64.521%]
Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
2 (2.00%) low severe
3 (3.00%) low mild
8 (8.00%) high severe
chacha20poly1305/decrypt/4096
time: [28341.4014 cycles 28359.3332 cycles 28380.0748 cycles]
thrpt: [6.9287 cpb 6.9237 cpb 6.9193 cpb]
change:
time: [+466.13% +466.49% +466.83%] (p = 0.00 < 0.05)
thrpt: [-82.358% -82.347% -82.336%]
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
chacha20poly1305/encrypt/8192
time: [41099.8173 cycles 41126.4633 cycles 41165.2353 cycles]
thrpt: [5.0251 cpb 5.0203 cpb 5.0171 cpb]
change:
time: [+112.45% +112.77% +113.23%] (p = 0.00 < 0.05)
thrpt: [-53.102% -53.001% -52.929%]
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
3 (3.00%) high mild
8 (8.00%) high severe
chacha20poly1305/decrypt/8192
time: [51776.6583 cycles 51815.2619 cycles 51881.6442 cycles]
thrpt: [6.3332 cpb 6.3251 cpb 6.3204 cpb]
change:
time: [+341.97% +342.29% +342.60%] (p = 0.00 < 0.05)
thrpt: [-77.406% -77.390% -77.374%]
Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) high mild
4 (4.00%) high severe
chacha20poly1305/encrypt/16384
time: [82441.3891 cycles 82444.0295 cycles 82447.2164 cycles]
thrpt: [5.0322 cpb 5.0320 cpb 5.0318 cpb]
change:
time: [+246.92% +247.07% +247.21%] (p = 0.00 < 0.05)
thrpt: [-71.199% -71.188% -71.175%]
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) high mild
5 (5.00%) high severe
chacha20poly1305/decrypt/16384
time: [100814.9950 cycles 100820.8676 cycles 100828.7900 cycles]
thrpt: [6.1541 cpb 6.1536 cpb 6.1533 cpb]
change:
time: [+1476.9% +1477.7% +1478.4%] (p = 0.00 < 0.05)
thrpt: [-93.664% -93.662% -93.659%]
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
1 (1.00%) high mild
10 (10.00%) high severe
I think it might be worth experimenting with adding some (unsafe) `x86`/`x86_64` platform-specific APIs to `chacha20` and `poly1305` which operate directly on SIMD registers, eschewing trait-based abstractions, and seeing what the performance difference is versus the current implementation with `universal-hash` v0.5.
It's not surprising that you get degraded performance with such core loops. Not only do you use the stream cipher with built-in buffering, which is not needed here, but each call to `apply_keystream` and `update_padded` also checks target features at runtime.
You should use the block-level ChaCha type and the code should look roughly like this (it's a very rough draft to demonstrate the idea):
let mut cipher: ChaChaCore<U10> = ...;
let mut mac: Poly1305 = ...;
// For simplicity, I will use closure syntax here instead of
// explicit implementations of the closure traits
cipher.process_with_backend(|cipher_backend| {
    mac.update_with_backend(|mac_backend| {
        // process AAD with mac_backend
        ...
        // least common multiple of the number of bytes which
        // can be processed by each backend in parallel
        let lcm_block_size = lcm(
            cipher_backend::BlockSize::USIZE * cipher_backend::ParBlocksSize::USIZE,
            mac_backend::BlockSize::USIZE * mac_backend::ParBlocksSize::USIZE,
        );
        let mut iter = buffer.chunks_exact_mut(lcm_block_size);
        for lcm_block in &mut iter {
            // We do not have the `cast` helper function right now;
            // it will probably be worth adding it later to one of
            // our traits or utility crates.
            let cipher_blocks: &mut [ChaChaParBlocks] = cast_mut(lcm_block);
            let mut par_ks_blocks: ChaChaParBlocks = Default::default();
            for par_blocks in cipher_blocks {
                cipher_backend.gen_par_ks_blocks(&mut par_ks_blocks);
                xor(par_blocks, &par_ks_blocks);
            }
            let mac_blocks: &[Poly1305ParBlocks] = cast(lcm_block);
            for par_blocks in mac_blocks {
                mac_backend.proc_par_blocks(par_blocks);
            }
        }
        let tail = iter.into_remainder();
        // process tail bytes
        ...
    })
});
The idea is that the compiler, after inlining the callbacks, will generate the core loop for each combination of supported backends. For hypothetical backends which do not support parallel processing (i.e. `ParBlocksSize` is equal to `U1` for both `cipher_backend` and `mac_backend`), the core part of the closures will collapse to code looking roughly like this (of course, it's not literal code, just a demonstration of the idea):
let mut iter = buffer.chunks_exact_mut(64);
for block in &mut iter {
    cipher.apply_ks_block(&mut block);
    mac.update_block(&block[0..16]);
    mac.update_block(&block[16..32]);
    mac.update_block(&block[32..48]);
    mac.update_block(&block[48..64]);
}
This allows the compiler to perform all the necessary optimizations for improved performance, such as keeping data in registers without spilling it onto the stack.
I think we can start with the non-interleaved version, and I will try to add an interleaved version for encryption later. As for decryption, I don't think we should perform decryption before the auth tag has been fully verified, since in some corner cases that may lead to unintended exposure of plaintext. But I guess it will depend on how much performance benefit the interleaving could give us.
Aha, this nested backend approach makes much more sense! I look forward to it being documented in the traits, but in the meantime I'll give this a go.
Ah, I now see why I didn't stumble upon this approach myself. The `StreamCipher` trait is not bound on `StreamCipherCore`, where `process_with_backend` is defined; in fact, `StreamCipher` is documented as "Synchronous stream cipher core trait", implying it is the root. The relationship between these two traits should definitely be documented, and if it makes sense for a `StreamCipher: StreamCipherCore` bound to exist, that should be added (or the reason why that doesn't exist should be documented).
Yeah, the "core" part in the docs of `StreamCipher` is confusing and should be fixed.
> if it makes sense for a `StreamCipher: StreamCipherCore` bound to exist, that should be added (or the reason why that doesn't exist should be documented).
No, it does not make sense. `StreamCipher` is a byte-level trait, while `StreamCipherCore` is a block-level one (like all other `*Core` traits). You "convert" a type which implements `StreamCipherCore` into a type which implements `StreamCipher` by wrapping it with `StreamCipherCoreWrapper`. This wrapper (similarly to `CoreWrapper` from the `digest` crate) handles the buffering which is necessary for byte-level APIs.
Codecov Report
Base: 81.54% // Head: 80.47% // Decreases project coverage by -1.06% :warning:
Coverage data is based on head (ac660d5) compared to base (2751627). Patch coverage: 72.89% of modified lines in pull request are covered.
Additional details and impacted files
@@ Coverage Diff @@
## master #414 +/- ##
==========================================
- Coverage 81.54% 80.47% -1.07%
==========================================
Files 32 32
Lines 1284 1439 +155
==========================================
+ Hits 1047 1158 +111
- Misses 237 281 +44
Impacted Files | Coverage Δ
---|---
chacha20poly1305/src/lib.rs | 76.92% <ø> (ø)
chacha20poly1305/tests/lib.rs | 100.00% <ø> (ø)
chacha20poly1305/src/cipher.rs | 74.86% <72.89%> (-15.76%) :arrow_down:
I've switched to using the nested closures approach, and it's still significantly slower. The next step is to look at the assembly differences between #415 and this PR, to figure out what's going on. Some of it might be the misalignment due to the AD, but I don't currently buy that as the entire cause.
@str4d #445 has landed, if you want to take a crack at using `blocks_needed_until_alignment`.
Status update: I've tried several times over the last 5 months to get this to work, and even with `blocks_needed_until_alignment` the details are horrendous. I'll try and get my in-progress state back into this PR so we can discuss it further.
I will reiterate my previous suggestion from upthread that we should try to get things properly pipelined on `x86`/`x86_64` without attempting any sort of trait-based abstraction, and then work backwards towards what the design of the trait-based abstraction should be, as clearly we don't quite yet have the right design or even the requirements nailed down properly.
I think I also have an attempt at that somewhere locally (that I started before learning of @newpavlov's trait APIs). I'll see if I can dig it up.
Rebased on latest `master`, and made a few tweaks to the impl (de-duplicating handling of complete tail blocks). Next I will force-push with my changes to use `blocks_needed_until_alignment`.
Force-pushed to use `UhfBackend::blocks_needed_to_align`. I also added a test vector (generated before this PR) for a message that is larger than the SIMD fast path (so it exercises all parts of the new code).
I left a very large comment in the code documenting the alignment problem. Something I have tried in the past, and am considering trying again, is to pull apart one of the ChaCha20Poly1305 assembly implementations to figure out how they deal with the fundamental alignment issue caused by the associated data.
@str4d now that inline ASM is stable, we can also add an `asm` feature that uses it.