Refactor: Cipher_Mode::finish_msg() using std::span
This aims at removing the std::vector dependency from Cipher_Mode::finish_msg() to avoid unnecessary copies and allocations. Another goal is to factor the offset parameter handling out of the concrete implementations and into the Cipher_Mode base class. Note that this is not meant to be public-facing API just yet; I'm focusing on the private customization point in this PR.
The idea is that the new (probably eventually public) method Cipher_Mode::bytes_needed_for_finalization() reports the exact number of bytes needed to finalize the cipher mode. With that, the base class (or eventually the application?) allocates enough memory and passes it into finish_msg() as a span. finish_msg() then finalizes the cipher operation in-place and returns the number of bytes of the span that are actually part of the result. With this, the base class can, for instance, strip authentication tags from decryption results.
Personally, I find this API functional enough for internal use but fairly involved for a public interface. I'm open for suggestions here. 🙂
coverage: 90.588% (+0.02%) from 90.572% when pulling 79ae69b5e97f84d5fedac9c5a4a590f8b0e33fb4 on Rohde-Schwarz:refactor/span-for-cipher-mode-finalization into e3b3452ec48335e4eaa80e6e1a99e9b16657f6a2 on randombit:master.
This is quite a hefty change, because of significant adaptations in all Cipher_Mode finalization implementations. I think the modernization is worthwhile, though: it saves allocations and copies in several places, and I find the use of span makes for easier-to-understand code. There's certainly more room for improvement in some implementations once process_msg also learns to deal with spans.
Unfortunately, I don't see an obvious way to split this up into smaller patches. I'll do a self-review after walking away from it for a few days, though I'd really appreciate more eyes on this.
You may have to rebase for #4776
> You may have to rebase for https://github.com/randombit/botan/pull/4776
Done.
Ready for review, from my side. @randombit
I took a look already some time ago and this seemed fine in the outline but I need to do a full review. Next in my queue.
Initial comment, please rebase to pick up the recent clang-tidy CI changes
> Initial comment, please rebase to pick up the recent clang-tidy CI changes
Done. One minor conflict in the SIV header. No clang-tidy hiccup. 😎
> Personally, I find this API functional enough for internal use but fairly involved for a public interface.
Yeah the interface for cipher modes is really fiddly. Was designed trying to balance usability vs minimizing copies, I'm not sure it really ended up in the right place. But I also have not come up with anything better so far. :/
In general the change looks ok. One thing I'm concerned about is that this seems to be adding quite a bit of overhead - though tbh right now I do not see where.
For example, on master:

```
CTR-BE(AES-128) encrypt buffer size 8 bytes: 685.081 MiB/sec 4.17 cycles/byte (342.56 MiB in 500.03 ms)
```

and on this branch:

```
CTR-BE(AES-128) encrypt buffer size 8 bytes: 544.767 MiB/sec 5.24 cycles/byte (272.44 MiB in 500.10 ms)
```
In absolute terms the overhead here is probably quite small, but in cases where the message is small (so finish cost dominates) and the cipher is fast (as with AES-NI) it's quite noticeable. I see it also in other ciphers (eg ~7% regression for AES-128-GCM with 8 byte messages)
> In absolute terms the overhead here is probably quite small, but in cases where the message is small (so finish cost dominates) and the cipher is fast (as with AES-NI) it's quite noticeable.
Good point to measure it like that. That's surprising to me as well. I'll look into it, ~~though it may have to wait a bit, unfortunately~~.
Edit: I couldn't wait to take a quick glance after all. Especially the CTR mode becoming slower is very unexpected: the finalization for stream ciphers is essentially a NOOP. Also, in the Stream_Mode base class I moved from calling update() to process(), which actually removes an indirection (albeit one that was likely inlined anyway), and with that a call to std::vector::resize() (that would always have been a NOOP as well).
The only thing I could immediately blame this on is the new virtual method call (bytes_needed_for_finalization()) in the generic implementation of finish(), which is useless for stream ciphers but can't be avoided in our class hierarchy.
A quick test on my M2 MacBook Air over breakfast coffee does not reproduce this finding on my machine, though:
```
➜ botan git:(master) ✗ ./botan speed --msec=2000 --buf-size=8 "CTR-BE(AES-128)"
CTR-BE(AES-128) encrypt buffer size 8 bytes: 551.325 MiB/sec (1102.688 MiB in 2000.069 ms)
➜ botan git:(refactor/span-for-cipher-mode-finalization) ✗ ./botan speed --msec=2000 --buf-size=8 "CTR-BE(AES-128)"
CTR-BE(AES-128) encrypt buffer size 8 bytes: 551.307 MiB/sec (1102.625 MiB in 2000.021 ms)
```

Multiple runs jitter between roughly 548 MiB/s and 560 MiB/s in both cases.
It's very reliable for me on a Tiger Lake laptop. There is some noise to be sure (~5% back and forth depending on the run) but this is more like a 25% drop.
master (Clang 20):

```
CTR-BE(AES-128) encrypt buffer size 8 bytes: 687.218 MiB/sec 4.16 cycles/byte (343.62 MiB in 500.02 ms)
CTR-BE(AES-128) decrypt buffer size 8 bytes: 595.302 MiB/sec 4.80 cycles/byte (297.69 MiB in 500.06 ms)
CTR-BE(AES-128) encrypt buffer size 16 bytes: 1149.485 MiB/sec 2.48 cycles/byte (574.75 MiB in 500.01 ms)
CTR-BE(AES-128) decrypt buffer size 16 bytes: 1050.926 MiB/sec 2.72 cycles/byte (525.50 MiB in 500.04 ms)
```

here (Clang 20):

```
CTR-BE(AES-128) encrypt buffer size 8 bytes: 553.520 MiB/sec 5.16 cycles/byte (276.81 MiB in 500.10 ms)
CTR-BE(AES-128) decrypt buffer size 8 bytes: 479.829 MiB/sec 5.95 cycles/byte (239.94 MiB in 500.05 ms)
CTR-BE(AES-128) encrypt buffer size 16 bytes: 980.258 MiB/sec 2.91 cycles/byte (490.19 MiB in 500.06 ms)
CTR-BE(AES-128) decrypt buffer size 16 bytes: 863.294 MiB/sec 3.31 cycles/byte (431.69 MiB in 500.05 ms)
```

master (GCC 15):

```
CTR-BE(AES-128) encrypt buffer size 8 bytes: 843.907 MiB/sec 3.38 cycles/byte (422.00 MiB in 500.06 ms)
CTR-BE(AES-128) decrypt buffer size 8 bytes: 695.764 MiB/sec 4.11 cycles/byte (347.94 MiB in 500.08 ms)
CTR-BE(AES-128) encrypt buffer size 16 bytes: 1557.379 MiB/sec 1.83 cycles/byte (778.75 MiB in 500.04 ms)
CTR-BE(AES-128) decrypt buffer size 16 bytes: 1306.543 MiB/sec 2.19 cycles/byte (653.31 MiB in 500.03 ms)
```

here (GCC 15):

```
CTR-BE(AES-128) encrypt buffer size 8 bytes: 710.878 MiB/sec 4.02 cycles/byte (355.50 MiB in 500.09 ms)
CTR-BE(AES-128) decrypt buffer size 8 bytes: 597.441 MiB/sec 4.78 cycles/byte (298.75 MiB in 500.05 ms)
CTR-BE(AES-128) encrypt buffer size 16 bytes: 1341.603 MiB/sec 2.13 cycles/byte (670.81 MiB in 500.01 ms)
CTR-BE(AES-128) decrypt buffer size 16 bytes: 1138.683 MiB/sec 2.51 cycles/byte (569.38 MiB in 500.03 ms)
```
I'll check on another machine
Did some experimenting with disabling different parts of finish to see where the overhead comes from. The issue is definitely the virtual calls. Basically I'm seeing a ~25% perf drop (in this one specific small-message/fast-cipher case) and if I comment out

```cpp
if(final_input_bytes < minimum_final_size()) {
   throw Invalid_Argument("final input message is too small");
}
```
I get ~12% back.
And if I replace

```cpp
const auto final_buffer_bytes = bytes_needed_for_finalization(final_input_bytes);

// Make room for additional overhead to be produced during finalization
if(final_input_bytes < final_buffer_bytes) {
   final_block.resize(offset + final_buffer_bytes);
}
```

with

```cpp
const size_t final_buffer_bytes = final_input_bytes;
```

(only valid for stream modes, obviously) I get back the other ~12%.
So it seems the virtual calls are more costly on this laptop than on your machine. Or possibly the M2's AES hardware is not fast enough for the overhead to become visible? IDK
So one immediate suggestion that seems honestly an improvement anyway, regardless of the overhead issue:
Each implementation of `bytes_needed_for_finalization` should itself check if the input value is less than `minimum_final_size` and throw if so. Then 2 virtual calls become 1 (since `bytes_needed_for_finalization` occurs in final leaf classes, so the compiler can devirtualize the `minimum_final_size` call)
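A minimal sketch of the devirtualization argument, with invented class names (not Botan's actual hierarchy): once the size check lives inside the `final` leaf class, the leaf's call to its own `minimum_final_size()` is statically bound, so only the outer dispatch remains virtual.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical illustration (not Botan code) of moving the size check into
// the leaf implementation of bytes_needed_for_finalization().
class Mode {
   public:
      virtual ~Mode() = default;
      virtual size_t minimum_final_size() const = 0;
      virtual size_t bytes_needed_for_finalization(size_t input_bytes) const = 0;
};

class ToyAEAD final : public Mode {
   public:
      size_t minimum_final_size() const override { return 16; }  // e.g. the tag length

      size_t bytes_needed_for_finalization(size_t input_bytes) const override {
         // Check moved here from the generic finish(); because ToyAEAD is
         // `final`, the compiler can devirtualize this call.
         if(input_bytes < minimum_final_size()) {
            throw std::invalid_argument("final input message is too small");
         }
         return input_bytes;  // decryption example: tag consumed, nothing added
      }
};
```

With this arrangement the generic finish() pays for a single virtual dispatch instead of two.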
> Each implementation of `bytes_needed_for_finalization` should itself check if the input value is less than `minimum_final_size` and if so throw. Then 2 virtual calls becomes 1 (since `bytes_needed_for_finalization` occurs in final leaf classes and so the compiler can devirtualize the `minimum_final_size` call)
Very much a fair point!
I did what you suggested (in this commit: https://github.com/randombit/botan/pull/4880/commits/79ae69b5e97f84d5fedac9c5a4a590f8b0e33fb4) and it turns out that most implementations of bytes_needed_for_finalization didn't even need to be adapted, either because minimum_final_size was zero anyway or because they already contained the check. The new function calculate_final_input_bytes (implemented in the *.cpp) now takes care of both the offset check (mentioned here) and a debug-assert as a safety net for the implementation-specific check in bytes_needed_for_finalization.
Interestingly, the debug-assert forced me to adapt a few negative tests that were meant to trigger an Invalid_State exception but ran into the Invalid_Argument instead. 🙄 I pulled these test adaptations into an independent pull request and rebased this PR after merging it.
As a general idea for consideration: the concrete implementations (for the modes of operation but also other primitives) are hidden in internal headers anyway. So, in theory, we should be able to refactor them using 'static polymorphism' behind the public base class interfaces (similar to pcurves). That way, many algorithm-specific traits (block size, tag length, intrinsic parallelism, ...) could become compile-time constants and thus cheap to access. One day, perhaps...
Rebased after merging #4970. I'm happy to report that the stricter exception expectations did uncover another issue in commoncrypto that could have potentially caused actual harm!
Due to a missing "key is set" check in CommonCrypto_Cipher_Mode::bytes_needed_for_finalization a call to finish() on an unkeyed object called CCCryptorGetOutputLength with a nullptr handle. This resulted in an unreasonably large "expected output length" (probably some UB). The generic Cipher_Mode::finish() implementation then tried to allocate a huge vector causing some STL exception to be thrown.
Before #4970 this STL exception was 'good enough', now it was flagged as unexpected. 😎
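The failure mode can be sketched as follows. This is not the actual CommonCrypto wrapper code; `ToyBackendMode`, `set_key`, and `m_key_set` are invented names illustrating the missing guard that was added:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical sketch (not the real CommonCrypto_Cipher_Mode): asking the
// backend for the expected output length is only meaningful once a key has
// been set; otherwise the backend handle is invalid and the reported length
// is garbage, which the generic finish() would then try to allocate.
class ToyBackendMode {
   public:
      void set_key() { m_key_set = true; }

      size_t bytes_needed_for_finalization(size_t input_bytes) const {
         if(!m_key_set) {
            // the check that was missing: fail early with a meaningful error
            throw std::logic_error("cipher mode has no key set");
         }
         return input_bytes + 16;  // e.g. room for an authentication tag
      }

   private:
      bool m_key_set = false;
};
```

Failing early with a well-defined exception is what the stricter test expectations from #4970 now enforce.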
> As a general idea for consideration: the concrete implementations (for the modes of operation but also other primitives) are hidden in internal headers anyway. So, in theory, we should be able to refactor them using 'static polymorphism' behind the public base class interfaces (similar to pcurves). That way, many algorithm-specific traits (block size, tag length, intrinsic parallelism, ...) could become compile-time constants and thus cheap to access.
Another (completely orthogonal) idea I had while thinking about this PR is that Cipher_Mode was ill-designed and it would have been much better to directly express encryption vs decryption modes in the API, rather than covering both operations with the same API. Since only encryption ever expands, while decryption contracts, likely we could offer dedicated APIs that are simpler for each respective usage.
This is still fixable, since happily all of the cipher implementations are themselves private. And at least unlike a refactoring of an existing API like here, it can be done incrementally one mode at a time. But it should probably happen as part of a larger rethink of how cipher operations are expressed (see mistakes.rst)
> ... it would have been much better to directly express encryption vs decryption modes in the API, rather than covering both operations with the same API. Since only encryption ever expands, while decryption contracts, likely we could offer dedicated APIs that are simpler for each respective usage.
Yes, full ack. I spent the morning trying to come up with a decent public API for a new Cipher_Mode::finish and got lost in the details. From what I saw, those are the complexities it would have to handle (essentially all at once):
- Encryption
  - Addition of the tag (usually at the end, but sometimes in the front, e.g. SIV)
  - Addition of a padding, or ciphertext-stealing-style finalizations
  - Sometimes both, e.g. CBC+MAC or MAC+CBC in TLS 1.2
- Decryption
  - Reading/verifying the tag, then removing it
  - Reading/verifying paddings or CTS, then maybe removing it
  - In general, usually mixed with the above
- Buffered modes producing the entire ciphertext at the end, even when using `process()` or `update()`, along with padding/tag addition/removal
At this point, I would probably just expose an equivalent version of finish_msg() for users that want to avoid the allocations/copies (e.g. https://github.com/randombit/botan/issues/4697) that may come with the existing std::vector-based version. It's not very ergonomic, but I doubt it's going to get much better anyway. And for most people, the existing API is likely just fine for now.
@randombit from my side this is ready to go in. Or am I missing something?