openssl
openssl copied to clipboard
Implement AES CFB128 with Intel AES-NI and VAES
Phases:
- Refactoring: extract the AES CFB functionality to
cipher_aes_cfb*files
- Move
PROV_CIPHER_HW_aes_mode(cfb...)generator macros fromcipher_aes_hw.ctocipher_aes_cfb_hw.c - Move CFB code from
cipher_aes_hw_s390x.inctocipher_aes_cfb_hw_s390x.inc - Move
ossl_prov_cipher_hw_aes_cfbdeclarations fromcipher_aes.htocipher_aes_cfb.h - Move CFB code from
cipher_aes_hw_aesni.inctocipher_aes_cfb_hw_aesni.incand drop paths not relevant toEVP_CIPH_CFB_MODE. - Move CFB code from
cipher_aes_hw.ctocipher_aes_cfb_hw.cand drop paths not relevant toEVP_CIPH_CFB_MODE.
-
Add VAES feature detection and dispatching. AES-NI is required for
aesni_set_encrypt_key. -
Add serial AES-CFB128 encryption with AVX-512 and VAES
-
Add parallel AES-CFB128 decryption with AVX-512 and VAES
-
Hard-code
vbroadcasti32x4source operand size for nasm/masm translation.
openssl speed speedups have been obtained with:
$ LD_LIBRARY_PATH=`pwd` taskset -c 40 ./apps/openssl speed -evp AES-128-CFB
$ LD_LIBRARY_PATH=`pwd` taskset -c 40 ./apps/openssl speed -evp AES-192-CFB
$ LD_LIBRARY_PATH=`pwd` taskset -c 40 ./apps/openssl speed -evp AES-256-CFB
$ LD_LIBRARY_PATH=`pwd` taskset -c 40 ./apps/openssl speed -evp AES-128-CFB --decrypt
$ LD_LIBRARY_PATH=`pwd` taskset -c 40 ./apps/openssl speed -evp AES-192-CFB --decrypt
$ LD_LIBRARY_PATH=`pwd` taskset -c 40 ./apps/openssl speed -evp AES-256-CFB --decrypt
| Encryption | Block Size (bytes) | Speedup |
|---|---|---|
| AES-128-CFB | 16 | 1.45 |
| 64 | 1.63 | |
| 256 | 1.68 | |
| 1024 | 1.70 | |
| 8192 | 1.70 | |
| 16384 | 1.70 | |
| AES-192-CFB | 16 | 1.39 |
| 64 | 1.54 | |
| 256 | 1.58 | |
| 1024 | 1.59 | |
| 8192 | 1.59 | |
| 16384 | 1.59 | |
| AES-256-CFB | 16 | 1.34 |
| 64 | 1.47 | |
| 256 | 1.50 | |
| 1024 | 1.51 | |
| 8192 | 1.52 | |
| 16384 | 1.52 |
| Decryption | Block Size (bytes) | Speedup |
|---|---|---|
| AES-128-CFB | 16 | 1.28 |
| 64 | 5.05 | |
| 256 | 16.58 | |
| 1024 | 23.19 | |
| 8192 | 23.17 | |
| 16384 | 23.18 | |
| AES-192-CFB | 16 | 1.39 |
| 64 | 5.01 | |
| 256 | 15.47 | |
| 1024 | 21.36 | |
| 8192 | 21.32 | |
| 16384 | 21.35 | |
| AES-256-CFB | 16 | 1.52 |
| 64 | 4.90 | |
| 256 | 15.64 | |
| 1024 | 19.99 | |
| 8192 | 19.96 | |
| 16384 | 19.97 |
Setup: Intel Sapphire Rapids SKU, benchmark pinned to 40th core, Turbo Boost off
Results should be interpreted in the context of the openssl speed benchmark.
Checklist:
- [X] CHANGES is updated
- [X] Apply https://github.com/openssl/openssl/pull/27078
- [X] Figure out AVX-512 support in masm and enable it on Windows.
I have enabled the AVX-512 encryption for AES-CFB128. @t8m could you please enable the CI workflows ?
I'd like to check the encryption half on all targets.
Pushed code-style & nasm/masm fixes. @t8m could you please enable the CI workflows one more time? Thanks!
@t8m, thanks for your quick response!
The 32-bit Windows build now uses CRYPTO_cfb128_encrypt with the AES-NI backend.
Could you please enable the CI workflows one more time?
@t8m, I have added 2 fixes for nasm/masm on Windows.
Could you please enable the CI workflows one more time?
It seems this has different author e-mail address than what we have in the CLA database. Could you please amend the author e-mail?
@paulidale This pull request includes a parallel implementation of the AES-CFB128 decryption. I would appreciate it if you could review the changes, the performance, and provide feedback.
Improvements should be visible in both single and multi-threaded scenarios.
Thanks in advance!
Hi @stanciuadrian - I am reviewing the code now (it might take me a while to review).
Is there any chance you could get this peer reviewed by other intel folks?
Is there any chance you could get this peer reviewed by other intel folks?
Thanks for your time! Yes, it will be reviewed internally too. I will mark it as draft until then.
Pending internal review by Intel folks.
Note also that I am not sure that the CI runs any of this code.. Which is why I started doing PR #27109
Can it be included in a future 3.5 release
Currently performance fixes are in general not backported to old branches unless an OTC vote (by exception) allows it..
it looks like the last commits are not relevant to this pr?
Were block sizes of 32 considered? i.e using aesenc ymm ?
In the current implementation the processing has 4 stages:
- process sets of 16 blocks until fewer than 16 remain
- process sets of 4 blocks until fewer than 4 remain; may run up to 3 times
- (missing) process sets of 2 blocks until fewer than 2 remain; may run at most once
- process blocks individually; may run up to 3 times
- process a partial block
A situation requiring processing 32 bytes would be rare, would only run once per payload, and the corresponding code would always increase the code footprint.
it looks like the last commits are not relevant to this pr?
The tooling commit adds support for older assemblers not supporting AVX-512.
The 32-bit commit attempts to fix a linker error in 32-bit builds.
Will squash them once I get the CI passing.
I notice the ilammy/setup-nasm@v1 step on 32-bit builds keeps failing.
nasm.us doesn't resolve anymore and might have expired. Could you please check the CI ?
Note also that I am not sure that the CI runs any of this code.. Which is why I started doing PR #27109
One way to find out is to run ./apps/openssl version -a and decode the bits of OPENSSL_ia32cap.
Running SDE with a proper SKU (e.g.: -spr) should enable this path.
To make sure the dispatcher works, one may look at the instruction mix histogram.
Having SDE in the CI is a great way to validate different execution paths when hardware is not available.
Did you consider leaving the partial block support as c code?.. and just call the assembler to do full block operations. That would seem to be simpler for what I imagine to be an uncommon edge case.. The first block could then just be a general aesni call.
Indeed, could have made the perlasm simpler.
The baseline C implementation from CRYPTO_cfb128_encrypt contains 3 stages:
pre: only partial XOR "blend"mid: many iterations of AES-encrypt 16-byte block with XOR "blend"post: AES-encrypt single 16-byte block with partial XOR "blend"
A complete assembly port of this function brings some advantages:
- coalesces byte-level loads/stores with masked ops during partial block processing in
preandpoststages preandpoststages are basically unrolled, without any flow control instructions- avoids indirect calls by inlining the AES encryption routine; helps with code locality
- caches the AES key schedule before
midand reuses it inpost
Looks like https://www.nasm.us is down currently..
Looks like https://www.nasm.us is down currently..
Pending https://github.com/openssl/openssl/pull/27462
Note: You can use
git log (to list commit ids - it doesnt need to be the last commit). git commit --fixup commit_id
These get auto squashed when the final merge happens. This avoids force pushing and is a bit easier to review the changes.
These get auto squashed when the final merge happens.
I have only force pushed to rebase over master.
All the feedback was applied in new commits.
Maybe a comment that links to this diagram might be helpful for someone trying to understand the decryption code..
Do you think this might be too wide to be included as a comment ?
(basic ASCII)
# +----+ +----------+ +----------+ +----------+
# | iv | | cipher 0 | | cipher 1 | | cipher 2 |
# +--+-+ +----+-----+ +----+-----+ +----+-----+
# | | | |
# | | | |
# +------v------+ +------v------+ +------v------+ +------v------+
# | AES encrypt | | AES encrypt | | AES encrypt | | AES encrypt |
# | with key | | with key | | with key | | with key |
# +------+------+ +------+------+ +------+------+ +------+------+
# | | | |
# | | | |
# +--v--+ +----------+ +--v--+ +----------+ +--v--+ +----------+ +--v--+ +----------+
# | XOR <-----+ cipher 0 | | XOR <-----+ cipher 1 | | XOR <-----+ cipher 2 | | XOR <-----+ cipher 3 |
# +--+--+ +----------+ +--+--+ +----------+ +--+--+ +----------+ +--+--+ +----------+
# | | | |
# | | | |
# +----v----+ +----v----+ +----v----+ +----v----+
# | plain 0 | | plain 1 | | plain 2 | | plain 3 |
# +---------+ +---------+ +---------+ +---------+
(extended ASCII)
# ┌────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
# │ iv │ │ cipher 0 │ │ cipher 1 │ │ cipher 2 │
# └──┬─┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
# │ │ │ │
# │ │ │ │
# ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
# │ AES encrypt │ │ AES encrypt │ │ AES encrypt │ │ AES encrypt │
# │ with key │ │ with key │ │ with key │ │ with key │
# └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
# │ │ │ │
# │ │ │ │
# ┌──▼──┐ ┌──────────┐ ┌──▼──┐ ┌──────────┐ ┌──▼──┐ ┌──────────┐ ┌──▼──┐ ┌──────────┐
# │ XOR ◄─────┼ cipher 0 │ │ XOR ◄─────┼ cipher 1 │ │ XOR ◄─────┼ cipher 2 │ │ XOR ◄─────┼ cipher 3 │
# └──┬──┘ └──────────┘ └──┬──┘ └──────────┘ └──┬──┘ └──────────┘ └──┬──┘ └──────────┘
# │ │ │ │
# │ │ │ │
# ┌────▼────┐ ┌────▼────┐ ┌────▼────┐ ┌────▼────┐
# │ plain 0 │ │ plain 1 │ │ plain 2 │ │ plain 3 │
# └─────────┘ └─────────┘ └─────────┘ └─────────┘
I think it's worth including the plain ASCII version as a comment somewhere. The width is high but not excessive.
Is there an ETA on when this will go back out of draft form? I cant imagine anyone else reviewing this PR until that point. I have finished reviewing the current PR. See comments.
Thanks for reviewing it ! I don't have an ETA for it, maybe 1-2 weeks.
might be helpful for someone trying to understand the decryption code
Added ASCII diagrams for encryption and decryption and more comments. Please let me know if they're detailed enough.
Is this still a draft?
Is this still a draft?
Everything is in place and ready to go forward.
Can you please squash the commits into one or two commits? I think the initial refactoring could be kept as separate commit but please squash the rest.
Can you please squash the commits into one or two commits? I think the initial refactoring could be kept as separate commit but please squash the rest.
Yes, it's a good idea to have 2 commits for this PR. I have squashed and redistributed the changes between them.
Please let me know if there's anything else I need to do.
ossl_aes_cfb128_vaes_eligible was also called for AES-CFB1 and AES-CFB8:
// gcc -E providers/implementations/ciphers/cipher_aes_cfb_hw.c -Iinclude -Iproviders/implementations/include
static const PROV_CIPHER_HW aes_cfb128 = {cipher_hw_aes_initkey, ossl_cipher_hw_generic_cfb128, cipher_hw_aes_copyctx};
static const PROV_CIPHER_HW aesni_cfb128 = {cipher_hw_aesni_initkey, ossl_cipher_hw_generic_cfb128, cipher_hw_aes_copyctx};
static const PROV_CIPHER_HW vaes_cfb128 = {cipher_hw_aesni_initkey, aes_cfb128_vaes_encdec_wrapper, cipher_hw_aes_copyctx};
const PROV_CIPHER_HW *ossl_prov_cipher_hw_aes_cfb128(size_t keybits)
{
if (AESNI_CAPABLE)
{
if (ossl_aes_cfb128_vaes_eligible()) // OK
return &vaes_cfb128;
return &aesni_cfb128;
}
return &aes_cfb128;
}
static const PROV_CIPHER_HW aes_cfb1 = {cipher_hw_aes_initkey, ossl_cipher_hw_generic_cfb1, cipher_hw_aes_copyctx};
static const PROV_CIPHER_HW aesni_cfb1 = {cipher_hw_aesni_initkey, ossl_cipher_hw_generic_cfb1, cipher_hw_aes_copyctx};
static const PROV_CIPHER_HW vaes_cfb1 = {cipher_hw_aesni_initkey, ossl_cipher_hw_generic_cfb1, cipher_hw_aes_copyctx};
const PROV_CIPHER_HW *ossl_prov_cipher_hw_aes_cfb1(size_t keybits)
{
if (AESNI_CAPABLE)
{
if (ossl_aes_cfb128_vaes_eligible()) // not OK
return &vaes_cfb1;
return &aesni_cfb1;
}
return &aes_cfb1;
}
static const PROV_CIPHER_HW aes_cfb8 = {cipher_hw_aes_initkey, ossl_cipher_hw_generic_cfb8, cipher_hw_aes_copyctx};
static const PROV_CIPHER_HW aesni_cfb8 = {cipher_hw_aesni_initkey, ossl_cipher_hw_generic_cfb8, cipher_hw_aes_copyctx};
static const PROV_CIPHER_HW vaes_cfb8 = {cipher_hw_aesni_initkey, ossl_cipher_hw_generic_cfb8, cipher_hw_aes_copyctx};
const PROV_CIPHER_HW *ossl_prov_cipher_hw_aes_cfb8(size_t keybits)
{
if (AESNI_CAPABLE)
{
if (ossl_aes_cfb128_vaes_eligible()) // not OK
return &vaes_cfb8;
return &aesni_cfb8;
}
return &aes_cfb8;
}
The proposed patch (in the 3rd commit, to be squashed if agreed upon) adds dummy vaes_eligible implementations for the 2 algorithms:
static int ossl_aes_cfb8_vaes_eligible(void) { return 0; }
static int ossl_aes_cfb1_vaes_eligible(void) { return 0; }
The dispatcher condition becomes if (ossl_aes_##mode##_vaes_eligible()).
(Edit: return 1 works too; the aes_cfb* and vaes_cfb* structures for AES-CFB1 and AES-CFB8 are the same).