openssl icon indicating copy to clipboard operation
openssl copied to clipboard

Implement AES CFB128 with Intel AES-NI and VAES

Open stanciuadrian opened this issue 9 months ago • 32 comments

Phases:

  1. Refactoring: extract the AES CFB functionality to cipher_aes_cfb* files
  • Move PROV_CIPHER_HW_aes_mode(cfb...) generator macros from cipher_aes_hw.c to cipher_aes_cfb_hw.c
  • Move CFB code from cipher_aes_hw_s390x.inc to cipher_aes_cfb_hw_s390x.inc
  • Move ossl_prov_cipher_hw_aes_cfb declarations from cipher_aes.h to cipher_aes_cfb.h
  • Move CFB code from cipher_aes_hw_aesni.inc to cipher_aes_cfb_hw_aesni.inc and drop paths not relevant to EVP_CIPH_CFB_MODE.
  • Move CFB code from cipher_aes_hw.c to cipher_aes_cfb_hw.c and drop paths not relevant to EVP_CIPH_CFB_MODE.
  1. Add VAES feature detection and dispatching. AES-NI is required for aesni_set_encrypt_key.

  2. Add serial AES-CFB128 encryption with AVX-512 and VAES

  3. Add parallel AES-CFB128 decryption with AVX-512 and VAES

  4. Hard-code vbroadcasti32x4 source operand size for nasm/masm translation.

openssl speed speedups have been obtained with:

$ LD_LIBRARY_PATH=`pwd` taskset -c 40 ./apps/openssl speed -evp AES-128-CFB
$ LD_LIBRARY_PATH=`pwd` taskset -c 40 ./apps/openssl speed -evp AES-192-CFB
$ LD_LIBRARY_PATH=`pwd` taskset -c 40 ./apps/openssl speed -evp AES-256-CFB

$ LD_LIBRARY_PATH=`pwd` taskset -c 40 ./apps/openssl speed -evp AES-128-CFB --decrypt
$ LD_LIBRARY_PATH=`pwd` taskset -c 40 ./apps/openssl speed -evp AES-192-CFB --decrypt
$ LD_LIBRARY_PATH=`pwd` taskset -c 40 ./apps/openssl speed -evp AES-256-CFB --decrypt
Encryption Block Size (bytes) Speedup
AES-128-CFB 16 1.45
64 1.63
256 1.68
1024 1.70
8192 1.70
16384 1.70
AES-192-CFB 16 1.39
64 1.54
256 1.58
1024 1.59
8192 1.59
16384 1.59
AES-256-CFB 16 1.34
64 1.47
256 1.50
1024 1.51
8192 1.52
16384 1.52
Decryption Block Size (bytes) Speedup
AES-128-CFB 16 1.28
64 5.05
256 16.58
1024 23.19
8192 23.17
16384 23.18
AES-192-CFB 16 1.39
64 5.01
256 15.47
1024 21.36
8192 21.32
16384 21.35
AES-256-CFB 16 1.52
64 4.90
256 15.64
1024 19.99
8192 19.96
16384 19.97

Setup: Intel Sapphire Rapids SKU, benchmark pinned to 40th core, Turbo Boost off Results should be interpreted in the context of the openssl speed benchmark.

Checklist:

  • [X] CHANGES is updated
  • [X] Apply https://github.com/openssl/openssl/pull/27078
  • [X] Figure out AVX-512 support in masm and enable it on Windows.

stanciuadrian avatar Feb 25 '25 19:02 stanciuadrian

I have enabled the AVX-512 encryption for AES-CFB128. @t8m could you please enable the CI workflows ?

I'd like to check the encryption half on all targets.

stanciuadrian avatar Mar 21 '25 11:03 stanciuadrian

Pushed code-style & nasm/masm fixes. @t8m could you please enable the CI workflows one more time? Thanks!

stanciuadrian avatar Mar 22 '25 17:03 stanciuadrian

@t8m, thanks for your quick response!

The 32-bit Windows build now uses CRYPTO_cfb128_encrypt with the AES-NI backend. Could you please enable the CI workflows one more time?

stanciuadrian avatar Mar 23 '25 10:03 stanciuadrian

@t8m, I have added 2 fixes for nasm/masm on Windows.

Could you please enable the CI workflows one more time?

stanciuadrian avatar Mar 30 '25 22:03 stanciuadrian

It seems this has different author e-mail address than what we have in the CLA database. Could you please amend the author e-mail?

t8m avatar Apr 01 '25 09:04 t8m

@paulidale This pull request includes a parallel implementation of the AES-CFB128 decryption. I would appreciate it if you could review the changes, the performance, and provide feedback.

Improvements should be visible in both single and multi-threaded scenarios.

Thanks in advance!

stanciuadrian avatar Apr 02 '25 14:04 stanciuadrian

Hi @stanciuadrian - I am reviewing the code now (it might take me a while to review).

Is there any chance you could get this peer reviewed by other intel folks?

slontis avatar Apr 16 '25 05:04 slontis

Is there any chance you could get this peer reviewed by other intel folks?

Thanks for your time! Yes, it will be reviewed internally too. I will mark it as draft until then.

stanciuadrian avatar Apr 16 '25 06:04 stanciuadrian

Pending internal review by Intel folks.

stanciuadrian avatar Apr 16 '25 06:04 stanciuadrian

Note also that I am not sure that the CI runs any of this code.. Which is why I started doing PR #27109

slontis avatar Apr 16 '25 23:04 slontis

Can it be included in a future 3.5 release

wbeck10 avatar Apr 17 '25 04:04 wbeck10

Currently performance fixes are in general not backported to old branches unless an OTC vote (by exception) allows it..

slontis avatar Apr 17 '25 04:04 slontis

it looks like the last commits are not relevant to this pr?

slontis avatar Apr 22 '25 01:04 slontis

Were block sizes of 32 considered? i.e using aesenc ymm ?

In the current implementation the processing has 4 stages:

  • process sets of 16 blocks until fewer than 16 remain
  • process sets of 4 blocks until fewer than 4 remain; may run up to 3 times
  • (missing) process sets of 2 blocks until fewer than 2 remain; may run at most once
  • process blocks individually; may run up to 3 times
  • process a partial block

A situation requiring processing 32 bytes would be rare, would only run once per payload, and the corresponding code would always increase the code footprint.

stanciuadrian avatar Apr 22 '25 05:04 stanciuadrian

it looks like the last commits are not relevant to this pr?

The tooling commit adds support for older assemblers not supporting AVX-512. The 32-bit commit attempts to fix a linker error in 32-bit builds. Will squash them once I get the CI passing.

I notice the ilammy/setup-nasm@v1 step on 32-bit builds keeps failing. nasm.us doesn't resolve anymore and might have expired. Could you please check the CI ?

stanciuadrian avatar Apr 22 '25 06:04 stanciuadrian

Note also that I am not sure that the CI runs any of this code.. Which is why I started doing PR #27109

One way to find out is to run ./apps/openssl version -a and decode the bits of OPENSSL_ia32cap. Running SDE with a proper SKU (e.g.: -spr) should enable this path.

To make sure the dispatcher works, one may look at the instruction mix histogram.

Having SDE in the CI is a great way to validate different execution paths when hardware is not available.

stanciuadrian avatar Apr 22 '25 07:04 stanciuadrian

Did you consider leaving the partial block support as c code?.. and just call the assembler to do full block operations. That would seem to be simpler for what I imagine to be an uncommon edge case.. The first block could then just be a general aesni call.

Indeed, could have made the perlasm simpler.

The baseline C implementation from CRYPTO_cfb128_encrypt contains 3 stages:

  • pre: only partial XOR "blend"
  • mid: many iterations of AES-encrypt 16-byte block with XOR "blend"
  • post: AES-encrypt single 16-byte block with partial XOR "blend"

A complete assembly port of this function brings some advantages:

  • coalesces byte-level loads/stores with masked ops during partial block processing in pre and post stages
  • pre and post stages are basically unrolled, without any flow control instructions
  • avoids indirect calls by inlining the AES encryption routine; helps with code locality
  • caches the AES key schedule before mid and reuses it in post

stanciuadrian avatar Apr 22 '25 08:04 stanciuadrian

Looks like https://www.nasm.us is down currently..

slontis avatar Apr 22 '25 08:04 slontis

Looks like https://www.nasm.us is down currently..

Pending https://github.com/openssl/openssl/pull/27462

stanciuadrian avatar Apr 22 '25 09:04 stanciuadrian

Note: You can use

git log (to list commit ids - it doesnt need to be the last commit). git commit --fixup commit_id

These get auto squashed when the final merge happens. This avoids force pushing and is a bit easier to review the changes.

slontis avatar Apr 25 '25 22:04 slontis

These get auto squashed when the final merge happens.

I have only force pushed to rebase over master. All the feedback was applied in new commits.

stanciuadrian avatar Apr 25 '25 22:04 stanciuadrian

Maybe a comment that links to this diagram might be helpful for someone trying to understand the decryption code..

Do you think this might be too wide to be included as a comment ?

(basic ASCII)

#     +----+                    +----------+                +----------+                +----------+              
#     | iv |                    | cipher 0 |                | cipher 1 |                | cipher 2 |              
#     +--+-+                    +----+-----+                +----+-----+                +----+-----+              
#        |                           |                           |                           |                    
#        |                           |                           |                           |                    
# +------v------+             +------v------+             +------v------+             +------v------+             
# | AES encrypt |             | AES encrypt |             | AES encrypt |             | AES encrypt |             
# |  with  key  |             |  with  key  |             |  with  key  |             |  with  key  |             
# +------+------+             +------+------+             +------+------+             +------+------+             
#        |                           |                           |                           |                    
#        |                           |                           |                           |                    
#     +--v--+     +----------+    +--v--+     +----------+    +--v--+     +----------+    +--v--+     +----------+
#     | XOR <-----+ cipher 0 |    | XOR <-----+ cipher 1 |    | XOR <-----+ cipher 2 |    | XOR <-----+ cipher 3 |
#     +--+--+     +----------+    +--+--+     +----------+    +--+--+     +----------+    +--+--+     +----------+
#        |                           |                           |                           |                    
#        |                           |                           |                           |                    
#   +----v----+                 +----v----+                 +----v----+                 +----v----+               
#   | plain 0 |                 | plain 1 |                 | plain 2 |                 | plain 3 |               
#   +---------+                 +---------+                 +---------+                 +---------+               

(extended ASCII)

#     ┌────┐                    ┌──────────┐                ┌──────────┐                ┌──────────┐              
#     │ iv │                    │ cipher 0 │                │ cipher 1 │                │ cipher 2 │              
#     └──┬─┘                    └────┬─────┘                └────┬─────┘                └────┬─────┘              
#        │                           │                           │                           │                    
#        │                           │                           │                           │                    
# ┌──────▼──────┐             ┌──────▼──────┐             ┌──────▼──────┐             ┌──────▼──────┐             
# │ AES encrypt │             │ AES encrypt │             │ AES encrypt │             │ AES encrypt │             
# │  with  key  │             │  with  key  │             │  with  key  │             │  with  key  │             
# └──────┬──────┘             └──────┬──────┘             └──────┬──────┘             └──────┬──────┘             
#        │                           │                           │                           │                    
#        │                           │                           │                           │                    
#     ┌──▼──┐     ┌──────────┐    ┌──▼──┐     ┌──────────┐    ┌──▼──┐     ┌──────────┐    ┌──▼──┐     ┌──────────┐
#     │ XOR ◄─────┼ cipher 0 │    │ XOR ◄─────┼ cipher 1 │    │ XOR ◄─────┼ cipher 2 │    │ XOR ◄─────┼ cipher 3 │
#     └──┬──┘     └──────────┘    └──┬──┘     └──────────┘    └──┬──┘     └──────────┘    └──┬──┘     └──────────┘
#        │                           │                           │                           │                    
#        │                           │                           │                           │                    
#   ┌────▼────┐                 ┌────▼────┐                 ┌────▼────┐                 ┌────▼────┐               
#   │ plain 0 │                 │ plain 1 │                 │ plain 2 │                 │ plain 3 │               
#   └─────────┘                 └─────────┘                 └─────────┘                 └─────────┘               

stanciuadrian avatar May 04 '25 21:05 stanciuadrian

I think it's worth including the plain ASCII version as a comment somewhere. The width is high but not excessive.

paulidale avatar May 04 '25 22:05 paulidale

Is there an ETA on when this will go back out of draft form? I cant imagine anyone else reviewing this PR until that point. I have finished reviewing the current PR. See comments.

Thanks for reviewing it ! I don't have an ETA for it, maybe 1-2 weeks.

stanciuadrian avatar May 14 '25 13:05 stanciuadrian

might be helpful for someone trying to understand the decryption code

Added ASCII diagrams for encryption and decryption and more comments. Please let me know if they're detailed enough.

stanciuadrian avatar May 14 '25 16:05 stanciuadrian

Is this still a draft?

slontis avatar May 28 '25 00:05 slontis

Is this still a draft?

Everything is in place and ready to go forward.

stanciuadrian avatar May 29 '25 08:05 stanciuadrian

Can you please squash the commits into one or two commits? I think the initial refactoring could be kept as separate commit but please squash the rest.

t8m avatar May 29 '25 14:05 t8m

Can you please squash the commits into one or two commits? I think the initial refactoring could be kept as separate commit but please squash the rest.

Yes, it's a good idea to have 2 commits for this PR. I have squashed and redistributed the changes between them.

Please let me know if there's anything else I need to do.

stanciuadrian avatar May 30 '25 17:05 stanciuadrian

ossl_aes_cfb128_vaes_eligible was also called for AES-CFB1 and AES-CFB8:

// gcc -E providers/implementations/ciphers/cipher_aes_cfb_hw.c -Iinclude -Iproviders/implementations/include

static const PROV_CIPHER_HW aes_cfb128 = {cipher_hw_aes_initkey, ossl_cipher_hw_generic_cfb128, cipher_hw_aes_copyctx};
static const PROV_CIPHER_HW aesni_cfb128 = {cipher_hw_aesni_initkey, ossl_cipher_hw_generic_cfb128, cipher_hw_aes_copyctx};
static const PROV_CIPHER_HW vaes_cfb128 = {cipher_hw_aesni_initkey, aes_cfb128_vaes_encdec_wrapper, cipher_hw_aes_copyctx};
const PROV_CIPHER_HW *ossl_prov_cipher_hw_aes_cfb128(size_t keybits)
{
    if (AESNI_CAPABLE)
    {
        if (ossl_aes_cfb128_vaes_eligible()) // OK
            return &vaes_cfb128;
        return &aesni_cfb128;
    }
    return &aes_cfb128;
}
static const PROV_CIPHER_HW aes_cfb1 = {cipher_hw_aes_initkey, ossl_cipher_hw_generic_cfb1, cipher_hw_aes_copyctx};
static const PROV_CIPHER_HW aesni_cfb1 = {cipher_hw_aesni_initkey, ossl_cipher_hw_generic_cfb1, cipher_hw_aes_copyctx};
static const PROV_CIPHER_HW vaes_cfb1 = {cipher_hw_aesni_initkey, ossl_cipher_hw_generic_cfb1, cipher_hw_aes_copyctx};
const PROV_CIPHER_HW *ossl_prov_cipher_hw_aes_cfb1(size_t keybits)
{
    if (AESNI_CAPABLE)
    {
        if (ossl_aes_cfb128_vaes_eligible()) // not OK
            return &vaes_cfb1;
        return &aesni_cfb1;
    }
    return &aes_cfb1;
}
static const PROV_CIPHER_HW aes_cfb8 = {cipher_hw_aes_initkey, ossl_cipher_hw_generic_cfb8, cipher_hw_aes_copyctx};
static const PROV_CIPHER_HW aesni_cfb8 = {cipher_hw_aesni_initkey, ossl_cipher_hw_generic_cfb8, cipher_hw_aes_copyctx};
static const PROV_CIPHER_HW vaes_cfb8 = {cipher_hw_aesni_initkey, ossl_cipher_hw_generic_cfb8, cipher_hw_aes_copyctx};
const PROV_CIPHER_HW *ossl_prov_cipher_hw_aes_cfb8(size_t keybits)
{
    if (AESNI_CAPABLE)
    {
        if (ossl_aes_cfb128_vaes_eligible()) // not OK
            return &vaes_cfb8;
        return &aesni_cfb8;
    }
    return &aes_cfb8;
}

The proposed patch (in the 3rd commit, to be squashed if agreed upon) adds dummy vaes_eligible implementations for the 2 algorithms:

static int ossl_aes_cfb8_vaes_eligible(void) { return 0; }
static int ossl_aes_cfb1_vaes_eligible(void) { return 0; }

The dispatcher condition becomes if (ossl_aes_##mode##_vaes_eligible()).

(Edit: return 1 works too; the aes_cfb* and vaes_cfb* structures for AES-CFB1 and AES-CFB8 are the same).

stanciuadrian avatar Jun 11 '25 14:06 stanciuadrian