
On hardware acceleration and prioritizing ChaCha20-Poly1305

Open raycoll opened this issue 6 years ago • 7 comments

Problem

For hardware that supports AES acceleration, AES-GCM is the preferred bulk encryption algorithm in TLS, primarily for performance reasons. For hardware that does not have AES acceleration, ChaCha20-Poly1305 is much faster (~400%) than any AES-based cipher. Since AES-GCM and ChaCha20-Poly1305 are both mature encryption modes (and provide equal bits of security), our TLS cipher priority decision between the two can be based on performance alone.

Here are some results I've compiled that demonstrate this: https://gist.github.com/raycoll/62a660602b9ec9fb67b6443f16732080

The problem is that clients that perform much better with ChaCha20-Poly1305 will effectively never negotiate it with an s2n server, because we prefer AES-GCM in our current preference lists.

Example

Assume we have:

client_cipher_prefs = ("ECDHE-ECDSA-CHACHA20-POLY1305", "AES128-SHA", "ECDHE-RSA-AES128-SHA", "ECDHE-RSA-AES-128-GCM-SHA256");
server_cipher_prefs = ("ECDHE-RSA-AES-128-GCM-SHA256", "ECDHE-RSA-AES-256-GCM-SHA384", "ECDHE-ECDSA-CHACHA20-POLY1305", "ECDHE-RSA-CHACHA20-POLY1305");

The outcome of this negotiation will be ECDHE-RSA-AES-128-GCM-SHA256, even if the client does not support AES acceleration.
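To make the example concrete, here is a minimal sketch of plain server-preference selection. The function name negotiate_server_preference is hypothetical and this is not s2n's actual selection code; it ignores certificate type, protocol version, and every other real-world constraint.

```python
from typing import Optional, Sequence


def negotiate_server_preference(client_prefs: Sequence[str],
                                server_prefs: Sequence[str]) -> Optional[str]:
    """Return the first suite in the server's order that the client also offers."""
    offered = set(client_prefs)
    for suite in server_prefs:
        if suite in offered:
            return suite
    return None  # no suite in common


client_cipher_prefs = ("ECDHE-ECDSA-CHACHA20-POLY1305", "AES128-SHA",
                       "ECDHE-RSA-AES128-SHA", "ECDHE-RSA-AES-128-GCM-SHA256")
server_cipher_prefs = ("ECDHE-RSA-AES-128-GCM-SHA256", "ECDHE-RSA-AES-256-GCM-SHA384",
                       "ECDHE-ECDSA-CHACHA20-POLY1305", "ECDHE-RSA-CHACHA20-POLY1305")

# The client's own ordering never influences the result.
chosen = negotiate_server_preference(client_cipher_prefs, server_cipher_prefs)
print(chosen)  # ECDHE-RSA-AES-128-GCM-SHA256
```

The client listing ChaCha20 first has no effect because only the server's order is consulted.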

Proposed Solution as a TLS client

During s2n library initialization, check if the CPU has AES acceleration. For x86 we can use cpuinfo. We can add support for other CPU types later once we have a clear idea of how they perform (I'm assuming most . If the CPU has AES acceleration, use a cipher preference list that has AES-GCM cipher suites prioritized. If the CPU does not have AES acceleration, use a cipher preference list that has ChaCha20-Poly1305 prioritized.

Practically, this change will not help much on its own, since most servers still use plain server preference and will ignore the fact that ChaCha20 is at the top of our client list. See the next section.

This is similar to the approach used by Chromium: https://codereview.chromium.org/91913002

Example

client_cipher_prefs_no_aes = ("ECDHE-ECDSA-CHACHA20-POLY1305", "AES128-SHA", "ECDHE-RSA-AES128-SHA", "ECDHE-RSA-AES-128-GCM-SHA256");
client_cipher_prefs_aes = ("ECDHE-RSA-AES-128-GCM-SHA256", "ECDHE-ECDSA-CHACHA20-POLY1305", "AES128-SHA", "ECDHE-RSA-AES128-SHA");
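A rough sketch of the client-side selection, assuming that "cpuinfo" means the flags line of Linux's /proc/cpuinfo on x86 (it could equally mean the CPUID instruction). The helper names x86_has_aes and pick_client_prefs are made up for illustration and are not part of any s2n API.

```python
from typing import Tuple

CLIENT_PREFS_NO_AES = ("ECDHE-ECDSA-CHACHA20-POLY1305", "AES128-SHA",
                       "ECDHE-RSA-AES128-SHA", "ECDHE-RSA-AES-128-GCM-SHA256")
CLIENT_PREFS_AES = ("ECDHE-RSA-AES-128-GCM-SHA256", "ECDHE-ECDSA-CHACHA20-POLY1305",
                    "AES128-SHA", "ECDHE-RSA-AES128-SHA")


def x86_has_aes(cpuinfo_text: str) -> bool:
    """Look for the 'aes' token on the first 'flags' line of x86 cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return "aes" in value.split()
    return False  # no flags line found: assume no acceleration


def pick_client_prefs(has_aes: bool) -> Tuple[str, ...]:
    """Choose the preference list once, at library initialization."""
    return CLIENT_PREFS_AES if has_aes else CLIENT_PREFS_NO_AES
```

On a Linux x86 host this could be driven by passing the contents of /proc/cpuinfo to x86_has_aes; defaulting to the ChaCha20-first list when detection fails is a design choice of this sketch, not something the proposal specifies.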

Proposed Solution as a TLS server

Assuming TLS client software adopts the approach in the previous section (Chromium already has it and soon s2n client software will), we can update our server selection algorithm to detect if a client is boosting ChaCha20. When the client prefers ChaCha20, the s2n server should also prioritize ChaCha20.

This is similar to an approach used by boringssl via equal preference groups https://boringssl.googlesource.com/boringssl/+/858a88daf27975f67d9f63e18f95645be2886bfb

Example

client_cipher_prefs = ("ECDHE-ECDSA-CHACHA20-POLY1305", "AES128-SHA", "ECDHE-RSA-AES128-SHA", "ECDHE-RSA-AES-128-GCM-SHA256");
server_cipher_prefs = ("ECDHE-RSA-AES-128-GCM-SHA256", "ECDHE-RSA-AES-256-GCM-SHA384", "ECDHE-ECDSA-CHACHA20-POLY1305", "ECDHE-RSA-CHACHA20-POLY1305");

The outcome of this negotiation should be ECDHE-ECDSA-CHACHA20-POLY1305

client_cipher_prefs = ("ECDHE-RSA-AES-128-GCM-SHA256", "ECDHE-ECDSA-CHACHA20-POLY1305", "AES128-SHA", "ECDHE-RSA-AES128-SHA");
server_cipher_prefs = ("ECDHE-RSA-AES-128-GCM-SHA256", "ECDHE-RSA-AES-256-GCM-SHA384", "ECDHE-ECDSA-CHACHA20-POLY1305", "ECDHE-RSA-CHACHA20-POLY1305");

The outcome of this negotiation should be "ECDHE-RSA-AES-128-GCM-SHA256"
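One way the server-side rule could look, as a sketch only. It assumes "the client is boosting ChaCha20" means "the client's first-listed suite is a ChaCha20 suite", which is one possible reading of the proposal; the function names are hypothetical, and neither BoringSSL's equal-preference groups nor any real s2n code is reproduced here.

```python
from typing import Optional, Sequence


def is_chacha20(suite: str) -> bool:
    return "CHACHA20" in suite


def negotiate_with_chacha_boost(client_prefs: Sequence[str],
                                server_prefs: Sequence[str]) -> Optional[str]:
    """Server preference, except that a ChaCha20-first client gets ChaCha20."""
    offered = set(client_prefs)
    # Mutually supported suites, kept in the server's order.
    candidates = [s for s in server_prefs if s in offered]
    if not candidates:
        return None
    if is_chacha20(client_prefs[0]):
        for suite in candidates:
            if is_chacha20(suite):
                return suite
    return candidates[0]


server_cipher_prefs = ("ECDHE-RSA-AES-128-GCM-SHA256", "ECDHE-RSA-AES-256-GCM-SHA384",
                       "ECDHE-ECDSA-CHACHA20-POLY1305", "ECDHE-RSA-CHACHA20-POLY1305")
chacha_first_client = ("ECDHE-ECDSA-CHACHA20-POLY1305", "AES128-SHA",
                       "ECDHE-RSA-AES128-SHA", "ECDHE-RSA-AES-128-GCM-SHA256")
aes_first_client = ("ECDHE-RSA-AES-128-GCM-SHA256", "ECDHE-ECDSA-CHACHA20-POLY1305",
                    "AES128-SHA", "ECDHE-RSA-AES128-SHA")

print(negotiate_with_chacha_boost(chacha_first_client, server_cipher_prefs))
print(negotiate_with_chacha_boost(aes_first_client, server_cipher_prefs))
```

Clients that list ChaCha20 anywhere other than first fall through to ordinary server preference, so existing clients with unusual orderings are unaffected.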

Proposed s2n user facing API

I am inclined to bundle this behavior inside a new s2n_cipher_preferences version string. The user interface to opt in to this feature is to change the configured version string. If we get positive results from this feature, we can consider including it in the default version string in a later release.

raycoll avatar Feb 05 '19 02:02 raycoll

I think the proposed solutions are pretty non-controversial, since multiple other TLS libraries have implemented something similar. I'm more interested in any feedback on the API.

raycoll avatar Feb 05 '19 02:02 raycoll

As a TLS client, I think flipping preferences based on platform is a simple first step to implement this. However we should keep in mind that some non-x86 platforms may have AES hardware acceleration[1] and do AES-GCM faster than ChaCha20-Poly1305.

For a wacky alternative, what if we run a quick performance test when the s2n library is initialized and choose preferences based on the results?

[1] https://aws.amazon.com/ec2/graviton/

raycoll avatar Dec 10 '19 21:12 raycoll

A way of implementing this to get discussion going https://github.com/awslabs/s2n/pull/1650

raycoll avatar Mar 07 '20 19:03 raycoll

For x86 we can use cpuinfo. We can add support for other CPU types later once we have a clear idea of how they perform

For ARMv8-A, AES HW accel is optional, and I've seen a StackExchange question about a capable SoC (Snapdragon 410 IIRC, from 2014) that supports AES well, but due to the device's Android OS/kernel (32-bit I think), /proc/cpuinfo implied it was ARMv7-A instead.

The advised solution for Linux and Android devices was to use HWCAP:

Parsing /proc/cpuinfo is a popular way to detect CPU features. However I strongly recommend not to use /proc/cpuinfo on ARMv8-A for cpu feature detection, as this is not a portable way of detecting CPU features.

While ARMv8-A has been available for quite some time, apparently there are new budget devices in 2020 being released (such as Android Go phones) which still use the Cortex-A53 cores and lack HW acceleration for AES.
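For illustration, a sketch of the HWCAP approach on 64-bit ARM Linux. The constants AT_HWCAP = 16 and the AES bit (1 << 3) are what I recall from the Linux headers and should be verified against them; the function names are hypothetical. 32-bit ARM kernels expose the bit differently and are not covered here.

```python
import ctypes
import platform

AT_HWCAP = 16                 # auxiliary-vector key (assumed from <elf.h>)
HWCAP_AES_AARCH64 = 1 << 3    # AES bit in the arm64 hwcap header (assumed)


def aarch64_hwcap_has_aes(hwcap: int) -> bool:
    """Test the AES bit in an AT_HWCAP value from an aarch64 kernel."""
    return bool(hwcap & HWCAP_AES_AARCH64)


def read_hwcap() -> int:
    """Ask the C library for AT_HWCAP via getauxval(3)."""
    libc = ctypes.CDLL(None)
    libc.getauxval.restype = ctypes.c_ulong
    libc.getauxval.argtypes = [ctypes.c_ulong]
    return libc.getauxval(AT_HWCAP)


def detect_aes_on_aarch64_linux() -> bool:
    """Only meaningful on aarch64 Linux; refuse to guess elsewhere."""
    if platform.system() != "Linux" or platform.machine() != "aarch64":
        raise RuntimeError("this sketch only handles aarch64 Linux")
    return aarch64_hwcap_has_aes(read_hwcap())
```

Unlike parsing /proc/cpuinfo, this asks the kernel directly which features it exposes to the process.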

polarathene avatar Oct 10 '20 03:10 polarathene

What was the CPU / environment that your linked gist results were calculated on btw? These are results from a desktop system with Intel Skylake i5-6500 (4 cores, 4 threads) running on Manjaro Linux KDE:

## Tested using OpenSSL 1.1.1g April 2020

# Includes AES-NI instructions
$ openssl speed -evp aes-256-gcm                                             
Doing aes-256-gcm for 3s on 16 size blocks: 78233395 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 64 size blocks: 51950665 aes-256-gcm's in 2.99s
Doing aes-256-gcm for 3s on 256 size blocks: 23471196 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 1024 size blocks: 8548219 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 8192 size blocks: 1250424 aes-256-gcm's in 2.99s
Doing aes-256-gcm for 3s on 16384 size blocks: 639218 aes-256-gcm's in 3.00s
OpenSSL 1.1.1g  21 Apr 2020
built on: Sun May 24 19:14:32 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) idea(int) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -Wa,--noexecstack -D_FORTIFY_SOURCE=2 -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-gcm     417244.77k  1111987.48k  2002875.39k  2917792.09k  3425910.84k  3490982.57k

# Hardware acceleration enabled
# ChaCha20-Poly1305 is ~53% as fast as AES-NI AES256-GCM
$ openssl speed -evp chacha20-poly1305                                       
Doing chacha20-poly1305 for 3s on 16 size blocks: 44388737 chacha20-poly1305's in 2.99s
Doing chacha20-poly1305 for 3s on 64 size blocks: 22329092 chacha20-poly1305's in 2.99s
Doing chacha20-poly1305 for 3s on 256 size blocks: 11470218 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 1024 size blocks: 5142767 chacha20-poly1305's in 2.99s
Doing chacha20-poly1305 for 3s on 8192 size blocks: 689465 chacha20-poly1305's in 2.99s
Doing chacha20-poly1305 for 3s on 16384 size blocks: 347027 chacha20-poly1305's in 3.00s
OpenSSL 1.1.1g  21 Apr 2020
built on: Sun May 24 19:14:32 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) idea(int) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -Wa,--noexecstack -D_FORTIFY_SOURCE=2 -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
chacha20-poly1305   237531.70k   477947.12k   978791.94k  1761268.70k  1888995.75k  1895230.12k

# Disabled AES-NI instructions
# ~90% performance drop compared to AES-NI AES256-GCM
$ OPENSSL_ia32cap="~0x200000200000000" openssl speed -evp aes-256-gcm
Doing aes-256-gcm for 3s on 16 size blocks: 22344250 aes-256-gcm's in 2.99s
Doing aes-256-gcm for 3s on 64 size blocks: 6630966 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 256 size blocks: 1714084 aes-256-gcm's in 2.99s
Doing aes-256-gcm for 3s on 1024 size blocks: 431993 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 8192 size blocks: 55084 aes-256-gcm's in 2.99s
Doing aes-256-gcm for 3s on 16384 size blocks: 27548 aes-256-gcm's in 3.00s
OpenSSL 1.1.1g  21 Apr 2020
built on: Sun May 24 19:14:32 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) idea(int) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -Wa,--noexecstack -D_FORTIFY_SOURCE=2 -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-gcm     119567.89k   141460.61k   146757.69k   147453.61k   150919.11k   150448.81k

# Hardware acceleration disabled.
# ChaCha20-Poly1305 is ~800% faster than non AES-NI AES256-GCM
$ OPENSSL_ia32cap="~0x200000200000000" openssl speed -evp chacha20-poly1305
Doing chacha20-poly1305 for 3s on 16 size blocks: 44536812 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 64 size blocks: 22467543 chacha20-poly1305's in 2.99s
Doing chacha20-poly1305 for 3s on 256 size blocks: 10350530 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 1024 size blocks: 2816405 chacha20-poly1305's in 2.99s
Doing chacha20-poly1305 for 3s on 8192 size blocks: 358253 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 16384 size blocks: 181822 chacha20-poly1305's in 2.99s
OpenSSL 1.1.1g  21 Apr 2020
built on: Sun May 24 19:14:32 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) idea(int) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -Wa,--noexecstack -D_FORTIFY_SOURCE=2 -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
chacha20-poly1305   237529.66k   480910.62k   883245.23k   964548.07k   978269.53k   996311.59k

Results breakdown

Just for clarity: "100% faster" here means 1x the speed (no difference), so double (2x) the speed is 200% "faster", 3x is 300%, and so forth.

AES-NI disabled: ChaCha20-Poly1305 vs AES-256-GCM - ChaCha20-Poly1305 is 200-660% faster (average ~500%)

((44536812 / 22344250) + (22467543 / 6630966) + (10350530 / 1714084) + (2816405 / 431993) + (358253 / 55084) + (181822 / 27548)) / 6 
= (2 + 3.4 + 6 + 6.5 + 6.5 + 6.6) / 6 
= 5.17

AES-256-GCM with AES-NI vs AES-256-GCM without - With AES-NI is 350-2,320% faster (average ~1,500%)

((78233395 / 22344250) + (51950665 / 6630966) + (23471196 / 1714084) + (8548219 / 431993) + (1250424 / 55084) + (639218 / 27548)) / 6 
= (3.5 + 7.8 + 13.7 + 19.8 + 22.7 + 23.2) / 6 
= 15.12

Or alternatively calculating the other way around: Software only AES-256-GCM is at 4-29% the speed (average 90% loss) of AES-256-GCM with AES-NI

((22344250 / 78233395) + (6630966 / 51950665) + (1714084 / 23471196) + (431993 / 8548219) + (55084 / 1250424) + (27548 / 639218)) / 6 
= (0.286 + 0.128 + 0.073 + 0.051 + 0.044 + 0.043) / 6 
= 0.104

AES-NI enabled: AES-256-GCM vs ChaCha20-Poly1305 - AES-256-GCM with AES-NI is 166-233% faster (average 190%)

((78233395 / 44388737) + (51950665 / 22329092) + (23471196 / 11470218) + (8548219 / 5142767) + (1250424 / 689465) + (639218 / 347027)) / 6 
= (1.8 + 2.3 + 2 + 1.7 + 1.8 + 1.8) / 6 
= 1.90

ChaCha20-Poly1305 (with no OPENSSL_ia32cap env) vs AES-256-GCM no AES-NI - ChaCha20-Poly1305 is 200-1,260% faster (average ~800%)

((44388737 / 22344250) + (22329092 / 6630966) + (11470218 / 1714084) + (5142767 / 431993) + (689465 / 55084) + (347027 / 27548)) / 6 
= (2 + 3.4 + 6.7 + 11.9 + 12.5 + 12.6) / 6 
= 8.18
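The averages above can be re-derived with a short script. This is only a convenience sketch: it uses the raw operation counts from the openssl speed output (ignoring the small 2.99 s vs 3.00 s differences, as the hand calculation does), and the list and function names are my own.

```python
from statistics import mean

# Operation counts per run, ordered by block size: 16, 64, 256, 1024, 8192, 16384 bytes.
AES_NI = [78233395, 51950665, 23471196, 8548219, 1250424, 639218]
AES_SOFT = [22344250, 6630966, 1714084, 431993, 55084, 27548]
CHACHA = [44388737, 22329092, 11470218, 5142767, 689465, 347027]
CHACHA_MASKED = [44536812, 22467543, 10350530, 2816405, 358253, 181822]


def ratios(fast, slow):
    """Per-block-size speed ratio of one run over another."""
    return [f / s for f, s in zip(fast, slow)]


def summarize(label, fast, slow):
    r = ratios(fast, slow)
    print(f"{label}: min {min(r):.2f}x, max {max(r):.2f}x, mean {mean(r):.2f}x")
    return mean(r)


summarize("ChaCha20 (masked) vs software AES", CHACHA_MASKED, AES_SOFT)
summarize("AES-NI vs software AES", AES_NI, AES_SOFT)
summarize("AES-NI vs ChaCha20", AES_NI, CHACHA)
summarize("ChaCha20 vs software AES", CHACHA, AES_SOFT)
```

Because the script does not round each ratio before averaging, its means can differ from the hand-rounded figures in the last digit.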

~0x200000200000000 affecting ChaCha20-Poly1305 was unexpected, and differs from your results. The outcome is the same even with ~0x000000000000000 (~0), which from what I can make of the OPENSSL_ia32cap docs shouldn't cause the performance degradation. Either way, it has nothing to do with PCLMULQDQ (bit 33) or AES-NI (bit 57) support being disabled. Using a value of 0 results in further perf degradation (I assume the ~ is the NOT operator flipping the bits to all 1 except 33 and 57?).
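As a sanity check on the mask itself, the following sketch decodes which bits 0x200000200000000 names and shows the effect of the ~ prefix as I read the OPENSSL_ia32cap documentation (the listed bits are cleared from the detected capability vector, everything else is left alone). The detected value below is a made-up placeholder, not a real CPUID result.

```python
def set_bits(mask: int) -> list:
    """Indices of the bits that are 1 in mask, lowest first."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


MASK = 0x200000200000000
print(set_bits(MASK))  # [33, 57]

# Hypothetical detected capability vector with all 64 bits set.
detected = (1 << 64) - 1

# "~MASK" semantics: clear the named bits, keep the rest.
effective = detected & ~MASK
assert not (effective >> 33) & 1 and not (effective >> 57) & 1

# "~0" names no bits, so under this reading it should clear nothing.
assert detected & ~0 == detected
```

Under that reading, ~0 should be a no-op, which is why the slowdown observed with it is surprising.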

Notably, the 800% faster figure for ChaCha20-Poly1305 vs software-only AES-256-GCM in my results can be a bit misleading, since it really ranges between 2-12x depending on block size. You're also probably more likely to see AES-128-GCM negotiated, which is a little bit faster than AES-256-GCM too. Presumably the performance ratio also differs on the ARM hardware that this would actually be targeting?

polarathene avatar Oct 10 '20 07:10 polarathene

This issue: #851 may be relevant to this problem.

maddeleine avatar Jul 28 '22 17:07 maddeleine

I think #851 works for this problem (negotiating ChaCha20 with clients that signal a lack of AES acceleration) in closed environments where there is full control over client and server. I don't think pure client preference is viable for other use cases, because it will lead to non-optimal and possibly less secure negotiations for clients that currently have... strangely ordered ciphers.

raycoll avatar Jul 28 '22 21:07 raycoll

We recently added ChaCha20 boosting capability to the s2n-tls security policy. Once enabled, the server will prioritize ChaCha20 if the client signals a preference for it in the ClientHello. For more details see #3543.

zaherd avatar Dec 23 '22 16:12 zaherd