
Significant multi-threaded performance degradation for encryption

Open mtnking opened this issue 3 years ago • 2 comments

Using multiple threads on a system with multiple cores performs noticeably worse than a single thread in the same environment for encryption.

This effect does not seem to impact decryption (which does passably scale), only encryption (RSA and AES tested), suggesting it's more than simply "python / python threads suck".

A similar benchmark using Rust scales appropriately, so it's likely not OpenSSL.

(* Versions of Python, cryptography, cffi, pip, and setuptools you're using) python 3.9.7, cryptography 36.0.1, cffi 1.15.0, pip3 20.3.4, setuptools 52.0.0

(* How you installed cryptography) pip3 install

(* Clear steps for reproducing your bug)

import os
import time
import concurrent.futures
from cryptography.hazmat.primitives.asymmetric import padding as pad
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.primitives import hashes, padding
from cryptography.hazmat.primitives.asymmetric import rsa

NUM_THREADS = os.cpu_count()

def process1(pubkey):

    message = b"................................"
    iv = bytes(16)
    sessionkey = bytes(32)
    # pad the data to the AES block size (128 bits)
    padder = padding.PKCS7(128).padder()
    padded_data = padder.update(message) + padder.finalize()
    #encrypt whee
    cipher = Cipher(algorithms.AES(sessionkey), modes.CBC(iv))
    encryptor = cipher.encryptor()
    DB = encryptor.update(padded_data) + encryptor.finalize()

    #crypto.layerEncrypt(node.pvtkey, message)
    # PubKey encrypt the SessionKey and Initialization Vector
    sessionKeyIV = bytes(48)  # sessionkey + iv
    SKIV = pubkey.encrypt(
        sessionKeyIV,
        pad.OAEP(
            mgf=pad.MGF1(algorithm=hashes.SHA256()),
            algorithm=hashes.SHA256(),
            label=None
        )
    )

def main():
    pvtKey = rsa.generate_private_key(
            public_exponent=65537,
            key_size=2048,
        )
    pubkey = pvtKey.public_key()

    start = time.time()
    for i in range(10000):
        process1(pubkey)
    end = time.time()
    print(end-start)

    with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
        time.sleep(1)

        start = time.time()
        future_to_url = {executor.submit(process1, pubkey): i for i in range(10000)}
        executor.shutdown(wait=True)

    end = time.time()

    print(end-start)

if __name__ == '__main__':
    main()

mtnking avatar Feb 24 '22 15:02 mtnking

Are you sure this overhead is introduced by cryptography? I started trying to minimize this and I found that if I make:

def process1(pubkey):
    pass

there's still a large performance difference between them! This makes it very hard to tell if the problem is in padding, symmetric encryption, or RSA encryption.
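
The observation above can be checked without any cryptography at all. The following is a minimal sketch, using only the standard library, that times a no-op called directly versus submitted to a `ThreadPoolExecutor`. The names `noop` and `time_noop` are illustrative and do not come from the original scripts. Any gap it reports is executor overhead (future creation, queueing, thread wake-ups), not encryption cost.

```python
import concurrent.futures
import os
import time


def noop(_arg=None):
    """Stand-in for process1 with the body removed."""
    pass


def time_noop(iterations=10000, workers=None):
    """Return (serial_seconds, threaded_seconds) for calling noop."""
    workers = workers or os.cpu_count() or 1

    # Baseline: plain function calls in a single thread.
    start = time.perf_counter()
    for _ in range(iterations):
        noop()
    serial = time.perf_counter() - start

    # Same number of calls, each submitted as its own task.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        start = time.perf_counter()
        futures = [executor.submit(noop, i) for i in range(iterations)]
        concurrent.futures.wait(futures)
        threaded = time.perf_counter() - start

    return serial, threaded


if __name__ == "__main__":
    serial, threaded = time_noop()
    print(f"serial:   {serial:.4f}s")
    print(f"threaded: {threaded:.4f}s")
```

Subtracting the threaded no-op time from a threaded benchmark gives a rough idea of how much of the measured time belongs to the work itself.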

alex avatar Feb 24 '22 23:02 alex

Perhaps this will be more convincing. It's the same script with decryption included: the same encryption is much slower with threads, while the same decryption is much faster with threads:

import os
import time
import concurrent.futures
from cryptography.hazmat.primitives.asymmetric import padding as pad
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.primitives import hashes, padding
from cryptography.hazmat.primitives.asymmetric import rsa

# Integer division keeps max_workers an int; max() guards single-core machines.
NUM_THREADS = max(1, os.cpu_count() // 2)

def process1(pubkey):
    message = b"................................"
    iv = bytes(16)
    sessionkey = bytes(32)
    # pad the data to the AES block size (128 bits)
    padder = padding.PKCS7(128).padder()
    padded_data = padder.update(message) + padder.finalize()
    #encrypt whee
    cipher = Cipher(algorithms.AES(sessionkey), modes.CBC(iv))
    encryptor = cipher.encryptor()
    DB = encryptor.update(padded_data) + encryptor.finalize()

    #crypto.layerEncrypt(node.pvtkey, message)
    # PubKey encrypt the SessionKey and Initialization Vector
    sessionKeyIV = bytes(48)  # sessionkey + iv
    SKIV = pubkey.encrypt(
        sessionKeyIV,
        pad.OAEP(
            mgf=pad.MGF1(algorithm=hashes.SHA256()),
            algorithm=hashes.SHA256(),
            label=None
        )
    )

def prep(pubkey):
    message = b"................................"
    iv = bytes(16)
    sessionkey = bytes(32)
    # pad the data to the AES block size (128 bits)
    padder = padding.PKCS7(128).padder()
    padded_data = padder.update(message) + padder.finalize()
    #encrypt whee
    cipher = Cipher(algorithms.AES(sessionkey), modes.CBC(iv))
    encryptor = cipher.encryptor()
    DB = encryptor.update(padded_data) + encryptor.finalize()

    #crypto.layerEncrypt(node.pvtkey, message)
    # PubKey encrypt the SessionKey and Initialization Vector
    sessionKeyIV = bytes(48)  # sessionkey + iv
    SKIV = pubkey.encrypt(
        sessionKeyIV,
        pad.OAEP(
            mgf=pad.MGF1(algorithm=hashes.SHA256()),
            algorithm=hashes.SHA256(),
            label=None
        )
    )
    return(SKIV, DB)

def process2(PvtKey, SKIV, DB):
    sessionKeyIV = PvtKey.decrypt(
        SKIV,
        pad.OAEP(
            mgf=pad.MGF1(algorithm=hashes.SHA256()),
            algorithm=hashes.SHA256(),
            label=None
        )
    )
    sessionKey = sessionKeyIV[:32]
    iv = sessionKeyIV[32:]
    AESciphertext = DB

    #decrypt the body via AES
    cipher = Cipher(algorithms.AES(sessionKey), modes.CBC(iv))
    decryptor = cipher.decryptor()
    DB = decryptor.update(AESciphertext) + decryptor.finalize()

def main():
    pvtKey = rsa.generate_private_key(
            public_exponent=65537,
            key_size=2048,
        )
    pubkey = pvtKey.public_key()

    start = time.time()
    for i in range(10000):
        process1(pubkey)
    end = time.time()
    print(end-start)

    with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
        time.sleep(1)

        start = time.time()
        future_to_url = {executor.submit(process1, pubkey): i for i in range(10000)}
        executor.shutdown(wait=True)

    end = time.time()

    print(end-start)

    (SKIV, DB) = prep(pubkey)

    start = time.time()
    for i in range(10000):
        process2(pvtKey, SKIV, DB)
    end = time.time()
    print(end-start)

    with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
        time.sleep(1)

        start = time.time()
        future_to_url = {executor.submit(process2, pvtKey, SKIV, DB): i for i in range(10000)}
        executor.shutdown(wait=True)

    end = time.time()

    print(end-start)

if __name__ == '__main__':
    main()
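
One way to separate per-task executor overhead from the cost of the cryptographic calls is to give each thread a single long-running batch instead of 10000 short tasks, so the submission cost is paid once per thread. The following is a minimal sketch under that assumption. `run_chunked` is a hypothetical helper, not part of the script above, and `func` is any zero-argument callable.

```python
import concurrent.futures
import time


def run_chunked(func, total_calls, workers):
    """Call func() total_calls times, split into one batch per worker thread.

    Returns the elapsed wall-clock time in seconds.
    """
    # Distribute the calls as evenly as possible across the workers.
    base, extra = divmod(total_calls, workers)
    batch_sizes = [base + (1 if i < extra else 0) for i in range(workers)]

    def run_batch(count):
        for _ in range(count):
            func()

    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        # list() consumes the results so an exception in any batch surfaces here.
        list(executor.map(run_batch, batch_sizes))
    return time.perf_counter() - start
```

With the script above, a call such as `run_chunked(lambda: process1(pubkey), 10000, workers)` with an integer `workers` could stand in for the `executor.submit` loop. If encryption still fails to scale under this harness, per-task overhead is less likely to be the cause.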

mtnking avatar Feb 25 '22 03:02 mtnking

There are a lot of variables here, but here's the analysis we came to:

RSA encryption and decryption have different performance. Decryption is slower (because it's a private key operation). We release the GIL during calls into OpenSSL, so a slower operation spends more of its time with the GIL released. This means decryption benefits more from multiple threads than encryption does.

This, combined with the general overhead of the GIL in Python, is our best explanation for what's going on, and it's not clear how we could resolve it. So for now we're going to close this as wontfix.
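
The reasoning above (longer calls outside the GIL benefit more from threads) can be illustrated with a standard-library stand-in. The sketch below does not measure RSA or the cryptography package. It assumes `hashlib.sha256` as a substitute workload, relying on the documented behaviour that `hashlib` releases the GIL for inputs larger than 2047 bytes. The names `hash_many` and `thread_speedup` are illustrative.

```python
import concurrent.futures
import hashlib
import time


def hash_many(payload, calls):
    """Hash the same payload repeatedly; each call is one unit of work."""
    for _ in range(calls):
        hashlib.sha256(payload).digest()


def thread_speedup(payload_size, calls_per_worker=200, workers=4):
    """Return serial_time / threaded_time for the same total amount of work."""
    payload = bytes(payload_size)

    # Serial: run every worker's batch one after another.
    start = time.perf_counter()
    for _ in range(workers):
        hash_many(payload, calls_per_worker)
    serial = time.perf_counter() - start

    # Threaded: one batch per thread.
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [
            executor.submit(hash_many, payload, calls_per_worker)
            for _ in range(workers)
        ]
        for future in futures:
            future.result()
    threaded = time.perf_counter() - start

    return serial / threaded


if __name__ == "__main__":
    # Short calls: the GIL is held throughout (input below the release threshold).
    print("64 B payload speedup: ", thread_speedup(64, calls_per_worker=20000))
    # Long calls: most of each call runs with the GIL released.
    print("1 MiB payload speedup:", thread_speedup(1 << 20, calls_per_worker=50))
```

A ratio above 1 means threads helped, and a ratio at or below 1 means they did not. The actual numbers depend on the machine and Python build.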

alex avatar Oct 10 '22 20:10 alex