mbedtls
AES and RSA (bn_mul) optimizations for Visual Studio 64-bit
Description
The changes improve mbedtls performance when compiling with Visual Studio for 64-bit targets by adding code paths that use intrinsic functions, in particular:
- AES encryption/decryption using AES-NI
- bignum multiplication using 128-bit unsigned multiply (umul) and add-with-carry (adc) instructions
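
To illustrate the second item, here is a minimal sketch of a multiply-accumulate inner loop built on the 64x64-to-128-bit multiply and add-with-carry intrinsics. It is not the code from this PR: `mul_add_limbs` and the `SKETCH_MSVC_X64` macro are hypothetical names, and the `unsigned __int128` branch is only there so the same sketch builds with GCC and Clang.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>              /* _umul128, _addcarry_u64 */
#define SKETCH_MSVC_X64 1        /* hypothetical macro, for this sketch only */
#elif !defined(__SIZEOF_INT128__)
#error "this sketch needs MSVC x64 intrinsics or unsigned __int128"
#endif

/* Hypothetical helper: d[0..n) += s[0..n) * b, returns the carry-out limb. */
static uint64_t mul_add_limbs(uint64_t *d, const uint64_t *s, size_t n, uint64_t b)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(SKETCH_MSVC_X64)
        unsigned __int64 hi, lo, sum;
        unsigned char c;
        lo = _umul128(s[i], b, &hi);             /* 128-bit product hi:lo      */
        c = _addcarry_u64(0, lo, carry, &sum);   /* add the incoming carry     */
        _addcarry_u64(c, hi, 0, &hi);            /* propagate into high limb   */
        c = _addcarry_u64(0, sum, d[i], &d[i]);  /* accumulate into d[i]       */
        _addcarry_u64(c, hi, 0, &hi);            /* propagate into high limb   */
        carry = hi;
#else
        /* s*b + d + carry always fits in 128 bits, so this cannot overflow. */
        unsigned __int128 t = (unsigned __int128) s[i] * b + d[i] + carry;
        d[i] = (uint64_t) t;
        carry = (uint64_t) (t >> 64);
#endif
    }
    return carry;
}
```

Calling `mul_add_limbs(d, s, n, b)` once per limb of the second operand gives a schoolbook multiplication; the returned carry has to be added into the next limb of `d` by the caller.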
Also switch to TSC timing, because the QueryPerformanceCounter resolution is too low and it does not return CPU cycles.
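
A rough sketch of what TSC-based timing can look like, assuming the `__rdtsc()` intrinsic (available from `<intrin.h>` on MSVC and `<x86intrin.h>` on GCC/Clang). The names `read_cycle_counter`, `cycles_per_byte` and `SKETCH_HAVE_TSC` are made up for this example and are not the benchmark program's actual functions.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define SKETCH_HAVE_TSC 1        /* hypothetical macro, for this sketch only */
#elif defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#define SKETCH_HAVE_TSC 1
#endif

/* Hypothetical helper: read the time stamp counter, or 0 if unavailable. */
static uint64_t read_cycle_counter(void)
{
#if defined(SKETCH_HAVE_TSC)
    return (uint64_t) __rdtsc();
#else
    return 0;                    /* no TSC on this target */
#endif
}

/* Hypothetical helper: average cycles per processed byte between two reads. */
static uint64_t cycles_per_byte(uint64_t start, uint64_t end, size_t bytes)
{
    return bytes != 0 ? (end - start) / bytes : 0;
}
```

Typical use is to read the counter before and after processing a buffer and pass both readings to `cycles_per_byte` together with the buffer size.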
Status
READY/IN DEVELOPMENT
Requires Backporting
NO
- This PR is a new feature/enhancement
Migrations
NO
Additional comments
I ran the compat.sh tests (with some modifications that are not part of this PR) on Windows in a WSL/bash environment, and they pass. I did notice some instabilities related to shutting down the client (./tests/compat.sh: 1052: kill: No such process), but those are not related to this PR.
As tests are not run in CI under Windows (afaiu), I'd like to know how to ensure the ongoing correctness of the code I'm submitting.
Related to testing: at least the current versions of clang, and probably gcc, appear to support intrinsic functions for AES-NI, CLMUL and ADC. Would it be feasible to switch the inline assembly to intrinsics for compilers other than MSVC on Windows too, deduplicating the code? I'm a bit worried about possible compiler/optimizer issues though.
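
As a rough illustration of the shared-intrinsics idea, the sketch below encrypts one AES-128 block with AES-NI intrinsics and is meant to build with MSVC, GCC and Clang on x86-64. It is not the code from this PR: every `sketch_*` / `SKETCH_*` name is hypothetical, the `target` attribute is an assumption about how one might avoid requiring `-maes` globally, and the wrapper returns -1 where AES-NI is unavailable.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#define SKETCH_HAVE_X64 1
#include <emmintrin.h>           /* SSE2: __m128i, loads, stores, shifts */
#include <wmmintrin.h>           /* AES-NI: _mm_aesenc_si128 and friends */
#if defined(__GNUC__) || defined(__clang__)
#include <cpuid.h>
#define SKETCH_TARGET __attribute__((target("aes,sse2")))
#else
#include <intrin.h>
#define SKETCH_TARGET
#endif

/* CPUID leaf 1, ECX bit 25 reports AES-NI support. */
static int sketch_has_aesni(void)
{
#if defined(__GNUC__) || defined(__clang__)
    unsigned int a, b, c, d;
    if (!__get_cpuid(1, &a, &b, &c, &d)) return 0;
    return (int) ((c >> 25) & 1u);
#else
    int r[4];
    __cpuid(r, 1);
    return (int) (((unsigned int) r[2] >> 25) & 1u);
#endif
}

/* One AES-128 key schedule step: derive the next round key. */
SKETCH_TARGET static __m128i sketch_expand_step(__m128i key, __m128i assist)
{
    assist = _mm_shuffle_epi32(assist, 0xff);
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    return _mm_xor_si128(key, assist);
}

/* The round constant must be a compile-time immediate, hence a macro. */
#define SKETCH_EXPAND(i, rcon) \
    rk[i] = sketch_expand_step(rk[(i) - 1], \
                               _mm_aeskeygenassist_si128(rk[(i) - 1], rcon))

SKETCH_TARGET static void sketch_aes128_impl(const uint8_t key[16],
                                             const uint8_t in[16],
                                             uint8_t out[16])
{
    __m128i rk[11];
    rk[0] = _mm_loadu_si128((const __m128i *) key);
    SKETCH_EXPAND(1, 0x01); SKETCH_EXPAND(2, 0x02); SKETCH_EXPAND(3, 0x04);
    SKETCH_EXPAND(4, 0x08); SKETCH_EXPAND(5, 0x10); SKETCH_EXPAND(6, 0x20);
    SKETCH_EXPAND(7, 0x40); SKETCH_EXPAND(8, 0x80); SKETCH_EXPAND(9, 0x1b);
    SKETCH_EXPAND(10, 0x36);

    __m128i s = _mm_xor_si128(_mm_loadu_si128((const __m128i *) in), rk[0]);
    for (int i = 1; i < 10; i++)
        s = _mm_aesenc_si128(s, rk[i]);          /* rounds 1..9   */
    s = _mm_aesenclast_si128(s, rk[10]);         /* final round   */
    _mm_storeu_si128((__m128i *) out, s);
}
#endif /* x86-64 */

/* Returns 0 on success, -1 if AES-NI is not available on this build or CPU. */
static int sketch_aes128_encrypt_block(const uint8_t key[16],
                                       const uint8_t in[16], uint8_t out[16])
{
#if defined(SKETCH_HAVE_X64)
    if (!sketch_has_aesni()) return -1;
    sketch_aes128_impl(key, in, out);
    return 0;
#else
    (void) key; (void) in; (void) out;
    return -1;
#endif
}
```

A real implementation would expand the key once and reuse the round keys across blocks; the schedule is recomputed here only to keep the sketch short. Checking the output against a known-answer vector on each compiler is one way to catch the compiler/optimizer issues mentioned above.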
Todos
- [ ] Tests
- [ ] Documentation
- [ ] Changelog updated
- [ ] Backported
Steps to test or reproduce
On Windows 64bit, compiled with Visual Studio, observe numbers from .\programs\test\benchmark.exe without and with the PR. Selected numbers on my machine:
before
AES-CBC-128 : 203930 KiB/s, 15 cycles/byte
AES-CBC-192 : 171486 KiB/s, 18 cycles/byte
AES-CBC-256 : 159248 KiB/s, 19 cycles/byte
AES-XTS-128 : 175882 KiB/s, 17 cycles/byte
AES-XTS-256 : 138802 KiB/s, 22 cycles/byte
AES-GCM-128 : 95727 KiB/s, 33 cycles/byte
AES-GCM-192 : 90064 KiB/s, 35 cycles/byte
AES-GCM-256 : 85229 KiB/s, 37 cycles/byte
AES-CCM-128 : 102824 KiB/s, 31 cycles/byte
AES-CCM-192 : 90658 KiB/s, 35 cycles/byte
AES-CCM-256 : 80782 KiB/s, 39 cycles/byte
CTR_DRBG (NOPR) : 174899 KiB/s, 18 cycles/byte
CTR_DRBG (PR) : 121910 KiB/s, 26 cycles/byte
RSA-2048 : 8674 public/s
RSA-2048 : 212 private/s
RSA-4096 : 2173 public/s
RSA-4096 : 33 private/s
after
AES-CBC-128 : 432099 KiB/s, 7 cycles/byte
AES-CBC-192 : 385591 KiB/s, 8 cycles/byte
AES-CBC-256 : 360928 KiB/s, 9 cycles/byte
AES-XTS-128 : 407371 KiB/s, 8 cycles/byte
AES-XTS-256 : 336302 KiB/s, 9 cycles/byte
AES-GCM-128 : 190789 KiB/s, 17 cycles/byte
AES-GCM-192 : 171782 KiB/s, 18 cycles/byte
AES-GCM-256 : 173023 KiB/s, 18 cycles/byte
AES-CCM-128 : 260800 KiB/s, 12 cycles/byte
AES-CCM-192 : 241114 KiB/s, 13 cycles/byte
CTR_DRBG (NOPR) : 397238 KiB/s, 8 cycles/byte
CTR_DRBG (PR) : 262009 KiB/s, 12 cycles/byte
RSA-2048 : 19703 public/s
RSA-2048 : 458 private/s
RSA-4096 : 5197 public/s
RSA-4096 : 76 private/s
AWESOME! :D I have a question: is AES-NI not supported in 32-bit mode?
@mrsshr No, it is not.
Hi @orlx,
thanks a lot for your contribution! :+1: We will look into it and come back to you afterwards.
Kind regards, Hanno
@mrsshr the inline assembler code for GCC and Clang appears to be 32-bit clean (it even limits itself to the first six FPU regs!).
Given the proper ISA support, it works without issues in protected mode. 🤷🏻‍♂️
@hanno-arm any update on this?
We are now converting older PRs to draft PRs where the following conditions are met: They have not been updated in the last 3 months, and they need more than non-trivial work to complete.