AES RISC-V 64-bit ASM: ECB/CBC/CTR/GCM/CCM

Open SparkiDev opened this issue 1 year ago • 2 comments

Description

Add assembly implementations of AES in ECB/CBC/CTR/GCM/CCM modes for RISC-V. The assembly uses standard, scalar cryptography, or vector cryptography instructions.

Testing

./configure --enable-all --enable-riscv-asm

Checklist

  • [x] added tests
  • [ ] updated/added doxygen
  • [ ] updated appropriate READMEs
  • [ ] Updated manual and documentation

SparkiDev avatar May 22 '24 09:05 SparkiDev

retest this please

SparkiDev avatar May 22 '24 11:05 SparkiDev

@SparkiDev I've got a few RISC-V targets here now, so I will try this on actual HW.

dgarske avatar May 22 '24 15:05 dgarske

HiFive Unleashed at 1.4GHz. The new asm is like 50 times faster.

./configure --enable-riscv-asm && make

root@HiFiveU:~/wolfssl-riscv# ./wolfcrypt/benchmark/benchmark -aes-cbc -aes-gcm
------------------------------------------------------------------------------
 wolfSSL version 5.7.0
------------------------------------------------------------------------------
Math:   Multi-Precision: Wolf(SP) word-size=64 bits=3072 sp_int.c
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
AES-128-CBC-enc             20 MiB took 1.076 seconds,   18.588 MiB/s
AES-128-CBC-dec             20 MiB took 1.083 seconds,   18.473 MiB/s
AES-192-CBC-enc             20 MiB took 1.245 seconds,   16.062 MiB/s
AES-192-CBC-dec             20 MiB took 1.246 seconds,   16.047 MiB/s
AES-256-CBC-enc             15 MiB took 1.057 seconds,   14.189 MiB/s
AES-256-CBC-dec             15 MiB took 1.055 seconds,   14.212 MiB/s
AES-128-GCM-enc             15 MiB took 1.300 seconds,   11.543 MiB/s
AES-128-GCM-dec             15 MiB took 1.300 seconds,   11.535 MiB/s
AES-192-GCM-enc             15 MiB took 1.425 seconds,   10.526 MiB/s
AES-192-GCM-dec             15 MiB took 1.425 seconds,   10.523 MiB/s
AES-256-GCM-enc             10 MiB took 1.032 seconds,    9.687 MiB/s
AES-256-GCM-dec             10 MiB took 1.032 seconds,    9.691 MiB/s
GMAC Table 4-bit            31 MiB took 1.025 seconds,   30.251 MiB/s
Benchmark complete

On master

./configure --enable-all && make

root@HiFiveU:~/wolfssl# ./wolfcrypt/benchmark/benchmark -aes-cbc -aes-gcm
------------------------------------------------------------------------------
 wolfSSL version 5.7.0
------------------------------------------------------------------------------
Math:   Multi-Precision: Wolf(SP) word-size=64 bits=4096 sp_int.c
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
AES-128-CBC-enc              5 MiB took 12.798 seconds,    0.391 MiB/s
AES-128-CBC-dec              5 MiB took 12.672 seconds,    0.395 MiB/s
AES-192-CBC-enc              5 MiB took 15.301 seconds,    0.327 MiB/s
AES-192-CBC-dec              5 MiB took 15.181 seconds,    0.329 MiB/s
AES-256-CBC-enc              5 MiB took 17.820 seconds,    0.281 MiB/s
AES-256-CBC-dec              5 MiB took 17.669 seconds,    0.283 MiB/s
AES-128-GCM-enc              5 MiB took 12.870 seconds,    0.388 MiB/s
AES-128-GCM-dec              5 MiB took 12.870 seconds,    0.388 MiB/s
AES-192-GCM-enc              5 MiB took 15.375 seconds,    0.325 MiB/s
AES-192-GCM-dec              5 MiB took 15.376 seconds,    0.325 MiB/s
AES-256-GCM-enc              5 MiB took 17.878 seconds,    0.280 MiB/s
AES-256-GCM-dec              5 MiB took 17.896 seconds,    0.279 MiB/s
AES-128-GCM-STREAM-enc       5 MiB took 12.878 seconds,    0.388 MiB/s
AES-128-GCM-STREAM-dec       5 MiB took 12.878 seconds,    0.388 MiB/s
AES-192-GCM-STREAM-enc       5 MiB took 15.379 seconds,    0.325 MiB/s
AES-192-GCM-STREAM-dec       5 MiB took 15.385 seconds,    0.325 MiB/s
AES-256-GCM-STREAM-enc       5 MiB took 17.881 seconds,    0.280 MiB/s
AES-256-GCM-STREAM-dec       5 MiB took 17.888 seconds,    0.280 MiB/s
GMAC Table 4-bit            30 MiB took 1.006 seconds,   29.831 MiB/s
Benchmark complete

dgarske avatar May 26 '24 14:05 dgarske

./configure --enable-all --enable-riscv-asm

wolfcrypt/src/aes.c: In function '_AesXtsHelper':
wolfcrypt/src/aes.c:12631:16: error: implicit declaration of function '_AesEcbEncrypt'; did you mean 'wc_AesEcbEncrypt'? [-Werror=implicit-function-declaration]
12631 |         return _AesEcbEncrypt(aes, out, out, totalSz);
      |                ^~~~~~~~~~~~~~
      |                wc_AesEcbEncrypt
wolfcrypt/src/aes.c:12631:16: error: nested extern declaration of '_AesEcbEncrypt' [-Werror=nested-externs]
wolfcrypt/src/aes.c:12634:16: error: implicit declaration of function '_AesEcbDecrypt'; did you mean 'wc_AesEcbDecrypt'? [-Werror=implicit-function-declaration]
12634 |         return _AesEcbDecrypt(aes, out, out, totalSz);
      |                ^~~~~~~~~~~~~~
      |                wc_AesEcbDecrypt
wolfcrypt/src/aes.c:12634:16: error: nested extern declaration of '_AesEcbDecrypt' [-Werror=nested-externs]

@SparkiDev says AES XTS is not yet supported with RISC-V ASM. Note: I tried ./configure --enable-all --disable-aesxtx --enable-riscv-asm, but that didn't work. We normally support disabling a specific option alongside --enable-all. Sean, please review.

dgarske avatar May 29 '24 16:05 dgarske

@SparkiDev is this RISC-V ASM PR ready for merge? I can’t tell if you are planning to push anything else to it.

dgarske avatar Jun 03 '24 16:06 dgarske

Fixed --enable-all to work.

SparkiDev avatar Jun 06 '24 03:06 SparkiDev

retest this please

SparkiDev avatar Jun 06 '24 03:06 SparkiDev

Updated benchmarks:

HiFive Unleashed at 1.4GHz

./configure --enable-all --enable-riscv-asm
make

root@HiFiveU:~/wolfssl# ./wolfcrypt/benchmark/benchmark
------------------------------------------------------------------------------
 wolfSSL version 5.7.0
------------------------------------------------------------------------------
Math: 	Multi-Precision: Wolf(SP) word-size=64 bits=4096 sp_int.c
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
RNG                         10 MiB took 1.488 seconds,    6.721 MiB/s
AES-128-CBC-enc             20 MiB took 1.139 seconds,   17.554 MiB/s
AES-128-CBC-dec             20 MiB took 1.145 seconds,   17.470 MiB/s
AES-192-CBC-enc             20 MiB took 1.321 seconds,   15.144 MiB/s
AES-192-CBC-dec             20 MiB took 1.321 seconds,   15.139 MiB/s
AES-256-CBC-enc             15 MiB took 1.115 seconds,   13.450 MiB/s
AES-256-CBC-dec             15 MiB took 1.123 seconds,   13.361 MiB/s
AES-128-GCM-enc             15 MiB took 1.395 seconds,   10.750 MiB/s
AES-128-GCM-dec             15 MiB took 1.372 seconds,   10.933 MiB/s
AES-192-GCM-enc             10 MiB took 1.007 seconds,    9.930 MiB/s
AES-192-GCM-dec             10 MiB took 1.006 seconds,    9.940 MiB/s
AES-256-GCM-enc             10 MiB took 1.088 seconds,    9.188 MiB/s
AES-256-GCM-dec             10 MiB took 1.088 seconds,    9.192 MiB/s
GMAC Table 4-bit            31 MiB took 1.029 seconds,   30.136 MiB/s
AES-128-ECB-enc             22 MiB took 1.218 seconds,   18.063 MiB/s
AES-128-ECB-dec             22 MiB took 1.209 seconds,   18.191 MiB/s
AES-192-ECB-enc             22 MiB took 1.414 seconds,   15.556 MiB/s
AES-192-ECB-dec             22 MiB took 1.406 seconds,   15.644 MiB/s
AES-256-ECB-enc             22 MiB took 1.601 seconds,   13.740 MiB/s
AES-256-ECB-dec             22 MiB took 1.608 seconds,   13.677 MiB/s
AES-XTS-enc                 15 MiB took 1.193 seconds,   12.569 MiB/s
AES-XTS-dec                 15 MiB took 1.190 seconds,   12.608 MiB/s
AES-128-CFB                 20 MiB took 1.319 seconds,   15.167 MiB/s
AES-192-CFB                 15 MiB took 1.115 seconds,   13.447 MiB/s
AES-256-CFB                 15 MiB took 1.240 seconds,   12.092 MiB/s
AES-128-OFB                 20 MiB took 1.316 seconds,   15.202 MiB/s
AES-192-OFB                 15 MiB took 1.114 seconds,   13.461 MiB/s
AES-256-OFB                 15 MiB took 1.240 seconds,   12.094 MiB/s
AES-128-CTR                 20 MiB took 1.134 seconds,   17.639 MiB/s
AES-192-CTR                 20 MiB took 1.317 seconds,   15.181 MiB/s
AES-256-CTR                 15 MiB took 1.109 seconds,   13.526 MiB/s
AES-CCM-enc                 10 MiB took 1.087 seconds,    9.202 MiB/s
AES-CCM-dec                 10 MiB took 1.088 seconds,    9.194 MiB/s
AES-256-SIV-enc             10 MiB took 1.151 seconds,    8.686 MiB/s
AES-256-SIV-dec             10 MiB took 1.149 seconds,    8.704 MiB/s
AES-384-SIV-enc             10 MiB took 1.330 seconds,    7.521 MiB/s
AES-384-SIV-dec             10 MiB took 1.329 seconds,    7.526 MiB/s
AES-512-SIV-enc             10 MiB took 1.497 seconds,    6.681 MiB/s
AES-512-SIV-dec             10 MiB took 1.496 seconds,    6.683 MiB/s
Camellia                    15 MiB took 1.297 seconds,   11.563 MiB/s
ARC4                        30 MiB took 1.121 seconds,   26.756 MiB/s
CHACHA                      30 MiB took 1.016 seconds,   29.525 MiB/s
CHA-POLY                    25 MiB took 1.140 seconds,   21.934 MiB/s
3DES                         5 MiB took 1.632 seconds,    3.064 MiB/s
MD5                         75 MiB took 1.050 seconds,   71.403 MiB/s
POLY1305                    90 MiB took 1.053 seconds,   85.442 MiB/s
SHA                         35 MiB took 1.101 seconds,   31.787 MiB/s
SHA-224                     20 MiB took 1.112 seconds,   17.980 MiB/s
SHA-256                     20 MiB took 1.114 seconds,   17.952 MiB/s
SHA-384                     15 MiB took 1.359 seconds,   11.038 MiB/s
SHA-512                     15 MiB took 1.315 seconds,   11.406 MiB/s
SHA-512/224                 15 MiB took 1.461 seconds,   10.269 MiB/s
SHA-512/256                 15 MiB took 1.461 seconds,   10.266 MiB/s
SHA3-224                    20 MiB took 1.187 seconds,   16.849 MiB/s
SHA3-256                    20 MiB took 1.250 seconds,   15.998 MiB/s
SHA3-384                    15 MiB took 1.197 seconds,   12.532 MiB/s
SHA3-512                    10 MiB took 1.140 seconds,    8.770 MiB/s
SHAKE128                    20 MiB took 1.034 seconds,   19.339 MiB/s
SHAKE256                    20 MiB took 1.250 seconds,   16.002 MiB/s
RIPEMD                      20 MiB took 1.071 seconds,   18.679 MiB/s
BLAKE2b                     30 MiB took 1.155 seconds,   25.973 MiB/s
BLAKE2s                     20 MiB took 1.202 seconds,   16.637 MiB/s
AES-128-CMAC                20 MiB took 1.166 seconds,   17.149 MiB/s
AES-256-CMAC                15 MiB took 1.136 seconds,   13.200 MiB/s
HMAC-MD5                    75 MiB took 1.050 seconds,   71.403 MiB/s
HMAC-SHA                    35 MiB took 1.099 seconds,   31.834 MiB/s
HMAC-SHA224                 20 MiB took 1.115 seconds,   17.931 MiB/s
HMAC-SHA256                 20 MiB took 1.116 seconds,   17.921 MiB/s
HMAC-SHA384                 20 MiB took 1.134 seconds,   17.640 MiB/s
HMAC-SHA512                 20 MiB took 1.182 seconds,   16.917 MiB/s
PBKDF2                       2 KiB took 1.011 seconds,    2.195 KiB/s
SipHash-8                  130 MiB took 1.018 seconds,  127.690 MiB/s
SipHash-16                 130 MiB took 1.018 seconds,  127.697 MiB/s
KDF      128     SRTP    205045 ops took 1.000 sec, avg 0.005 ms, 205041.431 ops/sec
KDF      256     SRTP    140095 ops took 1.000 sec, avg 0.007 ms, 140092.996 ops/sec
KDF      128    SRTCP    204845 ops took 1.000 sec, avg 0.005 ms, 204843.486 ops/sec
KDF      256    SRTCP    139070 ops took 1.000 sec, avg 0.007 ms, 139067.480 ops/sec
scrypt    17                 10 ops took 5.608 sec, avg 560.843 ms, 1.783 ops/sec
RSA     1024  key gen         6 ops took 1.163 sec, avg 193.831 ms, 5.159 ops/sec
RSA     2048  key gen         1 ops took 2.187 sec, avg 2186.849 ms, 0.457 ops/sec
RSA     2048   public      1400 ops took 1.065 sec, avg 0.761 ms, 1314.340 ops/sec
RSA     2048  private       100 ops took 3.932 sec, avg 39.325 ms, 25.429 ops/sec
DH      2048  key gen       109 ops took 1.007 sec, avg 9.242 ms, 108.205 ops/sec
DH      2048    agree       100 ops took 1.953 sec, avg 19.530 ms, 51.202 ops/sec
ECC   [      SECP256R1]   256  key gen      1000 ops took 1.065 sec, avg 1.065 ms, 939.342 ops/sec
ECDHE [      SECP256R1]   256    agree      1000 ops took 1.014 sec, avg 1.014 ms, 985.994 ops/sec
ECDSA [      SECP256R1]   256     sign       900 ops took 1.112 sec, avg 1.236 ms, 809.309 ops/sec
ECDSA [      SECP256R1]   256   verify       700 ops took 1.030 sec, avg 1.472 ms, 679.428 ops/sec
ECC   [      SECP256R1]   256  encrypt       900 ops took 1.051 sec, avg 1.168 ms, 856.368 ops/sec
ECC   [      SECP256R1]   256  decrypt       800 ops took 1.106 sec, avg 1.382 ms, 723.377 ops/sec
ECC   [BRAINPOOLP256R1]   256  key gen       900 ops took 1.080 sec, avg 1.200 ms, 833.102 ops/sec
ECDHE [BRAINPOOLP256R1]   256    agree       900 ops took 1.034 sec, avg 1.149 ms, 870.528 ops/sec
ECDSA [BRAINPOOLP256R1]   256     sign       800 ops took 1.093 sec, avg 1.366 ms, 731.855 ops/sec
ECDSA [BRAINPOOLP256R1]   256   verify       700 ops took 1.088 sec, avg 1.554 ms, 643.652 ops/sec
ECC   [BRAINPOOLP256R1]   256  encrypt       800 ops took 1.044 sec, avg 1.305 ms, 766.018 ops/sec
ECC   [BRAINPOOLP256R1]   256  decrypt       700 ops took 1.101 sec, avg 1.574 ms, 635.508 ops/sec
CURVE  25519  key gen      1154 ops took 1.000 sec, avg 0.867 ms, 1153.836 ops/sec
CURVE  25519    agree      1200 ops took 1.013 sec, avg 0.844 ms, 1184.526 ops/sec
ED     25519  key gen      2273 ops took 1.000 sec, avg 0.440 ms, 2272.384 ops/sec
ED     25519     sign      2100 ops took 1.032 sec, avg 0.491 ms, 2035.573 ops/sec
ED     25519   verify      1000 ops took 1.035 sec, avg 1.035 ms, 966.428 ops/sec
CURVE    448  key gen       373 ops took 1.002 sec, avg 2.685 ms, 372.413 ops/sec
CURVE    448    agree       400 ops took 1.063 sec, avg 2.659 ms, 376.125 ops/sec
ED       448  key gen       746 ops took 1.000 sec, avg 1.341 ms, 745.990 ops/sec
ED       448     sign       800 ops took 1.121 sec, avg 1.401 ms, 713.916 ops/sec
ED       448   verify       400 ops took 1.316 sec, avg 3.289 ms, 303.998 ops/sec
ECCSI    256  key gen       774 ops took 1.001 sec, avg 1.293 ms, 773.568 ops/sec
ECCSI    256 pair gen       937 ops took 1.000 sec, avg 1.067 ms, 936.782 ops/sec
ECCSI    256    valid       582 ops took 1.000 sec, avg 1.719 ms, 581.719 ops/sec
ECCSI    256     sign       854 ops took 1.001 sec, avg 1.172 ms, 853.512 ops/sec
ECCSI    256   verify       247 ops took 1.000 sec, avg 4.049 ms, 246.951 ops/sec
SAKKE   1024  key gen        15 ops took 1.008 sec, avg 67.193 ms, 14.882 ops/sec
SAKKE   1024  rsk gen        39 ops took 1.017 sec, avg 26.083 ms, 38.339 ops/sec
SAKKE   1024    valid         4 ops took 1.105 sec, avg 276.192 ms, 3.621 ops/sec
SAKKE   1024    encap-1       6 ops took 1.037 sec, avg 172.764 ms, 5.788 ops/sec
SAKKE   1024   derive-1       4 ops took 1.189 sec, avg 297.185 ms, 3.365 ops/sec
SAKKE   1024    encap-2       6 ops took 1.032 sec, avg 172.008 ms, 5.814 ops/sec
SAKKE   1024   derive-2       4 ops took 1.188 sec, avg 296.979 ms, 3.367 ops/sec
SAKKE   1024   derive-3       4 ops took 1.187 sec, avg 296.857 ms, 3.369 ops/sec
SAKKE   1024   derive-4       4 ops took 1.188 sec, avg 296.881 ms, 3.368 ops/sec
Benchmark complete

dgarske avatar Jun 06 '24 15:06 dgarske

@dgarske - question on the benchmarks, fixed data size vs fixed time:

In the master and riscv_aes_asm branch you ran these commands, respectively:

# before, on master
./configure --enable-all

vs

# after, with ASM Optimization
./configure --enable-all --enable-riscv-asm

Then for comparison you ran this for both:

./wolfcrypt/benchmark/benchmark -aes-cbc -aes-gcm

The output on master took a fixed 5 MiB chunk of data and timed its completion, in this example 12.798 seconds:

AES-128-CBC-enc              5 MiB took 12.798 seconds,    0.391 MiB/s

The output on riscv_aes_asm completed as soon as reasonable after a fixed one second duration and determined the amount of data processed:

AES-128-CBC-enc             20 MiB took 1.076 seconds,   18.588 MiB/s

Why the difference in fixed data size vs fixed time?

Additionally, perhaps just nit-picky, but I'm curious: it appears there was also a difference in the bits size generated. bits=4096:

# master
Math:   Multi-Precision: Wolf(SP) word-size=64 bits=4096 sp_int.c

vs bits=3072

# ASM
Math:   Multi-Precision: Wolf(SP) word-size=64 bits=3072 sp_int.c

It appears that ./configure --enable-all --enable-riscv-asm produces a different user_settings.h than ./configure --enable-all, affecting more than just assembly optimization. Perhaps it should be consistent, at least for the benchmark configuration? I'm also left wondering, for a real apples-to-apples comparison, whether there would be a performance difference if master was set to bits=3072?

In any case - that's an astonishing performance boost by @SparkiDev :)

gojimmypi avatar Jun 14 '24 08:06 gojimmypi

The size of data processed is the number of 1048576-byte (1 MiB) buffers encrypted/decrypted. We always process a minimum number of buffers regardless of platform, and keep going until at least 1 second has elapsed.

I have no idea why the number of bits in SP changed though.

SparkiDev avatar Jun 17 '24 06:06 SparkiDev