Kyber ASM ARMv7E-M: added assembly code

Open SparkiDev opened this issue 1 year ago • 1 comments

Description

Improved performance by reimplementing kyber_ntt, kyber_invntt, kyber_basemul_mont, and kyber_basemul_mont_add in assembly.
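
For context, the NTT and base-multiplication routines named above are built around modular multiplication by the Kyber prime q = 3329, which is commonly done with Montgomery reduction. The following is a minimal C sketch modeled on the public Kyber reference implementation, not the assembly added by this PR. The names mont_reduce, fqmul, and ct_butterfly are illustrative and are not wolfSSL identifiers. It assumes a compiler that wraps on narrowing casts and uses arithmetic right shift for signed values, as GCC and Clang do.

```c
#include <stdint.h>

#define KYBER_Q     3329
#define KYBER_QINV  (-3327)   /* q^-1 mod 2^16, since 3329 * -3327 == 1 (mod 2^16) */

/* Montgomery reduction (illustrative name).
 * For |a| < q * 2^15, returns r with r * 2^16 == a (mod q) and -q < r < q. */
static int16_t mont_reduce(int32_t a)
{
    /* t == a * q^-1 (mod 2^16), so a - t*q is a multiple of 2^16 */
    int16_t t = (int16_t)((int16_t)a * KYBER_QINV);
    return (int16_t)((a - (int32_t)t * KYBER_Q) >> 16);
}

/* Multiply a by b, where b is stored in Montgomery form (b * 2^16 mod q). */
static int16_t fqmul(int16_t a, int16_t b)
{
    return mont_reduce((int32_t)a * b);
}

/* One Cooley-Tukey butterfly, the inner step of a forward NTT layer.
 * zeta is a precomputed twiddle factor in Montgomery form. */
static void ct_butterfly(int16_t *lo, int16_t *hi, int16_t zeta)
{
    int16_t t = fqmul(zeta, *hi);
    *hi = (int16_t)(*lo - t);
    *lo = (int16_t)(*lo + t);
}
```

A forward NTT applies this butterfly across seven layers of 128 coefficient pairs, which is why moving the multiply and reduce steps into hand-written assembly can pay off on a Cortex-M core.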

Testing

./configure '--disable-shared' '--enable-experimental' '--enable-kyber' '--enable-cryptonly' '--disable-rsa' '--disable-dh' '--disable-ecc' 'LDFLAGS=--static' '--host=armv7m' 'CC=arm-linux-gnueabi-gcc' '--enable-armasm'

Checklist

  • [ ] added tests
  • [ ] updated/added doxygen
  • [ ] updated appropriate READMEs
  • [ ] Updated manual and documentation

SparkiDev, Jul 03 '24 07:07

Tested on an STM32H7A3ZI at 240 MHz (Cortex-M7).

Using:

#define WOLFSSL_EXPERIMENTAL_SETTINGS

#define WOLFSSL_SHA3
#define WOLFSSL_SHAKE128
#define WOLFSSL_SHAKE256

#define WOLFSSL_HAVE_KYBER
#define WOLFSSL_WC_KYBER
//#define WOLFSSL_KYBER_SMALL

#define WOLFSSL_ARMASM
#define WOLFSSL_ARMASM_INLINE
#define WOLFSSL_ARMASM_NO_HW_CRYPTO
#define WOLFSSL_ARMASM_NO_NEON
#define WOLFSSL_ARMASM_CRYPTO_SHA3
#define WOLFSSL_ARM_ARCH 7

Current Master (before this PR):

RNG                        975 KiB took 1.024 seconds,  952.148 KiB/s
SHA-256                      3 MiB took 1.004 seconds,    3.088 MiB/s
SHA3-224                     1 MiB took 1.012 seconds,    1.399 MiB/s
SHA3-256                     1 MiB took 1.016 seconds,    1.322 MiB/s
SHA3-384                     1 MiB took 1.000 seconds,    1.025 MiB/s
SHA3-512                   750 KiB took 1.016 seconds,  738.189 KiB/s
SHAKE128                     2 MiB took 1.004 seconds,    1.605 MiB/s
SHAKE256                     1 MiB took 1.015 seconds,    1.323 MiB/s
KYBER512    128  key gen       220 ops took 1.008 sec, avg 4.582 ms, 218.254 ops/sec
KYBER512    128    encap       202 ops took 1.000 sec, avg 4.950 ms, 202.000 ops/sec
KYBER512    128    decap       182 ops took 1.000 sec, avg 5.495 ms, 182.000 ops/sec
KYBER768    192  key gen       142 ops took 1.011 sec, avg 7.120 ms, 140.455 ops/sec
KYBER768    192    encap       124 ops took 1.000 sec, avg 8.065 ms, 124.000 ops/sec
KYBER768    192    decap       114 ops took 1.008 sec, avg 8.842 ms, 113.095 ops/sec
KYBER1024   256  key gen        92 ops took 1.011 sec, avg 10.989 ms, 90.999 ops/sec
KYBER1024   256    encap        82 ops took 1.012 sec, avg 12.341 ms, 81.028 ops/sec
KYBER1024   256    decap        76 ops took 1.016 sec, avg 13.368 ms, 74.803 ops/sec

With PR 7706:

RNG                        975 KiB took 1.016 seconds,  959.646 KiB/s
SHA-256                      3 MiB took 1.004 seconds,    2.967 MiB/s
SHA3-224                     1 MiB took 1.015 seconds,    1.395 MiB/s
SHA3-256                     1 MiB took 1.000 seconds,    1.318 MiB/s
SHA3-384                     1 MiB took 1.004 seconds,    1.021 MiB/s
SHA3-512                   750 KiB took 1.019 seconds,  736.016 KiB/s
SHAKE128                     2 MiB took 1.008 seconds,    1.599 MiB/s
SHAKE256                     1 MiB took 1.004 seconds,    1.313 MiB/s
KYBER512    128  key gen       238 ops took 1.000 sec, avg 4.202 ms, 238.000 ops/sec
KYBER512    128    encap       226 ops took 1.004 sec, avg 4.442 ms, 225.100 ops/sec
KYBER512    128    decap       212 ops took 1.000 sec, avg 4.717 ms, 212.000 ops/sec
KYBER768    192  key gen       156 ops took 1.008 sec, avg 6.462 ms, 154.762 ops/sec
KYBER768    192    encap       140 ops took 1.012 sec, avg 7.229 ms, 138.340 ops/sec
KYBER768    192    decap       132 ops took 1.007 sec, avg 7.629 ms, 131.082 ops/sec
KYBER1024   256  key gen       102 ops took 1.016 sec, avg 9.961 ms, 100.394 ops/sec
KYBER1024   256    encap        90 ops took 1.000 sec, avg 11.111 ms, 90.000 ops/sec
KYBER1024   256    decap        86 ops took 1.000 sec, avg 11.628 ms, 86.000 ops/sec
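
To make the two listings easier to compare, here is a small C sketch that turns the Kyber ops/sec figures above into percentage changes. The numbers are copied from the listings; the helper names speedup_percent, bench_row, kyber_rows, and print_kyber_speedups are made up for this example.

```c
#include <stdio.h>

/* Percentage change in throughput going from `before` to `after` ops/sec. */
static double speedup_percent(double before, double after)
{
    return (after / before - 1.0) * 100.0;
}

struct bench_row {
    const char *name;
    double master;   /* ops/sec on master, before the PR */
    double pr;       /* ops/sec with PR 7706 */
};

/* ops/sec values copied from the two benchmark listings above */
static const struct bench_row kyber_rows[] = {
    { "KYBER512 key gen",  218.254, 238.000 },
    { "KYBER512 encap",    202.000, 225.100 },
    { "KYBER512 decap",    182.000, 212.000 },
    { "KYBER768 key gen",  140.455, 154.762 },
    { "KYBER768 encap",    124.000, 138.340 },
    { "KYBER768 decap",    113.095, 131.082 },
    { "KYBER1024 key gen",  90.999, 100.394 },
    { "KYBER1024 encap",    81.028,  90.000 },
    { "KYBER1024 decap",    74.803,  86.000 },
};

/* Print one line per operation with its relative change. */
static void print_kyber_speedups(void)
{
    size_t i;
    for (i = 0; i < sizeof(kyber_rows) / sizeof(kyber_rows[0]); i++) {
        printf("%-18s %+.1f%%\n", kyber_rows[i].name,
               speedup_percent(kyber_rows[i].master, kyber_rows[i].pr));
    }
}
```

Calling print_kyber_speedups() from a main function prints the per-operation change. By this measure KYBER512 key gen improves by about 9%. The hash and RNG rows are left out because the PR does not touch them.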

Note: the benchmark won't run Kyber unless you pass -kyber or apply this patch:

diff --git a/wolfcrypt/benchmark/benchmark.c b/wolfcrypt/benchmark/benchmark.c
index 964f9ebd0..1082de63c 100644
--- a/wolfcrypt/benchmark/benchmark.c
+++ b/wolfcrypt/benchmark/benchmark.c
@@ -3593,17 +3593,17 @@ static void* benchmarks_do(void* args)
 #ifdef WOLFSSL_HAVE_KYBER
     if (bench_all || (bench_pq_asym_algs & BENCH_KYBER)) {
     #ifdef WOLFSSL_KYBER512
-        if (bench_pq_asym_algs & BENCH_KYBER512) {
+        if (bench_all || (bench_pq_asym_algs & BENCH_KYBER512)) {
             bench_kyber(KYBER512);
         }
     #endif
     #ifdef WOLFSSL_KYBER768
-        if (bench_pq_asym_algs & BENCH_KYBER768) {
+        if (bench_all || (bench_pq_asym_algs & BENCH_KYBER768)) {
             bench_kyber(KYBER768);
         }
     #endif
     #ifdef WOLFSSL_KYBER1024
-        if (bench_pq_asym_algs & BENCH_KYBER1024) {
+        if (bench_all || (bench_pq_asym_algs & BENCH_KYBER1024)) {
             bench_kyber(KYBER1024);
         }
     #endif
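
The effect of the patch can be sketched in isolation. When no algorithm option is given, bench_all is set but bench_pq_asym_algs stays zero, so the outer check passes while each per-level check fails. The C sketch below models the two conditions. The flag values are hypothetical, BENCH_KYBER is assumed to be the OR of the three level flags, and runs_before and runs_after are illustrative helpers that do not exist in benchmark.c.

```c
#include <stdbool.h>

/* Hypothetical flag values for illustration; the real ones live in benchmark.c */
#define BENCH_KYBER512   0x01u
#define BENCH_KYBER768   0x02u
#define BENCH_KYBER1024  0x04u
/* Assumption: the group flag covers all three levels */
#define BENCH_KYBER      (BENCH_KYBER512 | BENCH_KYBER768 | BENCH_KYBER1024)

/* Condition before the patch: the inner test ignores bench_all. */
static bool runs_before(bool bench_all, unsigned algs, unsigned level)
{
    bool outer = bench_all || (algs & BENCH_KYBER) != 0;
    bool inner = (algs & level) != 0;
    return outer && inner;
}

/* Condition after the patch: the inner test also honors bench_all. */
static bool runs_after(bool bench_all, unsigned algs, unsigned level)
{
    bool outer = bench_all || (algs & BENCH_KYBER) != 0;
    bool inner = bench_all || (algs & level) != 0;
    return outer && inner;
}
```

With bench_all true and an empty mask, runs_before returns false for every level while runs_after returns true, which matches the behavior described above. When a level is selected explicitly, both versions agree.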

dgarske, Jul 03 '24 23:07

retest this please

SparkiDev, Oct 03 '24 08:10

Re-ran on the same target (STM32H7A3ZI at 240 MHz) with -Os. Kyber seems to be about 50% faster!

Please select one of the above options:
Running wolfCrypt Benchmarks...
wolfCrypt Benchmark (block bytes 1024, min 1.0 sec each)
RNG                          1 MiB took 1.011 seconds,    1.087 MiB/s
AES-128-CBC-enc              1 MiB took 1.000 seconds,    1.489 MiB/s
AES-128-CBC-dec              1 MiB took 1.004 seconds,    1.483 MiB/s
AES-192-CBC-enc              1 MiB took 1.012 seconds,    1.254 MiB/s
AES-192-CBC-dec              1 MiB took 1.007 seconds,    1.236 MiB/s
AES-256-CBC-enc              1 MiB took 1.016 seconds,    1.081 MiB/s
AES-256-CBC-dec              1 MiB took 1.008 seconds,    1.066 MiB/s
AES-128-GCM-enc           1000 KiB took 1.004 seconds,  996.016 KiB/s
AES-128-GCM-dec           1000 KiB took 1.016 seconds,  984.252 KiB/s
AES-192-GCM-enc            900 KiB took 1.020 seconds,  882.353 KiB/s
AES-192-GCM-dec            875 KiB took 1.000 seconds,  875.000 KiB/s
AES-256-GCM-enc            800 KiB took 1.004 seconds,  796.813 KiB/s
AES-256-GCM-dec            800 KiB took 1.016 seconds,  787.402 KiB/s
AES-128-GCM-enc-no_AAD       1 MiB took 1.019 seconds,    0.982 MiB/s
AES-128-GCM-dec-no_AAD    1000 KiB took 1.004 seconds,  996.016 KiB/s
AES-192-GCM-enc-no_AAD     900 KiB took 1.012 seconds,  889.328 KiB/s
AES-192-GCM-dec-no_AAD     900 KiB took 1.019 seconds,  883.219 KiB/s
AES-256-GCM-enc-no_AAD     800 KiB took 1.000 seconds,  800.000 KiB/s
AES-256-GCM-dec-no_AAD     800 KiB took 1.012 seconds,  790.514 KiB/s
GMAC Table 4-bit             3 MiB took 1.000 seconds,    2.907 MiB/s
CHACHA                       7 MiB took 1.000 seconds,    7.056 MiB/s
CHA-POLY                     5 MiB took 1.004 seconds,    4.936 MiB/s
POLY1305                    29 MiB took 1.000 seconds,   29.297 MiB/s
SHA-256                      3 MiB took 1.000 seconds,    3.027 MiB/s
SHA3-224                     1 MiB took 1.012 seconds,    1.423 MiB/s
SHA3-256                     1 MiB took 1.000 seconds,    1.343 MiB/s
SHA3-384                     1 MiB took 1.012 seconds,    1.037 MiB/s
SHA3-512                   750 KiB took 1.008 seconds,  744.048 KiB/s
SHAKE128                     2 MiB took 1.012 seconds,    1.640 MiB/s
SHAKE256                     1 MiB took 1.016 seconds,    1.346 MiB/s
HMAC-SHA256                  3 MiB took 1.000 seconds,    3.003 MiB/s
RSA     2048   public       100 ops took 1.004 sec, avg 10.040 ms, 99.602 ops/sec
RSA     2048  private         4 ops took 1.619 sec, avg 404.750 ms, 2.471 ops/sec
DH      2048  key gen         7 ops took 1.149 sec, avg 164.143 ms, 6.092 ops/sec
DH      2048    agree         8 ops took 1.309 sec, avg 163.625 ms, 6.112 ops/sec
KYBER512    128  key gen       362 ops took 1.000 sec, avg 2.762 ms, 362.000 ops/sec
KYBER512    128    encap       352 ops took 1.000 sec, avg 2.841 ms, 352.000 ops/sec
KYBER512    128    decap       260 ops took 1.004 sec, avg 3.862 ms, 258.964 ops/sec
KYBER768    192  key gen       224 ops took 1.008 sec, avg 4.500 ms, 222.222 ops/sec
KYBER768    192    encap       210 ops took 1.004 sec, avg 4.781 ms, 209.163 ops/sec
KYBER768    192    decap       160 ops took 1.000 sec, avg 6.250 ms, 160.000 ops/sec
KYBER1024   256  key gen       136 ops took 1.008 sec, avg 7.412 ms, 134.921 ops/sec
KYBER1024   256    encap       130 ops took 1.008 sec, avg 7.754 ms, 128.968 ops/sec
KYBER1024   256    decap       104 ops took 1.004 sec, avg 9.654 ms, 103.586 ops/sec
ECC   [      SECP256R1]   256  key gen       226 ops took 1.004 sec, avg 4.442 ms, 225.100 ops/sec
ECDHE [      SECP256R1]   256    agree       116 ops took 1.008 sec, avg 8.690 ms, 115.079 ops/sec
ECDSA [      SECP256R1]   256     sign       104 ops took 1.008 sec, avg 9.692 ms, 103.175 ops/sec
ECDSA [      SECP256R1]   256   verify        68 ops took 1.024 sec, avg 15.059 ms, 66.406 ops/sec
ML-DSA    44  key gen        76 ops took 1.020 sec, avg 13.421 ms, 74.510 ops/sec
ML-DSA    44     sign        30 ops took 1.047 sec, avg 34.900 ms, 28.653 ops/sec
ML-DSA    44   verify        74 ops took 1.008 sec, avg 13.622 ms, 73.413 ops/sec
ML-DSA    65  key gen        44 ops took 1.024 sec, avg 23.273 ms, 42.969 ops/sec
ML-DSA    65     sign        18 ops took 1.137 sec, avg 63.167 ms, 15.831 ops/sec
ML-DSA    65   verify        44 ops took 1.000 sec, avg 22.727 ms, 44.000 ops/sec
ML-DSA    87  key gen        26 ops took 1.019 sec, avg 39.192 ms, 25.515 ops/sec
ML-DSA    87     sign        14 ops took 1.090 sec, avg 77.857 ms, 12.844 ops/sec
ML-DSA    87   verify        26 ops took 1.004 sec, avg 38.615 ms, 25.896 ops/sec
Benchmark complete
Benchmark Test: Return code 0

dgarske, Oct 03 '24 17:10