wolfssl
Kyber ASM ARMv7E-M: added assembly code
Description
Improved performance by reimplementing kyber_ntt, kyber_invntt, kyber_basemul_mont and kyber_basemul_mont_add in assembly.
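For readers unfamiliar with what these routines compute, here is a minimal portable C sketch of the Montgomery arithmetic that sits underneath the NTT and base multiplication. It follows the style of the public Kyber reference code, not the assembly added in this PR; the names mont_reduce, fqmul and basemul are illustrative, and the code assumes a compiler with arithmetic right shift and wrap-around narrowing (as GCC and Clang provide).

```c
#include <stdint.h>

#define KYBER_Q    3329     /* Kyber prime modulus */
#define KYBER_QINV (-3327)  /* q^-1 mod 2^16, as a signed 16-bit value */

/* Montgomery reduction: returns r with r == a * 2^-16 (mod q), |r| < q.
 * Valid for |a| < q * 2^15. Assumes arithmetic right shift of negative
 * values and wrap-around on narrowing conversions. */
static int16_t mont_reduce(int32_t a)
{
    int16_t t = (int16_t)((int16_t)a * KYBER_QINV);
    return (int16_t)((a - (int32_t)t * KYBER_Q) >> 16);
}

/* Multiply two field elements and Montgomery-reduce the product. */
static int16_t fqmul(int16_t a, int16_t b)
{
    return mont_reduce((int32_t)a * b);
}

/* Multiply two degree-1 polynomials modulo (X^2 - zeta).
 * This is the 2x2 building block that a "basemul" routine repeats
 * across all coefficient pairs of an NTT-domain polynomial. */
static void basemul(int16_t r[2], const int16_t a[2], const int16_t b[2],
                    int16_t zeta)
{
    r[0] = fqmul(fqmul(a[1], b[1]), zeta);
    r[0] = (int16_t)(r[0] + fqmul(a[0], b[0]));
    r[1] = fqmul(a[0], b[1]);
    r[1] = (int16_t)(r[1] + fqmul(a[1], b[0]));
}
```

Each fqmul is a multiply followed by a short reduction, and these routines perform a large number of them per call, which is presumably why moving them to hand-written assembly pays off.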
Testing
./configure '--disable-shared' '--enable-experimental' '--enable-kyber' '--enable-cryptonly' '--disable-rsa' '--disable-dh' '--disable-ecc' 'LDFLAGS=--static' '--host=armv7m' 'CC=arm-linux-gnueabi-gcc' '--enable-armasm'
Checklist
- [ ] added tests
- [ ] updated/added doxygen
- [ ] updated appropriate READMEs
- [ ] Updated manual and documentation
Tested on an STM32H7A3ZI at 240 MHz (Cortex-M7)
using the following build settings:
#define WOLFSSL_EXPERIMENTAL_SETTINGS
#define WOLFSSL_SHA3
#define WOLFSSL_SHAKE128
#define WOLFSSL_SHAKE256
#define WOLFSSL_HAVE_KYBER
#define WOLFSSL_WC_KYBER
//#define WOLFSSL_KYBER_SMALL
#define WOLFSSL_ARMASM
#define WOLFSSL_ARMASM_INLINE
#define WOLFSSL_ARMASM_NO_HW_CRYPTO
#define WOLFSSL_ARMASM_NO_NEON
#define WOLFSSL_ARMASM_CRYPTO_SHA3
#define WOLFSSL_ARM_ARCH 7
Current Master (before this PR):
RNG 975 KiB took 1.024 seconds, 952.148 KiB/s
SHA-256 3 MiB took 1.004 seconds, 3.088 MiB/s
SHA3-224 1 MiB took 1.012 seconds, 1.399 MiB/s
SHA3-256 1 MiB took 1.016 seconds, 1.322 MiB/s
SHA3-384 1 MiB took 1.000 seconds, 1.025 MiB/s
SHA3-512 750 KiB took 1.016 seconds, 738.189 KiB/s
SHAKE128 2 MiB took 1.004 seconds, 1.605 MiB/s
SHAKE256 1 MiB took 1.015 seconds, 1.323 MiB/s
KYBER512 128 key gen 220 ops took 1.008 sec, avg 4.582 ms, 218.254 ops/sec
KYBER512 128 encap 202 ops took 1.000 sec, avg 4.950 ms, 202.000 ops/sec
KYBER512 128 decap 182 ops took 1.000 sec, avg 5.495 ms, 182.000 ops/sec
KYBER768 192 key gen 142 ops took 1.011 sec, avg 7.120 ms, 140.455 ops/sec
KYBER768 192 encap 124 ops took 1.000 sec, avg 8.065 ms, 124.000 ops/sec
KYBER768 192 decap 114 ops took 1.008 sec, avg 8.842 ms, 113.095 ops/sec
KYBER1024 256 key gen 92 ops took 1.011 sec, avg 10.989 ms, 90.999 ops/sec
KYBER1024 256 encap 82 ops took 1.012 sec, avg 12.341 ms, 81.028 ops/sec
KYBER1024 256 decap 76 ops took 1.016 sec, avg 13.368 ms, 74.803 ops/sec
With PR 7706:
RNG 975 KiB took 1.016 seconds, 959.646 KiB/s
SHA-256 3 MiB took 1.004 seconds, 2.967 MiB/s
SHA3-224 1 MiB took 1.015 seconds, 1.395 MiB/s
SHA3-256 1 MiB took 1.000 seconds, 1.318 MiB/s
SHA3-384 1 MiB took 1.004 seconds, 1.021 MiB/s
SHA3-512 750 KiB took 1.019 seconds, 736.016 KiB/s
SHAKE128 2 MiB took 1.008 seconds, 1.599 MiB/s
SHAKE256 1 MiB took 1.004 seconds, 1.313 MiB/s
KYBER512 128 key gen 238 ops took 1.000 sec, avg 4.202 ms, 238.000 ops/sec
KYBER512 128 encap 226 ops took 1.004 sec, avg 4.442 ms, 225.100 ops/sec
KYBER512 128 decap 212 ops took 1.000 sec, avg 4.717 ms, 212.000 ops/sec
KYBER768 192 key gen 156 ops took 1.008 sec, avg 6.462 ms, 154.762 ops/sec
KYBER768 192 encap 140 ops took 1.012 sec, avg 7.229 ms, 138.340 ops/sec
KYBER768 192 decap 132 ops took 1.007 sec, avg 7.629 ms, 131.082 ops/sec
KYBER1024 256 key gen 102 ops took 1.016 sec, avg 9.961 ms, 100.394 ops/sec
KYBER1024 256 encap 90 ops took 1.000 sec, avg 11.111 ms, 90.000 ops/sec
KYBER1024 256 decap 86 ops took 1.000 sec, avg 11.628 ms, 86.000 ops/sec
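To put a number on the difference between the two runs above, the ops/sec columns can be compared directly. The helper below is only a convenience sketch; the struct and function names are made up for illustration, and the figures are copied from the two benchmark listings.

```c
#include <stdio.h>

/* Percentage gain in throughput going from `before` to `after` (ops/sec). */
static double speedup_pct(double before, double after)
{
    return (after / before - 1.0) * 100.0;
}

struct bench_row {
    const char *name;
    double before;  /* ops/sec on master */
    double after;   /* ops/sec with this PR */
};

/* Kyber ops/sec figures taken from the two listings above. */
static const struct bench_row rows[] = {
    { "KYBER512 key gen",  218.254, 238.000 },
    { "KYBER512 encap",    202.000, 225.100 },
    { "KYBER512 decap",    182.000, 212.000 },
    { "KYBER768 key gen",  140.455, 154.762 },
    { "KYBER768 encap",    124.000, 138.340 },
    { "KYBER768 decap",    113.095, 131.082 },
    { "KYBER1024 key gen",  90.999, 100.394 },
    { "KYBER1024 encap",    81.028,  90.000 },
    { "KYBER1024 decap",    74.803,  86.000 },
};

/* Print one line per operation with the relative gain. */
static void print_speedups(void)
{
    size_t i;
    for (i = 0; i < sizeof(rows) / sizeof(rows[0]); i++) {
        printf("%-18s %+.1f%%\n", rows[i].name,
               speedup_pct(rows[i].before, rows[i].after));
    }
}
```

Calling print_speedups() from main prints the relative gain per operation; the hash and RNG rows are left out because this PR does not change them.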
Note: the benchmark won't run Kyber unless -kyber is passed or this patch is applied:
diff --git a/wolfcrypt/benchmark/benchmark.c b/wolfcrypt/benchmark/benchmark.c
index 964f9ebd0..1082de63c 100644
--- a/wolfcrypt/benchmark/benchmark.c
+++ b/wolfcrypt/benchmark/benchmark.c
@@ -3593,17 +3593,17 @@ static void* benchmarks_do(void* args)
#ifdef WOLFSSL_HAVE_KYBER
if (bench_all || (bench_pq_asym_algs & BENCH_KYBER)) {
#ifdef WOLFSSL_KYBER512
- if (bench_pq_asym_algs & BENCH_KYBER512) {
+ if (bench_all || (bench_pq_asym_algs & BENCH_KYBER512)) {
bench_kyber(KYBER512);
}
#endif
#ifdef WOLFSSL_KYBER768
- if (bench_pq_asym_algs & BENCH_KYBER768) {
+ if (bench_all || (bench_pq_asym_algs & BENCH_KYBER768)) {
bench_kyber(KYBER768);
}
#endif
#ifdef WOLFSSL_KYBER1024
- if (bench_pq_asym_algs & BENCH_KYBER1024) {
+ if (bench_all || (bench_pq_asym_algs & BENCH_KYBER1024)) {
bench_kyber(KYBER1024);
}
#endif
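The effect of the patch can be checked in isolation with a small sketch of the two conditions. The bit values below are hypothetical stand-ins (the real definitions are in benchmark.c), and the assumption that BENCH_KYBER is the OR of the three per-level bits is mine; runs_old and runs_new are illustrative names.

```c
/* Hypothetical bit values; the real ones are defined in benchmark.c. */
#define BENCH_KYBER512   0x1u
#define BENCH_KYBER768   0x2u
#define BENCH_KYBER1024  0x4u
/* Assumed here to be the OR of the per-level bits. */
#define BENCH_KYBER (BENCH_KYBER512 | BENCH_KYBER768 | BENCH_KYBER1024)

/* Condition before the patch: the inner test ignores bench_all. */
static int runs_old(int bench_all, unsigned algs, unsigned level)
{
    return (bench_all || (algs & BENCH_KYBER)) && (algs & level);
}

/* Condition after the patch: bench_all also enables each level. */
static int runs_new(int bench_all, unsigned algs, unsigned level)
{
    return (bench_all || (algs & BENCH_KYBER)) &&
           (bench_all || (algs & level));
}
```

With bench_all set and no algorithm bits selected, runs_old(1, 0, BENCH_KYBER512) is 0 while runs_new(1, 0, BENCH_KYBER512) is 1, which matches the behaviour described above: without the patch, a default run skips Kyber unless -kyber is given.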
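Re-ran on the same target (STM32H7A3ZI at 240 MHz), built with -Os.
Kyber seems to be about 50% faster than the master numbers above: roughly 40% to 75% more ops/sec depending on the operation (for example, KYBER512 key gen goes from 218 to 362 ops/sec).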
Please select one of the above options:
Running wolfCrypt Benchmarks...
wolfCrypt Benchmark (block bytes 1024, min 1.0 sec each)
RNG 1 MiB took 1.011 seconds, 1.087 MiB/s
AES-128-CBC-enc 1 MiB took 1.000 seconds, 1.489 MiB/s
AES-128-CBC-dec 1 MiB took 1.004 seconds, 1.483 MiB/s
AES-192-CBC-enc 1 MiB took 1.012 seconds, 1.254 MiB/s
AES-192-CBC-dec 1 MiB took 1.007 seconds, 1.236 MiB/s
AES-256-CBC-enc 1 MiB took 1.016 seconds, 1.081 MiB/s
AES-256-CBC-dec 1 MiB took 1.008 seconds, 1.066 MiB/s
AES-128-GCM-enc 1000 KiB took 1.004 seconds, 996.016 KiB/s
AES-128-GCM-dec 1000 KiB took 1.016 seconds, 984.252 KiB/s
AES-192-GCM-enc 900 KiB took 1.020 seconds, 882.353 KiB/s
AES-192-GCM-dec 875 KiB took 1.000 seconds, 875.000 KiB/s
AES-256-GCM-enc 800 KiB took 1.004 seconds, 796.813 KiB/s
AES-256-GCM-dec 800 KiB took 1.016 seconds, 787.402 KiB/s
AES-128-GCM-enc-no_AAD 1 MiB took 1.019 seconds, 0.982 MiB/s
AES-128-GCM-dec-no_AAD 1000 KiB took 1.004 seconds, 996.016 KiB/s
AES-192-GCM-enc-no_AAD 900 KiB took 1.012 seconds, 889.328 KiB/s
AES-192-GCM-dec-no_AAD 900 KiB took 1.019 seconds, 883.219 KiB/s
AES-256-GCM-enc-no_AAD 800 KiB took 1.000 seconds, 800.000 KiB/s
AES-256-GCM-dec-no_AAD 800 KiB took 1.012 seconds, 790.514 KiB/s
GMAC Table 4-bit 3 MiB took 1.000 seconds, 2.907 MiB/s
CHACHA 7 MiB took 1.000 seconds, 7.056 MiB/s
CHA-POLY 5 MiB took 1.004 seconds, 4.936 MiB/s
POLY1305 29 MiB took 1.000 seconds, 29.297 MiB/s
SHA-256 3 MiB took 1.000 seconds, 3.027 MiB/s
SHA3-224 1 MiB took 1.012 seconds, 1.423 MiB/s
SHA3-256 1 MiB took 1.000 seconds, 1.343 MiB/s
SHA3-384 1 MiB took 1.012 seconds, 1.037 MiB/s
SHA3-512 750 KiB took 1.008 seconds, 744.048 KiB/s
SHAKE128 2 MiB took 1.012 seconds, 1.640 MiB/s
SHAKE256 1 MiB took 1.016 seconds, 1.346 MiB/s
HMAC-SHA256 3 MiB took 1.000 seconds, 3.003 MiB/s
RSA 2048 public 100 ops took 1.004 sec, avg 10.040 ms, 99.602 ops/sec
RSA 2048 private 4 ops took 1.619 sec, avg 404.750 ms, 2.471 ops/sec
DH 2048 key gen 7 ops took 1.149 sec, avg 164.143 ms, 6.092 ops/sec
DH 2048 agree 8 ops took 1.309 sec, avg 163.625 ms, 6.112 ops/sec
KYBER512 128 key gen 362 ops took 1.000 sec, avg 2.762 ms, 362.000 ops/sec
KYBER512 128 encap 352 ops took 1.000 sec, avg 2.841 ms, 352.000 ops/sec
KYBER512 128 decap 260 ops took 1.004 sec, avg 3.862 ms, 258.964 ops/sec
KYBER768 192 key gen 224 ops took 1.008 sec, avg 4.500 ms, 222.222 ops/sec
KYBER768 192 encap 210 ops took 1.004 sec, avg 4.781 ms, 209.163 ops/sec
KYBER768 192 decap 160 ops took 1.000 sec, avg 6.250 ms, 160.000 ops/sec
KYBER1024 256 key gen 136 ops took 1.008 sec, avg 7.412 ms, 134.921 ops/sec
KYBER1024 256 encap 130 ops took 1.008 sec, avg 7.754 ms, 128.968 ops/sec
KYBER1024 256 decap 104 ops took 1.004 sec, avg 9.654 ms, 103.586 ops/sec
ECC [ SECP256R1] 256 key gen 226 ops took 1.004 sec, avg 4.442 ms, 225.100 ops/sec
ECDHE [ SECP256R1] 256 agree 116 ops took 1.008 sec, avg 8.690 ms, 115.079 ops/sec
ECDSA [ SECP256R1] 256 sign 104 ops took 1.008 sec, avg 9.692 ms, 103.175 ops/sec
ECDSA [ SECP256R1] 256 verify 68 ops took 1.024 sec, avg 15.059 ms, 66.406 ops/sec
ML-DSA 44 key gen 76 ops took 1.020 sec, avg 13.421 ms, 74.510 ops/sec
ML-DSA 44 sign 30 ops took 1.047 sec, avg 34.900 ms, 28.653 ops/sec
ML-DSA 44 verify 74 ops took 1.008 sec, avg 13.622 ms, 73.413 ops/sec
ML-DSA 65 key gen 44 ops took 1.024 sec, avg 23.273 ms, 42.969 ops/sec
ML-DSA 65 sign 18 ops took 1.137 sec, avg 63.167 ms, 15.831 ops/sec
ML-DSA 65 verify 44 ops took 1.000 sec, avg 22.727 ms, 44.000 ops/sec
ML-DSA 87 key gen 26 ops took 1.019 sec, avg 39.192 ms, 25.515 ops/sec
ML-DSA 87 sign 14 ops took 1.090 sec, avg 77.857 ms, 12.844 ops/sec
ML-DSA 87 verify 26 ops took 1.004 sec, avg 38.615 ms, 25.896 ops/sec
Benchmark complete
Benchmark Test: Return code 0