wolfssl
ARM ASM: ARMv7a with NEON instructions
Description
Change to build assembly code for ARMv7a with the NEON instruction set:
./configure -host=armv7a --enable-armasm
Added an ARM32 SHA-256 NEON-only implementation.
Testing
Tested SHA-256 NEON implementation using QEMU.
Checklist
- [ ] added tests
- [ ] updated/added doxygen
- [ ] updated appropriate READMEs
- [ ] Updated manual and documentation
retest this please
@SparkiDev I've just done some testing on this.
First, I had to change armv7a to armv7a* in configure.ac L2050 to accommodate the standard target tuple armv7a-unknown-linux-gnueabihf.
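The effect of that pattern change can be sketched with a plain shell `case`, which is what configure scripts typically expand to. This is only an illustration: it assumes the check compares the full host string, and the helper names `match_exact` and `match_glob` are hypothetical, not taken from configure.ac.

```shell
#!/bin/sh
# Hypothetical helpers mimicking a configure.ac-style host check.
# match_exact: the pattern as it was (armv7a only).
match_exact() {
    case "$1" in
        armv7a) echo yes ;;
        *)      echo no ;;
    esac
}

# match_glob: the pattern after the tweak (armv7a followed by anything).
match_glob() {
    case "$1" in
        armv7a*) echo yes ;;
        *)       echo no ;;
    esac
}

# Compare both patterns against a bare CPU name, a full tuple, and a non-match.
for value in armv7a armv7a-unknown-linux-gnueabihf aarch64; do
    printf '%-34s exact=%s glob=%s\n' \
        "$value" "$(match_exact "$value")" "$(match_glob "$value")"
done
```

Only the glob form accepts the full tuple, which is consistent with the tweak described above.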
That done, to get --enable-armasm to build, I had to include --disable-chacha --disable-xchacha --disable-poly1305, because of errors like this:
/tmp/tmp.4346_8721/ccFSpc8A.s: Assembler messages:
/tmp/tmp.4346_8721/ccFSpc8A.s:270: Error: first transfer register must be even -- `ldrd r11,r10,[r14,#4*14]'
make[2]: *** [Makefile:5938: wolfcrypt/src/port/arm/src_libwolfssl_la-armv8-chacha.lo] Error 1
Once I did that, testsuite.test nominally succeeded, but crashed at exit:
[...]
mutex test passed!
memcb test passed!
Test complete
qemu: uncaught target signal 7 (Bus error) - core dumped
(This was repeatable.)
Other bits:
With -pedantic, this happens:
wolfcrypt/src/sha256.c:2002: error: ISO C forbids an empty translation unit [-Werror=pedantic]
Looks like an include.am oversight.
And finally, --enable-all --enable-armasm on armv7a breaks as follows:
wolfcrypt/src/port/arm/armv8-sha256.c: In function ‘wc_Sha256Transform’:
089ebf277f (<[email protected]> 2021-03-23 12:53:06 +1000 1540) Sha256Transform(sha256, data, 1);
wolfcrypt/src/port/arm/armv8-sha256.c:1540:5: error: implicit declaration of function ‘Sha256Transform’; did you mean ‘wc_Sha256Transform’? [-Werror=implicit-function-declaration]
1540 | Sha256Transform(sha256, data, 1);
| ^~~~~~~~~~~~~~~
| wc_Sha256Transform
089ebf277f (<[email protected]> 2021-03-23 12:53:06 +1000 1540) Sha256Transform(sha256, data, 1);
wolfcrypt/src/port/arm/armv8-sha256.c:1540:5: error: nested extern declaration of ‘Sha256Transform’ [-Werror=nested-externs]
This just has to be related to the wholesale replacement of sha256.c that happens for armasm.
The toolchain versions for the above:
[I-O] [ ] cross-armv7a-unknown-linux-gnueabihf/binutils-2.38-r2:2.38
[I-O] [ ] cross-armv7a-unknown-linux-gnueabihf/gcc-12.1.1_p20220625:12
[I-O] [ ] cross-armv7a-unknown-linux-gnueabihf/linux-headers-5.19:0
[I-O] [ ] cross-armv7a-unknown-linux-gnueabihf/glibc-2.35-r8:2.2
[IP-] [ ] app-emulation/qemu-7.0.0-r3:0
Had to disable the same things as @douzzer. But, I was able to get some benchmarking results on my Raspberry Pi:
pi@raspberrypi:~/wolfssl $ lscpu; cat before_sha256.txt; cat after_sha256.txt; cat before_sha512.txt; cat after_sha512.txt
Architecture: armv7l
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 3
Model name: Cortex-A72
Stepping: r0p3
CPU max MHz: 1500.0000
CPU min MHz: 600.0000
BogoMIPS: 108.00
Flags: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
------------------------------------------------------------------------------
wolfSSL version 5.4.0
------------------------------------------------------------------------------
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
SHA-256 80 MB took 1.008 seconds, 79.339 MB/s
Benchmark complete
------------------------------------------------------------------------------
wolfSSL version 5.4.0
------------------------------------------------------------------------------
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
SHA-256 120 MB took 1.020 seconds, 117.677 MB/s
Benchmark complete
------------------------------------------------------------------------------
wolfSSL version 5.4.0
------------------------------------------------------------------------------
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
SHA-512 40 MB took 1.019 seconds, 39.273 MB/s
Benchmark complete
------------------------------------------------------------------------------
wolfSSL version 5.4.0
------------------------------------------------------------------------------
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
SHA-512 75 MB took 1.064 seconds, 70.472 MB/s
Benchmark complete
Looking quite a bit faster than before!
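To put a number on "quite a bit faster", the before/after throughput figures above can be divided directly. A minimal sketch; the helper name `speedup` is made up for this illustration, and the inputs are the MB/s values from the benchmark output above.

```shell
#!/bin/sh
# speedup BEFORE AFTER -> prints AFTER/BEFORE with two decimals.
speedup() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.2f\n", after / before }'
}

# MB/s values copied from the wolfCrypt benchmark output in this thread.
echo "SHA-256: $(speedup 79.339 117.677)x"
echo "SHA-512: $(speedup 39.273 70.472)x"
```

That works out to roughly a 1.48x gain for SHA-256 and 1.79x for SHA-512 on this Cortex-A72.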
OpenSSL numbers:
pi@raspberrypi:~/wolfssl $ cat openssl_sha256.txt; echo ""; cat openssl_sha512.txt
Doing sha256 for 3s on 1048576 size blocks: 469 sha256's in 3.00s
OpenSSL 1.1.1d 10 Sep 2019
built on: Fri Jan 31 15:37:19 2020 UTC
options:bn(64,32) rc4(char) des(long) aes(partial) blowfish(ptr)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -D__ARM_MAX_ARCH__=7 -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-ueKbAp/openssl-1.1.1d=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type 1048576 bytes
sha256 163927.38k
Doing sha512 for 3s on 1048576 size blocks: 270 sha512's in 3.00s
OpenSSL 1.1.1d 10 Sep 2019
built on: Fri Jan 31 15:37:19 2020 UTC
options:bn(64,32) rc4(char) des(long) aes(partial) blowfish(ptr)
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -D__ARM_MAX_ARCH__=7 -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-ueKbAp/openssl-1.1.1d=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type 1048576 bytes
sha512 94371.84k
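The OpenSSL figures are reported in 1000s of bytes per second, so they are not directly comparable with the wolfCrypt MB/s values. A rough conversion, assuming wolfCrypt's "MB" in this build means 1,048,576 bytes (the thread does not confirm this), might look like the following; `openssl_k_to_mib` is a hypothetical helper name.

```shell
#!/bin/sh
# openssl_k_to_mib K -> converts "1000s of bytes per second" to MiB/s.
# Assumption: the wolfCrypt benchmark's "MB" is 1048576 bytes.
openssl_k_to_mib() {
    awk -v k="$1" 'BEGIN { printf "%.2f\n", k * 1000 / 1048576 }'
}

# Values copied from the openssl speed output above.
echo "OpenSSL sha256: $(openssl_k_to_mib 163927.38) MiB/s"
echo "OpenSSL sha512: $(openssl_k_to_mib 94371.84) MiB/s"
```

Under that assumption OpenSSL reaches about 156 MiB/s for SHA-256 and 90 MiB/s for SHA-512, compared with 117.677 and 70.472 for the new wolfSSL assembly.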
Just tried this rebased on latest master, with the configure.ac tweak to recognize armv7a-unknown-linux-gnueabihf:
autogen.sh 08b4fd6f2e-dirty... real 0m14.917s user 0m13.195s sys 0m0.859s
configure... real 0m8.970s user 0m4.785s sys 0m5.044s
build... real 0m30.472s user 2m5.040s sys 0m6.234s
testsuite.test... real 0m28.347s user 0m28.304s sys 0m0.014s
testsuite.test for scenario cross-armv7a-all-armasm exited with status 135.
================================================================================
TLSv1.3 KDF test passed!
X963-KDF test passed!
GMAC test passed!
ARC4 test passed!
DES test passed!
DES3 test passed!
AES test passed!
AES192 test passed!
AES256 test passed!
AESOFB test passed!
AES-GCM test passed!
AES-CCM test passed!
AES Key Wrap test passed!
AES-SIV test passed!
CAMELLIA test passed!
RSA NOPAD test passed!
RSA test passed!
DH test passed!
DSA test passed!
qemu: uncaught target signal 7 (Bus error) - core dumped
[... retries ...]
cross-armv7a-all-armasm fail_check
failed config: '--enable-all' 'CPPFLAGS=-DNO_WOLFSSL_CIPHER_SUITE_TEST -DWOLFSSL_OLD_PRIME_CHECK -pedantic' '--enable-asn=template' '--enable-armasm' '--disable-chacha' '--disable-poly1305' '--disable-xchacha' '--host=armv7a-unknown-linux-gnueabihf' 'FILECMD=/bin/false' 'MANIFEST_TOOL=/bin/false'
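The exit status 135 in the log above lines up with the "signal 7 (Bus error)" message if the status follows the usual shell convention of 128 plus the signal number. A small sketch of that arithmetic, assuming the test harness reports a shell-style status; `signal_from_status` is a hypothetical helper.

```shell
#!/bin/sh
# signal_from_status STATUS -> prints the signal number encoded in a
# shell-style exit status, or "none" if STATUS is 128 or lower.
signal_from_status() {
    if [ "$1" -gt 128 ]; then
        echo $(( $1 - 128 ))
    else
        echo none
    fi
}

sig=$(signal_from_status 135)
echo "status 135 -> signal $sig"

# kill -l maps a number to a signal name on the machine running this script.
# Signal numbering is platform-specific, so the name can differ from the target's.
if [ "$sig" != none ]; then
    kill -l "$sig"
fi
```

On Linux, signal 7 is SIGBUS, matching the QEMU message.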
@douzzer
Can you please try without SRP (--disable-srp)? I'm not seeing any issue with QEMU on Linux. Please also try enabling ChaCha, Poly1305 and XChaCha.
Thanks, Sean
Shouldn't -mfpu=neon-vfpv3 just be plain -mfpu=neon or -mfpu=neon-vfp, with neon being a generic Arm processor, neon-vfp being certain types of Arm processor, and neon-vfpv3 being a specific Arm processor (which excludes v1, v2 & v4)?
Hi @paulwratt,
I have found it difficult in the past to get the right mfpu setting that will work! Using 'neon' works for me with QEMU so I've changed configure.ac. Note that the appropriate mfpu setting should be decided by each customer based on their hardware.
Thanks, Sean
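One way to see what a given board supports before choosing an -mfpu value is to look at the CPU feature flags the kernel reports. The sketch below checks a flags string for whole-word matches; `has_cpu_flag` is a hypothetical helper, the flags string is the one from the lscpu output earlier in this thread, and "aes" is included on the assumption that 32-bit Arm Linux reports that name when AES instructions are available.

```shell
#!/bin/sh
# has_cpu_flag "FLAGS" FLAG -> exit status 0 if FLAG is one of the
# space-separated words in FLAGS, 1 otherwise. Padding with spaces
# prevents "vfp" from matching inside "vfpv3".
has_cpu_flag() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Flags line copied from the Raspberry Pi lscpu output in this thread.
flags="half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32"

for f in neon vfpv3 vfpv4 aes; do
    if has_cpu_flag "$flags" "$f"; then
        echo "$f: yes"
    else
        echo "$f: no"
    fi
done
```

On a live system the flags string could be taken from lscpu or /proc/cpuinfo instead of being hard-coded.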
Tested with ./configure --host=armv7a --enable-armasm on Raspberry Pi. No build errors now, but testwolfcrypt is failing:
pi@raspberrypi:~/wolfssl $ ./wolfcrypt/test/testwolfcrypt
------------------------------------------------------------------------------
wolfSSL version 5.5.0
------------------------------------------------------------------------------
error test passed!
MEMORY test passed!
base64 test passed!
asn test passed!
RANDOM test passed!
MD5 test passed!
SHA test passed!
SHA-256 test passed!
SHA-384 test passed!
SHA-512 test passed!
Hash test passed!
HMAC-MD5 test passed!
HMAC-SHA test passed!
HMAC-SHA256 test passed!
HMAC-SHA384 test passed!
HMAC-SHA512 test passed!
HMAC-KDF test passed!
TLSv1.3 KDF test passed!
GMAC test passed!
Chacha test failed!
error = -4726
Exiting main with return code: -1
@haydenroche5 The error number is not unique. I've updated the test and made minor changes to the ChaCha asm. When you have time, run it again so I can determine which test case is actually failing.
Thanks, Sean
By the way, which model of the Pi do you have? May help to know which CPU you are using.
@douzzer Do you know which CPU is being emulated?
When you have time, run it again so I can determine which test case is actually failing.
Everything passed with the latest changes. :)
By the way, which model of the Pi do you have?
pi@raspberrypi:~ $ lscpu
Architecture: armv7l
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 3
Model name: Cortex-A72
Stepping: r0p3
CPU max MHz: 1500.0000
CPU min MHz: 600.0000
BogoMIPS: 108.00
Flags: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
Bizarre! I changed the code to use simpler instructions and that was the difference! Which version of the Raspberry Pi is it? v4 from the specs.
Which version of the Raspberry Pi is it? v4 from the specs
yeah, vfpv4 signifies RPi4
Using 'neon' works for me with QEMU so I've changed configure.ac
Thanks, that should get non-RPi armv7 to compile with neon instructions as well.
FYI, older RPis (older than the latest 2B+ and 3s) only have vfp and vfpv2 (that's the model 2B and older 2B+). All RPis have vfp. Non-RPi boards will only have vfp if it's a Broadcom VC based Arm CPU. neon works for any Arm processor with a compliant GPU.
I did not use the "Begin Review", but wanted to review the code to see what sort of "simplified assembler instruction" you were talking about, and came across these:
if BUILD_ARMASM_INLINE
which seemed to be flipped (unless I don't correctly understand how Make evaluates that statement).
The implied use would be that you do want to use the Asm (.S) version (i.e. use inlined assembler instructions), unless BUILD_ARMASM_INLINE is defined but not set, or set to 0 (i.e. use compiler-generated assembler from C code) - correct?
Just a question: is the intention to still inline the code either way? (I might presume so, if speed is of the essence, without knowing what the rest of the code does.) Or does "inline" here mean "use code in compiler output" as opposed to "inline code in assembly output"?
EDIT: WOLFSSL_ARMASM_NO_CRYPTO here means no Crypto in hardware, right?
Which version of the Raspberry Pi is it? v4 from the specs
yeah, vfpv4 signifies RPi4
Indeed. RPi 4 Model B.
BUILD_ARMASM_INLINE means the assembly code is inlined in the C code file. There is no difference between the two files except that one can be compiled with a C compiler and the other needs to go through the assembler.
WOLFSSL_ARMASM_NO_CRYPTO means the CPU doesn't support cryptographic instructions. Therefore assembly implementations not using cryptographic instructions have been introduced. Given the obvious strangeness of the define name I've changed it to WOLFSSL_ARMASM_NO_HW_CRYPTO.
@douzzer Could you try again? I've changed the -mfpu option and it might work now.
Thanks, Sean
Not all ARMv7a CPUs support all NEON instructions. Define WOLFSSL_ARM_ARCH_NO_VREV when vrev is not available. I removed usage of VTRN in a previous commit.
@SparkiDev thanks for the clarification. You are almost there by the looks of it.
BTW vfp4 works for the CM4 too (yes?)
Found a reference saying that Cortex-A9 and older CPUs only support 64-bit registers and not 128. Use WOLFSSL_ARM_ARCH_NEON_64BIT to indicate this. (Don't use VREV define anymore.)
@SparkiDev pulled latest PR 5152 I am still seeing the same issues:
./configure -host=armv7a --enable-armasm --enable-debug --disable-shared CFLAGS="-fomit-frame-pointer -DWOLFSSL_ARM_ARCH_NEON_64BIT" && make
...
gdb ./tests/unit.test
...
Program received signal SIGBUS, Bus error.
[Switching to Thread 0x76dbf440 (LWP 6759)]
Transform_Sha256_Len () at wolfcrypt/src/port/arm/armv8-32-sha256-asm.S:1563
1563 vrev32.8 q0, q0
./configure -host=armv7a --enable-armasm --enable-debug --disable-shared CFLAGS="-fomit-frame-pointer -DWOLFSSL_ARM_ARCH_NEON_64BIT -DWOLFSSL_ARM_ARCH_NO_VREV" && make
...
gdb ./tests/unit.test
...
Program received signal SIGBUS, Bus error.
[Switching to Thread 0x76dbf440 (LWP 12093)]
Transform_Sha256_Len () at wolfcrypt/src/port/arm/armv8-32-sha256-asm.S:1568
1568 vshl.i16 q4, q0, #8
The define probably isn't reaching the assembly code!
The define probably isn't reaching the assembly code!
Good point. Trying again with ASFLAGS
Still not working for me. Any suggestions?
./configure -host=armv7a --enable-armasm --enable-debug --disable-shared AM_CFLAGS="-fomit-frame-pointer -DWOLFSSL_ARM_ARCH_NEON_64BIT" AM_CCASFLAGS="-fomit-frame-pointer -DWOLFSSL_ARM_ARCH_NEON_64BIT" && make
...
wolfcrypt/src/port/arm/armv8-chacha.c: In function 'wc_Chacha_encrypt_256':
wolfcrypt/src/port/arm/armv8-chacha.c:1383:1: error: fp cannot be used in asm here
}
^
CC wolfcrypt/src/src_libwolfssl_la-chacha20_poly1305.lo
./configure -host=armv7a --enable-armasm AM_CFLAGS="-DWOLFSSL_ARM_ARCH_NEON_64BIT" AM_CCASFLAGS="-DWOLFSSL_ARM_ARCH_NEON_64BIT" && make
$ ./tests/unit.test
starting unit tests...
Begin API Tests
wolfSSL_Init(): passed
test_wolfSSL_ERR_strings: passed
wolfSSL_CTX_use_certificate_buffer(): passed
In verification callback, error = 0, unknown error number
Peer certs: 1
Subject's domain name at 0 is www.wolfssl.com
In verification callback, error = -188, ASN no signer error to confirm failure
Peer certs: 1
Subject's domain name at 0 is www.wolfssl.com
Allowing failed certificate check, testing only (shouldn't do this in production)
test_CertRsaPss: passed
Bus error
git log
commit 2c4c7ba6dad7b286fb14e9cf37c6bfec02f0d890
Author: Sean Parkinson <[email protected]>
Date: Mon Sep 12 10:00:18 2022 +1000
ARM v7a ASM: 128-bit registers not supported
Cortex-A5 - Cortex-A9 only support 64-bit wide NEON.
Remove use of WOLFSSL_ARM_ARCH_NO_VREV.
Use WOLFSSL_ARM_ARCH_NEON_64BIT to indicate to use 64-bit NEON registers
and not 128-bit NEON registers.
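Taking the commit message's claim at face value, a build script could decide whether to add the define based on the CPU model name. This is purely a sketch: `neon_width_define` is a hypothetical helper, and reading "Cortex-A5 - Cortex-A9" as the models A5, A7, A8 and A9 is an interpretation, not something the commit spells out.

```shell
#!/bin/sh
# neon_width_define "MODEL NAME" -> prints the extra preprocessor define
# for CPUs the commit message describes as 64-bit NEON only, or an empty
# line for anything else. The model list is an assumed reading of the
# range "Cortex-A5 - Cortex-A9".
neon_width_define() {
    case "$1" in
        Cortex-A5|Cortex-A7|Cortex-A8|Cortex-A9)
            echo "-DWOLFSSL_ARM_ARCH_NEON_64BIT" ;;
        *)
            echo "" ;;
    esac
}

# Model names as lscpu prints them, e.g. the Cortex-A72 seen in this thread.
for model in Cortex-A9 Cortex-A72; do
    printf '%s: [%s]\n' "$model" "$(neon_width_define "$model")"
done
```

The printed define would still need to reach both the C compiler and the assembler, as the configure commands in this thread do through AM_CFLAGS and AM_CCASFLAGS.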
Try without debug for now.
Try without debug for now.
I did. It was in the log. Still bus fault.
@SparkiDev I pulled latest. Still see the same bus error without debug. With debug same "fp cannot be used" error.
Tested using:
./configure -host=armv7a --enable-armasm AM_CFLAGS="-DWOLFSSL_ARM_ARCH_NEON_64BIT" AM_CCASFLAGS="-DWOLFSSL_ARM_ARCH_NEON_64BIT" && make
and
./configure -host=armv7a --enable-armasm --enable-debug --disable-shared AM_CFLAGS="-fomit-frame-pointer -DWOLFSSL_ARM_ARCH_NEON_64BIT" AM_CCASFLAGS="-fomit-frame-pointer -DWOLFSSL_ARM_ARCH_NEON_64BIT" && make
@dgarske Can you tell me which file the compiler is complaining about with the use of fp? I've eliminated the error with my QEMU build when DEBUG is defined and not NDEBUG in armv8-chacha.c.
@dgarske Can you tell me which file the compiler is complaining about with the use of fp? I've eliminated the error with my QEMU build when DEBUG is defined and not NDEBUG in armv8-chacha.c.
It was armv8-chacha.c:1383. See my comments above.