
ARM ASM: ARMv7a with NEON instructions

Open SparkiDev opened this issue 2 years ago • 31 comments

Description

Change to build assembly code for ARMv7a with the NEON instruction set:

./configure -host=armv7a --enable-armasm

Added an ARM32 SHA-256 NEON-only implementation.

Testing

Tested SHA-256 NEON implementation using QEMU.

Checklist

  • [ ] added tests
  • [ ] updated/added doxygen
  • [ ] updated appropriate READMEs
  • [ ] Updated manual and documentation

SparkiDev avatar May 18 '22 06:05 SparkiDev

retest this please

SparkiDev avatar Aug 01 '22 22:08 SparkiDev

@SparkiDev I've just done some testing on this.

First, I had to change armv7a to armv7a* in configure.ac (L2050) to accommodate the standard target tuple armv7a-unknown-linux-gnueabihf.

That done, to get --enable-armasm to build, I had to include --disable-chacha --disable-xchacha --disable-poly1305, because of errors like this:

/tmp/tmp.4346_8721/ccFSpc8A.s: Assembler messages:
/tmp/tmp.4346_8721/ccFSpc8A.s:270: Error: first transfer register must be even -- `ldrd r11,r10,[r14,#4*14]'
make[2]: *** [Makefile:5938: wolfcrypt/src/port/arm/src_libwolfssl_la-armv8-chacha.lo] Error 1

Once I did that, testsuite.test nominally succeeded, but crashed at exit:

[...]
mutex    test passed!
memcb    test passed!
Test complete
qemu: uncaught target signal 7 (Bus error) - core dumped

(This was repeatable.)

Other bits:

With -pedantic, this happens:

wolfcrypt/src/sha256.c:2002: error: ISO C forbids an empty translation unit [-Werror=pedantic]

Looks like an include.am oversight.

And finally, --enable-all --enable-armasm on armv7a breaks as follows:

wolfcrypt/src/port/arm/armv8-sha256.c: In function ‘wc_Sha256Transform’:
089ebf277f (<[email protected]> 2021-03-23 12:53:06 +1000 1540)     Sha256Transform(sha256, data, 1);
wolfcrypt/src/port/arm/armv8-sha256.c:1540:5: error: implicit declaration of function ‘Sha256Transform’; did you mean ‘wc_Sha256Transform’? [-Werror=implicit-function-declaration]
 1540 |     Sha256Transform(sha256, data, 1);
      |     ^~~~~~~~~~~~~~~
      |     wc_Sha256Transform
089ebf277f (<[email protected]> 2021-03-23 12:53:06 +1000 1540)     Sha256Transform(sha256, data, 1);
wolfcrypt/src/port/arm/armv8-sha256.c:1540:5: error: nested extern declaration of ‘Sha256Transform’ [-Werror=nested-externs]

This just has to be related to the wholesale replacement of sha256.c that happens for armasm.

The toolchain versions for the above:

[I-O] [  ] cross-armv7a-unknown-linux-gnueabihf/binutils-2.38-r2:2.38
[I-O] [  ] cross-armv7a-unknown-linux-gnueabihf/gcc-12.1.1_p20220625:12
[I-O] [  ] cross-armv7a-unknown-linux-gnueabihf/linux-headers-5.19:0
[I-O] [  ] cross-armv7a-unknown-linux-gnueabihf/glibc-2.35-r8:2.2
[IP-] [  ] app-emulation/qemu-7.0.0-r3:0

douzzer avatar Aug 05 '22 18:08 douzzer

Had to disable the same things as @douzzer. But, I was able to get some benchmarking results on my Raspberry Pi:

pi@raspberrypi:~/wolfssl $ lscpu; cat before_sha256.txt; cat after_sha256.txt; cat before_sha512.txt; cat after_sha512.txt 
Architecture:        armv7l
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               3
Model name:          Cortex-A72
Stepping:            r0p3
CPU max MHz:         1500.0000
CPU min MHz:         600.0000
BogoMIPS:            108.00
Flags:               half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
------------------------------------------------------------------------------
 wolfSSL version 5.4.0
------------------------------------------------------------------------------
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
SHA-256             80 MB took 1.008 seconds,   79.339 MB/s
Benchmark complete
------------------------------------------------------------------------------
 wolfSSL version 5.4.0
------------------------------------------------------------------------------
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
SHA-256            120 MB took 1.020 seconds,  117.677 MB/s
Benchmark complete
------------------------------------------------------------------------------
 wolfSSL version 5.4.0
------------------------------------------------------------------------------
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
SHA-512             40 MB took 1.019 seconds,   39.273 MB/s
Benchmark complete
------------------------------------------------------------------------------
 wolfSSL version 5.4.0
------------------------------------------------------------------------------
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
SHA-512             75 MB took 1.064 seconds,   70.472 MB/s
Benchmark complete

Looking quite a bit faster than before!

OpenSSL numbers:

pi@raspberrypi:~/wolfssl $ cat openssl_sha256.txt; echo ""; cat openssl_sha512.txt 
Doing sha256 for 3s on 1048576 size blocks: 469 sha256's in 3.00s
OpenSSL 1.1.1d  10 Sep 2019
built on: Fri Jan 31 15:37:19 2020 UTC
options:bn(64,32) rc4(char) des(long) aes(partial) blowfish(ptr) 
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -D__ARM_MAX_ARCH__=7 -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-ueKbAp/openssl-1.1.1d=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type        1048576 bytes
sha256          163927.38k

Doing sha512 for 3s on 1048576 size blocks: 270 sha512's in 3.00s
OpenSSL 1.1.1d  10 Sep 2019
built on: Fri Jan 31 15:37:19 2020 UTC
options:bn(64,32) rc4(char) des(long) aes(partial) blowfish(ptr) 
compiler: gcc -fPIC -pthread -Wa,--noexecstack -Wall -D__ARM_MAX_ARCH__=7 -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-ueKbAp/openssl-1.1.1d=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type        1048576 bytes
sha512           94371.84k

haydenroche5 avatar Aug 23 '22 04:08 haydenroche5

just tried this rebased on latest master, with the configure.ac tweak to recognize armv7a-unknown-linux-gnueabihf:

    autogen.sh 08b4fd6f2e-dirty...   real 0m14.917s  user 0m13.195s  sys 0m0.859s
    configure...   real 0m8.970s  user 0m4.785s  sys 0m5.044s
    build...   real 0m30.472s  user 2m5.040s  sys 0m6.234s
    testsuite.test...   real 0m28.347s  user 0m28.304s  sys 0m0.014s
testsuite.test for scenario cross-armv7a-all-armasm exited with status 135.
================================================================================
TLSv1.3 KDF test passed!
X963-KDF    test passed!
GMAC     test passed!
ARC4     test passed!
DES      test passed!
DES3     test passed!
AES      test passed!
AES192   test passed!
AES256   test passed!
AESOFB   test passed!
AES-GCM  test passed!
AES-CCM  test passed!
AES Key Wrap test passed!
AES-SIV  test passed!
CAMELLIA test passed!
RSA NOPAD test passed!
RSA      test passed!
DH       test passed!
DSA      test passed!
qemu: uncaught target signal 7 (Bus error) - core dumped
[... retries ...]
    cross-armv7a-all-armasm fail_check
    failed config: '--enable-all' 'CPPFLAGS=-DNO_WOLFSSL_CIPHER_SUITE_TEST -DWOLFSSL_OLD_PRIME_CHECK -pedantic' '--enable-asn=template' '--enable-armasm' '--disable-chacha' '--disable-poly1305' '--disable-xchacha' '--host=armv7a-unknown-linux-gnueabihf' 'FILECMD=/bin/false' 'MANIFEST_TOOL=/bin/false'

douzzer avatar Aug 25 '22 15:08 douzzer

@douzzer

Can you please try without SRP (--disable-srp)? I'm not seeing any issue with QEMU on Linux. Please also try enabling ChaCha, Poly1305 and XChaCha.

Thanks, Sean

SparkiDev avatar Sep 02 '22 00:09 SparkiDev

Shouldn't -mfpu=neon-vfpv3 just be plain -mfpu=neon or -mfpu=neon-vfp, neon being a generic Arm processor, neon-vfp being certain types of Arm processor, and neon-vfpv3 being a specific Arm processor (which excludes v1, v2 & v4)?

paulwratt avatar Sep 02 '22 09:09 paulwratt

Hi @paulwratt,

I have found it difficult in the past to get the right mfpu setting that will work! Using 'neon' works for me with QEMU so I've changed configure.ac. Note that the appropriate mfpu setting should be decided by each customer based on their hardware.

Thanks, Sean

SparkiDev avatar Sep 05 '22 22:09 SparkiDev

Tested with ./configure --host=armv7a --enable-armasm on Raspberry Pi. No build errors now, but testwolfcrypt is failing:

pi@raspberrypi:~/wolfssl $ ./wolfcrypt/test/testwolfcrypt 
------------------------------------------------------------------------------
 wolfSSL version 5.5.0
------------------------------------------------------------------------------
error    test passed!
MEMORY   test passed!
base64   test passed!
asn      test passed!
RANDOM   test passed!
MD5      test passed!
SHA      test passed!
SHA-256  test passed!
SHA-384  test passed!
SHA-512  test passed!
Hash     test passed!
HMAC-MD5 test passed!
HMAC-SHA test passed!
HMAC-SHA256 test passed!
HMAC-SHA384 test passed!
HMAC-SHA512 test passed!
HMAC-KDF    test passed!
TLSv1.3 KDF test passed!
GMAC     test passed!
Chacha   test failed!
 error = -4726
Exiting main with return code: -1

haydenroche5 avatar Sep 06 '22 17:09 haydenroche5

@haydenroche5 The error number is not unique. I've updated the test and made minor changes to the ChaCha asm. When you have time, run it again so I can determine which test case is actually failing.

Thanks, Sean

SparkiDev avatar Sep 06 '22 23:09 SparkiDev

By the way, which model of the Pi do you have? May help to know which CPU you are using.

SparkiDev avatar Sep 06 '22 23:09 SparkiDev

@douzzer Do you know which CPU is being emulated?

SparkiDev avatar Sep 06 '22 23:09 SparkiDev

When you have time, run it again so I can determine which test case is actually failing.

Everything passed with the latest changes. :)

By the way, which model of the Pi do you have?

pi@raspberrypi:~ $ lscpu
Architecture:        armv7l
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               3
Model name:          Cortex-A72
Stepping:            r0p3
CPU max MHz:         1500.0000
CPU min MHz:         600.0000
BogoMIPS:            108.00
Flags:               half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32

haydenroche5 avatar Sep 07 '22 01:09 haydenroche5

Bizarre! I changed the code to use simpler instructions and that was the difference! Which version of the Raspberry Pi is it? v4 from the specs.

SparkiDev avatar Sep 07 '22 01:09 SparkiDev

Which version of the Raspberry Pi is it? v4 from the specs

yeah vfpv4 signifies RPi4

Using 'neon' works for me with QEMU so I've changed configure.ac

Thanks, that should get non-RPi armv7 to compile with neon instructions as well.

FYI older RPi (than latest 2B+ & 3's) only have vfp and vfpv2 (that's the model 2B & older 2B+). All RPi have vfp. Non-RPi will only have vfp if it's a Broadcom VC based Arm CPU. neon works for any Arm processor with a compliant GPU

paulwratt avatar Sep 07 '22 10:09 paulwratt

I did not use "Begin Review", but I wanted to look at the code to see what sort of "simplified assembler instruction" you were talking about, and came across these: if BUILD_ARMASM_INLINE, which seemed to be flipped (unless I am misreading how Make evaluates that statement).

The implied use would be that you do want to use the Asm (.S) version (ie. use inlined assembler instructions), unless BUILD_ARMASM_INLINE is defined but not set or set to 0 (ie. use compiler generated assembler from C code) - correct?

Just a Question: is the intention to still inline the code either way? (I might presume so, if speed is of the essence, without knowing what the rest of the code does) Or does "inline" here mean "use code in compiler output" as opposed to "inline code in assembly output".

EDIT: WOLFSSL_ARMASM_NO_CRYPTO here means no Crypto in hardware right?

paulwratt avatar Sep 07 '22 10:09 paulwratt

Which version of the Raspberry Pi is it? v4 from the specs

yeah vfpv4 signifies RPi4

Indeed. RPi 4 Model B.

haydenroche5 avatar Sep 07 '22 19:09 haydenroche5

BUILD_ARMASM_INLINE means the assembly code is inlined in the C code file. There is no difference between the two files except that one can be compiled with a C compiler and the other needs to go through the assembler.

WOLFSSL_ARMASM_NO_CRYPTO means the CPU doesn't support cryptographic instructions. Therefore assembly implementations not using cryptographic instructions have been introduced. Given the obvious strangeness of the define name I've changed it to WOLFSSL_ARMASM_NO_HW_CRYPTO.

SparkiDev avatar Sep 07 '22 22:09 SparkiDev

@douzzer Could you try again? I've changed the -mfpu option and it might work now.

Thanks, Sean

SparkiDev avatar Sep 08 '22 01:09 SparkiDev

Not all ARMv7a CPUs support all NEON instructions. Define WOLFSSL_ARM_ARCH_NO_VREV when vrev is not available. I removed usage of VTRN in a previous commit.

SparkiDev avatar Sep 09 '22 00:09 SparkiDev

@SparkiDev thanks for clarification .. you are almost there by the looks of it

BTW vfpv4 works for the CM4 too (yes?)

paulwratt avatar Sep 11 '22 11:09 paulwratt

Found a reference saying that Cortex-A9 and older CPUs only support 64-bit registers and not 128. Use WOLFSSL_ARM_ARCH_NEON_64BIT to indicate this. (Don't use VREV define anymore.)

SparkiDev avatar Sep 12 '22 00:09 SparkiDev

@SparkiDev I pulled the latest from PR 5152 and am still seeing the same issues:

./configure -host=armv7a --enable-armasm --enable-debug --disable-shared CFLAGS="-fomit-frame-pointer -DWOLFSSL_ARM_ARCH_NEON_64BIT" && make
...
gdb ./tests/unit.test
...
Program received signal SIGBUS, Bus error.
[Switching to Thread 0x76dbf440 (LWP 6759)]
Transform_Sha256_Len () at wolfcrypt/src/port/arm/armv8-32-sha256-asm.S:1563
1563		vrev32.8	q0, q0
./configure -host=armv7a --enable-armasm --enable-debug --disable-shared CFLAGS="-fomit-frame-pointer -DWOLFSSL_ARM_ARCH_NEON_64BIT -DWOLFSSL_ARM_ARCH_NO_VREV" && make
...
gdb ./tests/unit.test
...
Program received signal SIGBUS, Bus error.
[Switching to Thread 0x76dbf440 (LWP 12093)]
Transform_Sha256_Len () at wolfcrypt/src/port/arm/armv8-32-sha256-asm.S:1568
1568		vshl.i16	q4, q0, #8

dgarske avatar Sep 12 '22 21:09 dgarske

The define probably isn't reaching the assembly code!

SparkiDev avatar Sep 12 '22 22:09 SparkiDev

The define probably isn't reaching the assembly code!

Good point. Trying again with ASFLAGS

dgarske avatar Sep 12 '22 22:09 dgarske

Still not working for me. Any suggestions?

./configure -host=armv7a --enable-armasm --enable-debug --disable-shared AM_CFLAGS="-fomit-frame-pointer -DWOLFSSL_ARM_ARCH_NEON_64BIT" AM_CCASFLAGS="-fomit-frame-pointer -DWOLFSSL_ARM_ARCH_NEON_64BIT" && make
...
wolfcrypt/src/port/arm/armv8-chacha.c: In function 'wc_Chacha_encrypt_256':
wolfcrypt/src/port/arm/armv8-chacha.c:1383:1: error: fp cannot be used in asm here
 }
 ^
  CC       wolfcrypt/src/src_libwolfssl_la-chacha20_poly1305.lo
./configure -host=armv7a --enable-armasm AM_CFLAGS="-DWOLFSSL_ARM_ARCH_NEON_64BIT" AM_CCASFLAGS="-DWOLFSSL_ARM_ARCH_NEON_64BIT" && make
 $ ./tests/unit.test
starting unit tests...
 Begin API Tests
   wolfSSL_Init(): passed
   test_wolfSSL_ERR_strings: passed
   wolfSSL_CTX_use_certificate_buffer(): passed
In verification callback, error = 0, unknown error number
	Peer certs: 1
	Subject's domain name at 0 is www.wolfssl.com
In verification callback, error = -188, ASN no signer error to confirm failure
	Peer certs: 1
	Subject's domain name at 0 is www.wolfssl.com
	Allowing failed certificate check, testing only (shouldn't do this in production)
   test_CertRsaPss: passed
Bus error
git log
commit 2c4c7ba6dad7b286fb14e9cf37c6bfec02f0d890
Author: Sean Parkinson <[email protected]>
Date:   Mon Sep 12 10:00:18 2022 +1000

    ARM v7a ASM: 128-bit registers not supported

    Cortex-A5 - Cortex-A9 only support 64-bit wide NEON.
    Remove use of WOLFSSL_ARM_ARCH_NO_VREV.
    Use WOLFSSL_ARM_ARCH_NEON_64BIT to indicate to use 64-bit NEON registers
    and not 128-bit NEON registers.

dgarske avatar Sep 12 '22 22:09 dgarske

Try without debug for now.

SparkiDev avatar Sep 12 '22 23:09 SparkiDev

Try without debug for now.

I did. It was in the log. Still bus fault.

dgarske avatar Sep 12 '22 23:09 dgarske

@SparkiDev I pulled latest. Still see the same bus error without debug. With debug same "fp cannot be used" error.

Tested using: ./configure -host=armv7a --enable-armasm AM_CFLAGS="-DWOLFSSL_ARM_ARCH_NEON_64BIT" AM_CCASFLAGS="-DWOLFSSL_ARM_ARCH_NEON_64BIT" && make, and also: ./configure -host=armv7a --enable-armasm --enable-debug --disable-shared AM_CFLAGS="-fomit-frame-pointer -DWOLFSSL_ARM_ARCH_NEON_64BIT" AM_CCASFLAGS="-fomit-frame-pointer -DWOLFSSL_ARM_ARCH_NEON_64BIT" && make

dgarske avatar Sep 16 '22 18:09 dgarske

@dgarske Can you tell me which file the compiler is complaining about with the use of fp? I've eliminated the error with my QEMU build when DEBUG is defined and not NDEBUG in armv8-chacha.c.

SparkiDev avatar Sep 19 '22 00:09 SparkiDev

@dgarske Can you tell me which file the compiler is complaining about with the use of fp? I've eliminated the error with my QEMU build when DEBUG is defined and not NDEBUG in armv8-chacha.c.

It was armv8-chacha.c:1383. See my comments above.

dgarske avatar Sep 19 '22 15:09 dgarske