wolfssl icon indicating copy to clipboard operation
wolfssl copied to clipboard

Update to copy unaligned data into buffer during sha256

Open billphipps opened this issue 2 years ago • 3 comments

Description

When aarch64 optimized assembly is used, the ld1 instruction is used to directly read data from a user-supplied data buffer during the sha256 update, which may be unaligned on a 64-bit boundary especially during wolfBoot. This correction detects non-aligned data pointers and XMEMCPY's the data into the aligned sha256->buffer structure, which roughly approximates the same operations the hardware would perform, but this avoids any unaligned 64-bit reads from memory. This specifically corrects accesses to on-chip RAM (OCRAM) on ARMv8 processors that must be mapped as Device instead of Normal memory, as all unaligned accesses to Device memory generate a synchronous unaligned data exception.

Testing

This change was installed on an NXP LS1028A board with wolfBoot executing out of OCRAM. Without this correction, all boot attempts with ARMASM enabled failed as the size of the image header is 44 bytes, causing the initial image data pointer to be unaligned.

Checklist

  • [ None ] added tests - triggering this error would require a specific hardware setup. No additional tests necessary.
  • [ N/A ] updated/added doxygen
  • [ N/A ] updated appropriate READMEs
  • [ N/A] Updated manual and documentation

billphipps avatar May 31 '23 17:05 billphipps

On a Cortex-A53:

Before this PR:

# /tmp/benchmark -sha256
------------------------------------------------------------------------------
 wolfSSL version 5.6.0
------------------------------------------------------------------------------
Math:   Multi-Precision: Wolf(SP) word-size=64 bits=4096 sp_int.c
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
SHA-256 updates on pointer 359795c040
SHA-256                    575 MiB took 1.005 seconds,  572.139 MiB/s
SHA-256 updates on pointer 3597b5e041
SHA-256                    545 MiB took 1.006 seconds,  541.750 MiB/s

After this PR:

# /tmp/benchmark -sha256
------------------------------------------------------------------------------
 wolfSSL version 5.6.0
------------------------------------------------------------------------------
Math:   Multi-Precision: Wolf(SP) word-size=64 bits=4096 sp_int.c
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
SHA-256 updates on pointer 439a639040
SHA-256                    575 MiB took 1.005 seconds,  572.139 MiB/s
SHA-256 updates on pointer 439a83b041
SHA-256                    185 MiB took 1.009 seconds,  183.350 MiB/s

Can we make this dependent on a specific architecture or macro flag?

JacobBarthelmeh avatar May 31 '23 20:05 JacobBarthelmeh

On a Cortex-A53:

Before this PR:

# /tmp/benchmark -sha256
------------------------------------------------------------------------------
 wolfSSL version 5.6.0
------------------------------------------------------------------------------
Math:   Multi-Precision: Wolf(SP) word-size=64 bits=4096 sp_int.c
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
SHA-256 updates on pointer 359795c040
SHA-256                    575 MiB took 1.005 seconds,  572.139 MiB/s
SHA-256 updates on pointer 3597b5e041
SHA-256                    545 MiB took 1.006 seconds,  541.750 MiB/s

After this PR:

# /tmp/benchmark -sha256
------------------------------------------------------------------------------
 wolfSSL version 5.6.0
------------------------------------------------------------------------------
Math:   Multi-Precision: Wolf(SP) word-size=64 bits=4096 sp_int.c
wolfCrypt Benchmark (block bytes 1048576, min 1.0 sec each)
SHA-256 updates on pointer 439a639040
SHA-256                    575 MiB took 1.005 seconds,  572.139 MiB/s
SHA-256 updates on pointer 439a83b041
SHA-256                    185 MiB took 1.009 seconds,  183.350 MiB/s

Can we make this dependent on a specific architecture or macro flag?

Yup! I was thinking the exact same thing, but got stuck deciding between providing an ARMASM-specific flag (like WOLFSSL_ARMASM_FORCE_ALIGNMENT) versus using the global WOLFSSL_USE_ALIGN (which really only implies static alignment). In my case, I'm leaning towards using WOLFSSL_USE_ALIGN as this would match how the XILINX offload works as well, but it could inadvertently cause performance problems in cases that don't need the alignment for these operations. Any opinions before I push again?

billphipps avatar Jun 01 '23 15:06 billphipps

Looks like in types.h the macro WOLFSSL_ARMASM turns on WOLFSSL_USE_ALIGN. Maybe the macro WC_HASH_DATA_ALIGNMENT would be a good fit here like with the MMCAU code section.

JacobBarthelmeh avatar Jun 01 '23 22:06 JacobBarthelmeh