
CMSIS-NN code leads to low efficiency: unnecessary memcpy calls all over the library

Open RockySong opened this issue 2 years ago • 3 comments

The CMSIS-NN functions heavily rely on some 4-byte accessor functions, such as:

__STATIC_FORCEINLINE q31_t arm_nn_read_q7x4_ia(const q7_t **in_q7)
{
    q31_t val;
    memcpy(&val, *in_q7, 4);
    *in_q7 += 4;

    return (val);
}

However, the data pointer is not naturally aligned to 4 bytes, and such memcpy calls are not optimized into an "LDR %[val], [%[ptr]], #4" instruction; the same applies to arm_nn_read_q15x2_ia. This slows the NN functions down by at least 2x! I suggest that the NN functions check whether the input and weight buffers are 4-byte aligned, and use inline asm to generate LDR/STR instructions, such as:

__STATIC_FORCEINLINE q31_t arm_nn_read_q7x4_ia_aligned(const q7_t **in_q7)
{
    const void *pDat = *in_q7;
    q31_t val;
    asm volatile("ldr %[val], [%[pDat]], #4\n"
                 : [val] "=r" (val), [pDat] "+r" (pDat)
                 :
                 :);
    *in_q7 = (const q7_t *)pDat; /* write back the post-incremented pointer ("_ia" = increment after) */

    return (val);
}
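To illustrate the run-time check idea, a caller could test the pointer once and pick the accessor accordingly. This is a sketch only, assuming the two accessors above and the CMSIS q7_t/q31_t typedefs are in scope; the dispatch helper name is made up:

#include <stdint.h>

static inline q31_t read_q7x4_ia_dispatch(const q7_t **in_q7)
{
    if (((uintptr_t)(*in_q7) & 3U) == 0U)
    {
        /* 4-byte aligned: single LDR with post-increment */
        return arm_nn_read_q7x4_ia_aligned(in_q7);
    }
    /* not aligned: fall back to the memcpy-based accessor */
    return arm_nn_read_q7x4_ia(in_q7);
}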

RockySong · May 26 '22 08:05

@RockySong .. Thanks for the question. For use with TensorFlow Lite for Microcontrollers the input/weight buffers are at least word aligned, but unaligned access is mostly a consequence of the model hyperparameters. For example, say you have a tensor with dimensions of 12x12x13 (WxHxC_IN) for a 1x1 convolution; that results in unaligned access, as a new row starts every 13th byte. On top of this there are multiple data types involved, as CMSIS-NN isn't a standalone system.
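To make that concrete, here is a small standalone illustration (not library code) of the byte offsets a 1x1 convolution reads from when C_IN = 13 and the data is int8:

#include <stdint.h>
#include <stdio.h>

/* Illustration only: with C_IN = 13 and int8 data, each spatial position's
   channel vector starts 13 bytes after the previous one, so a read starts
   on a 4-byte-aligned address only once every 4 positions. */
int main(void)
{
    const uint32_t c_in = 13U;
    for (uint32_t pos = 0U; pos < 8U; pos++)
    {
        uint32_t offset = pos * c_in;
        printf("position %2u: byte offset %3u, 4-byte aligned: %s\n",
               pos, offset, (offset % 4U == 0U) ? "yes" : "no");
    }
    return 0;
}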

https://github.com/ARM-software/CMSIS_5/issues/546 is the base ticket based on which we switched from type-punned pointer access, which resulted in undefined behavior, to memcpy. Here is a sample PR with the changes: https://github.com/ARM-software/CMSIS_5/commit/61d15e506c0a681c2352b87979a2e685c1c0f7ab
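For context, a minimal sketch of the general pattern (not the exact diff in the linked commit): the old style punned a q7_t pointer to q31_t, which breaks the alignment and strict-aliasing rules, while the memcpy form expresses the same 4-byte read without undefined behavior. The typedefs are repeated here only to keep the snippet self-contained.

#include <string.h>
#include <stdint.h>

typedef int8_t  q7_t;   /* usual CMSIS fixed-point typedefs, repeated */
typedef int32_t q31_t;  /* so the sketch stands alone                 */

static inline q31_t read_q7x4(const q7_t *p)
{
    /* Old, type-punned style -- undefined behavior (alignment and strict
       aliasing):
       return *(const q31_t *)p;                                          */

    /* memcpy style the library switched to: well-defined, and normally
       lowered to a single word load when the compiler's builtin memcpy
       is available. */
    q31_t val;
    memcpy(&val, p, 4);
    return val;
}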

I would expect a run-time check on the hyperparameters, used to decide between different versions of the word read, to be expensive.

Could you provide information on a sample tensor dimension, processor used and the optimization level where you notice this 2x drop in performance?

felix-johnny · May 30 '22 14:05

Unaligned accesses are probably not the problem. They are accepted on most recent Cortex-M architectures: https://developer.arm.com/documentation/dui0473/j/using-the-assembler/address-alignment

So memcpy should be optimized out by the compiler and replaced by LDR/STR, unless compilation options like -fno-builtin are used.

In the CMSIS-DSP library (where we have similar memory access functions) we test for __ARM_FEATURE_UNALIGNED with a #if defined.

With armclang, this is defined to 1 when unaligned accesses are allowed.
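A minimal sketch of that kind of guard (an illustration of the pattern, not the actual CMSIS-DSP source; q7_t/q31_t are the usual CMSIS int8/int32 typedefs, repeated here so the snippet stands alone):

#include <string.h>
#include <stdint.h>

typedef int8_t  q7_t;
typedef int32_t q31_t;

static inline q31_t read_q7x4_ia_sketch(const q7_t **in_q7)
{
    q31_t val;
    const q7_t *p = *in_q7;
#if defined(__ARM_FEATURE_UNALIGNED)
    /* Unaligned word access is allowed: a 4-byte memcpy is expected to be
       lowered to a single LDR. */
    memcpy(&val, p, 4);
#else
    /* Unaligned access is not allowed: assemble the word byte by byte. */
    val = (q31_t)(((uint32_t)(uint8_t)p[0])
                | ((uint32_t)(uint8_t)p[1] << 8)
                | ((uint32_t)(uint8_t)p[2] << 16)
                | ((uint32_t)(uint8_t)p[3] << 24));
#endif
    *in_q7 += 4;
    return val;
}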

With gcc, you may have to pass -munaligned-access, because in the gcc docs I am seeing:

Enables (or disables) reading and writing of 16- and 32- bit values from addresses that are not 16- or 32- bit aligned. By default unaligned access is disabled for all pre-ARMv6, all ARMv6-M and for ARMv8-M Baseline architectures, and enabled for all other architectures. If unaligned access is not enabled then words in packed data structures are accessed a byte at a time.

So, the feature may not be enabled in the compiler even in cases where it is supported in the core.
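A cheap way to check what the compiler actually decided for a given set of flags is a compile-time probe like the following (something to add in your own build, not part of the library):

/* If this warning fires when the file is built with the same flags as
   CMSIS-NN, unaligned word access is disabled and 4-byte memcpy calls may
   be expanded byte by byte instead of becoming a single LDR. */
#if !defined(__ARM_FEATURE_UNALIGNED)
#warning "__ARM_FEATURE_UNALIGNED not defined: compiler will not emit unaligned LDR/STR"
#endif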

christophe0606 · Jun 10 '22 05:06

Thank you all for your kind suggestions! After looking through the compiler options, I identified that "-fno-builtin" and "-ffreestanding" prevent armclang and gcc from generating LDR/STR for memcpy calls with a length of 4. BTW, it seems the SXTB16 and data load instructions, plus the Im2Col step, limit the MAC rate to about 0.4 MAC/cycle on M33 or 0.6 MAC/cycle on M7, while if all instructions were "SMLAD" the theoretical limit would be 2 MAC/cycle.
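As a rough back-of-the-envelope illustration of why the rate drops (the instruction mix below is an assumption chosen to match the ~0.4 figure, not a measurement): SMLAD retires 2 MACs per instruction, so the effective rate is 2 divided by the total number of instructions spent per SMLAD.

#include <stdio.h>

int main(void)
{
    /* Assumed, illustrative mix: each SMLAD (2 MACs) is accompanied by about
       4 overhead instructions (loads, SXTB16 sign extensions, im2col address
       arithmetic). */
    const double macs_per_smlad = 2.0;
    const double overhead_per_smlad = 4.0;

    /* Assuming roughly one instruction per cycle: */
    double mac_per_cycle = macs_per_smlad / (1.0 + overhead_per_smlad);
    printf("effective rate: %.2f MAC/cycle (vs 2.0 for back-to-back SMLAD)\n",
           mac_per_cycle);
    return 0;
}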

RockySong · Jun 11 '22 03:06