CMSIS_5
CMSIS-NN code leads to low efficiency: unnecessary memcpy calls all over the library
CMSIS-NN functions heavily rely on 4-byte accessor functions, such as:
```c
__STATIC_FORCEINLINE q31_t arm_nn_read_q7x4_ia(const q7_t **in_q7)
{
    q31_t val;
    memcpy(&val, *in_q7, 4);
    *in_q7 += 4;

    return (val);
}
```
However, the data pointer is not always naturally aligned to 4 bytes, and such `memcpy` calls are not optimized into a single `LDR %[val], [%[ptr]], #4` instruction; the same applies to `arm_nn_read_q15x2_ia`. This slows the NN functions down by at least 2x! I suggest that the NN functions check whether the input and weight buffers are aligned to 4 bytes, and use inline assembly to generate LDR/STR instructions, such as:
```c
__STATIC_FORCEINLINE q31_t arm_nn_read_q7x4_ia_aligned(const q7_t **in_q7)
{
    const void *pDat = *in_q7;
    q31_t val;
    asm volatile("ldr %[val], [%[pDat]], #4\n"
                 : [val] "=r" (val), [pDat] "+r" (pDat)
                 : :);
    *in_q7 += 4; /* advance the caller's pointer (the "_ia" increment-after) */

    return (val);
}
```
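For illustration, the alignment check could be a simple dispatcher along these lines (a hypothetical sketch; `arm_nn_read_q7x4_ia_checked` is my name, and in a real kernel the test would be hoisted out of the inner loop and done once per buffer):

```c
#include <stdint.h>

/* Sketch: choose the LDR fast path only when the current pointer is
 * word aligned; otherwise keep the portable memcpy-based reader. */
__STATIC_FORCEINLINE q31_t arm_nn_read_q7x4_ia_checked(const q7_t **in_q7)
{
    if (((uintptr_t)*in_q7 & 0x3U) == 0U)
    {
        return arm_nn_read_q7x4_ia_aligned(in_q7); /* single post-indexed LDR */
    }
    return arm_nn_read_q7x4_ia(in_q7); /* unaligned fallback */
}
```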
@RockySong Thanks for the question. For use with TensorFlow Lite for Microcontrollers the input/weight buffers are at least word aligned, but unaligned access is mostly a consequence of the model's hyperparameters. For example, say you have a tensor with dimensions 12x12x13 (W x H x C_IN) for a 1x1 convolution; that results in unaligned accesses, as a new row starts every 13th byte. On top of this, there are multiple data types involved, as CMSIS-NN isn't a standalone system.
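To make the arithmetic concrete, here is a small stand-alone illustration (mine, not from CMSIS-NN) of how a 13-channel q7 tensor produces mostly unaligned 4-byte load addresses:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint32_t c_in = 13; /* channels of q7 data per pixel */

    /* For a 1x1 convolution, each input vector starts c_in bytes after
     * the previous one, so only every 4th start is word aligned. */
    for (uint32_t px = 0; px < 8; px++)
    {
        uint32_t offset = px * c_in;
        printf("pixel %u: byte offset %u -> %s\n",
               (unsigned)px, (unsigned)offset,
               (offset % 4 == 0) ? "aligned" : "unaligned");
    }
    return 0;
}
```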
https://github.com/ARM-software/CMSIS_5/issues/546 is the base ticket; based on it, we switched from type-punned pointer access, which resulted in undefined behavior, to memcpy. Here is a sample PR with the changes: https://github.com/ARM-software/CMSIS_5/commit/61d15e506c0a681c2352b87979a2e685c1c0f7ab
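For context, the difference between the old type-punned access and the memcpy replacement looks roughly like this (a sketch with local stand-ins for the CMSIS typedefs, not the exact code from the issue or PR):

```c
#include <stdint.h>
#include <string.h>

typedef int8_t  q7_t;  /* stand-in for the CMSIS typedef */
typedef int32_t q31_t; /* stand-in for the CMSIS typedef */

/* Old style: type-punned pointer cast. This violates strict aliasing and
 * assumes 4-byte alignment, so it is undefined behavior in C. */
static inline q31_t read_q7x4_punned(const q7_t *p)
{
    return *(const q31_t *)p;
}

/* New style: memcpy expresses the same 4-byte load legally; with builtins
 * enabled, compilers lower a fixed-size 4-byte memcpy to a single LDR. */
static inline q31_t read_q7x4_memcpy(const q7_t *p)
{
    q31_t val;
    memcpy(&val, p, 4);
    return val;
}
```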
I would expect a run-time check on the hyperparameters, to decide between different versions of the word read, to be expensive.
Could you provide a sample tensor dimension, the processor used, and the optimization level at which you notice this 2x drop in performance?
Unaligned accesses are probably not the problem. They are accepted on most recent Cortex-M architectures: https://developer.arm.com/documentation/dui0473/j/using-the-assembler/address-alignment
So, `memcpy` should be optimized out by the compiler and replaced by LDR / STR, except if some compilation options like `-fno-builtin` are used.
In the CMSIS-DSP library (where we have similar memory access functions) we are testing for `__ARM_FEATURE_UNALIGNED` with a `#if defined`.
With armclang, this is defined to 1 when unaligned accesses are allowed.
With gcc, you may have to define `-munaligned-access`, because in the gcc specs I am seeing:
> Enables (or disables) reading and writing of 16- and 32-bit values from addresses that are not 16- or 32-bit aligned. By default unaligned access is disabled for all pre-ARMv6, all ARMv6-M and for ARMv8-M Baseline architectures, and enabled for all other architectures. If unaligned access is not enabled then words in packed data structures are accessed a byte at a time.
So, the feature may not be enabled in the compiler even in cases where it is supported in the core.
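A minimal sketch of that kind of guard (my illustration of the pattern with stand-in typedefs, not the actual CMSIS-DSP source):

```c
#include <stdint.h>
#include <string.h>

typedef int8_t  q7_t;  /* stand-in for the CMSIS typedef */
typedef int32_t q31_t; /* stand-in for the CMSIS typedef */

static inline q31_t read_q7x4(const q7_t *p)
{
#if defined(__ARM_FEATURE_UNALIGNED)
    /* Core and compiler both allow unaligned word access: a fixed-size
     * memcpy is lowered to a single (possibly unaligned) LDR. */
    q31_t val;
    memcpy(&val, p, 4);
    return val;
#else
    /* Unaligned access not available: assemble the word byte by byte. */
    uint32_t b0 = (uint8_t)p[0];
    uint32_t b1 = (uint8_t)p[1];
    uint32_t b2 = (uint8_t)p[2];
    uint32_t b3 = (uint8_t)p[3];
    return (q31_t)(b0 | (b1 << 8) | (b2 << 16) | (b3 << 24));
#endif
}
```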
Thank you all for your kind suggestions! After looking into the compiler options, I identified that `-fno-builtin` and `-ffreestanding` prevent armclang and gcc from generating LDR/STR for `memcpy` calls with a length of 4. BTW, it seems the SXTB16 and data-load instructions, together with the im2col step, limit the MAC rate to about 0.4 MAC/cycle on the M33 or 0.6 MAC/cycle on the M7, whereas if all instructions were SMLAD the theoretical limit would be 2 MAC/cycle.
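For reference, the 2 MAC/cycle ceiling comes from SMLAD performing two 16x16 multiply-accumulates in one instruction. A portable sketch of such an inner loop (illustrative only; the real CMSIS-NN kernels typically use the `__SMLAD` intrinsic plus SXTB16 to widen q7 data):

```c
#include <stdint.h>
#include <string.h>

/* Two signed 16x16 multiply-accumulates on packed halfwords; on Cortex-M
 * cores with the DSP extension, the CMSIS __SMLAD intrinsic maps this to a
 * single-cycle SMLAD instruction (hence the 2 MAC/cycle ceiling). */
static inline int32_t smlad(uint32_t x, uint32_t y, int32_t sum)
{
    int32_t xl = (int16_t)(x & 0xFFFFu), xh = (int16_t)(x >> 16);
    int32_t yl = (int16_t)(y & 0xFFFFu), yh = (int16_t)(y >> 16);
    return sum + xl * yl + xh * yh;
}

/* Dot product of two q15 vectors: two MACs per smlad() call. */
int32_t dot_q15(const int16_t *a, const int16_t *b, uint32_t n)
{
    int32_t acc = 0;

    for (uint32_t i = 0; i + 2 <= n; i += 2)
    {
        uint32_t va, vb;
        /* 4-byte loads of two packed q15 values each; this is exactly the
         * arm_nn_read_q15x2-style access discussed above. */
        memcpy(&va, &a[i], 4);
        memcpy(&vb, &b[i], 4);
        acc = smlad(va, vb, acc);
    }
    if (n & 1u)
    {
        acc += (int32_t)a[n - 1] * b[n - 1]; /* odd tail element */
    }
    return acc;
}
```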