
TinyEngine convolutional layer has greater latency than ARM's CMSIS-NN

Open • ellial opened this issue 1 year ago • 1 comment

Hello,

I was measuring the latency of one of TinyEngine's convolutional kernels (convolve_s8_kernel3_stride1_pad1) against CMSIS-NN's fast convolutional kernel (arm_convolve_HWC_q7_fast). The TinyEngine kernel took approx. 200000 cycles, while the CMSIS-NN kernel took approx. 130000 cycles.

  • Is the additional overhead due to TinyEngine's per-channel requantization? Could you explain why per-channel requantization is needed in the kernel?
  • Have you tried benchmarking the latencies of the frameworks per kernel? If so, could you share the results?
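
For readers unfamiliar with the term in the first question, here is a minimal sketch of what per-channel requantization of int32 accumulators can look like. It is not TinyEngine's or CMSIS-NN's actual code: `requantize`, `requantize_per_channel`, and `clamp_s8` are hypothetical helpers, and the multiplier/shift encoding is a simplified assumption (scale = multiplier / 2^shift). The point is only that each output channel loads its own multiplier and shift, whereas a per-tensor scheme reuses one pair for every output. Whether this accounts for the cycle gap reported above is not established in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit value into the int8 output range. */
static int8_t clamp_s8(int32_t v) {
    if (v > 127) return 127;
    if (v < -128) return -128;
    return (int8_t)v;
}

/* Hypothetical helper: scale acc by multiplier / 2^shift, rounding half
 * away from zero. Division is used instead of a right shift so that
 * negative products are handled portably. */
static int32_t requantize(int32_t acc, int32_t multiplier, int shift) {
    assert(shift >= 1 && shift <= 62);
    int64_t prod = (int64_t)acc * (int64_t)multiplier;
    int64_t half = (int64_t)1 << (shift - 1);
    int64_t div = (int64_t)1 << shift;
    if (prod >= 0) {
        return (int32_t)((prod + half) / div);
    }
    return (int32_t)(-((-prod + half) / div));
}

/* Per-channel requantization (assumed HWC layout): output channel c uses
 * multiplier[c] and shift[c], so every output element needs two extra
 * table lookups compared with a single per-tensor scale. */
void requantize_per_channel(const int32_t *acc, int8_t *out,
                            int pixels, int channels,
                            const int32_t *multiplier, const int *shift,
                            int32_t out_offset) {
    for (int p = 0; p < pixels; ++p) {
        for (int c = 0; c < channels; ++c) {
            int idx = p * channels + c;
            int32_t v = requantize(acc[idx], multiplier[c], shift[c]);
            out[idx] = clamp_s8(v + out_offset);
        }
    }
}
```

With two channels scaled by 0.5 and 1.0 (multipliers `1 << 14` and `1 << 15`, shift 15), the accumulators `{100, 100, -100, 1000}` for two pixels map to `{50, 100, -50, 127}`, the last value being saturated to the int8 maximum.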

Thank you in advance.

ellial • Apr 02 '23 07:04

Hi @ellial,

convolve_s8_kernel3_stride1_pad1 is a deprecated kernel and is not actively used in TinyEngine. For 3x3 convolutions, we use https://github.com/mit-han-lab/tinyengine/blob/main/TinyEngine/src/kernels/int_forward_op/convolve_u8_kernel3_inputch3_stride2_pad1.c instead. Please also note that for MobileNet-like models, most of the computation goes to pointwise and depthwise convolutions.

meenchen • Apr 04 '23 17:04