[microTVM] Replace fancy depthwise_conv2d kernel packing scheme
For a while, I've intended to fix my depthwise_conv2d schedule so that its unique weight repacking scheme happens at compile time instead of during inference. While working on this, though, I discovered that the SMLAD instruction we use to compute multiplications in parallel does not actually save time.
Recall that the SMLAD instruction takes two packed int16 pairs x1::x2 and y1::y2 plus an accumulator z, and computes z += x1 * y1 + x2 * y2. For NHWC layouts, however, the relevant x1::x2 values in the input tensor are not adjacent in memory. Previously, we used the DSP-specific halfword packing instruction __PKHBT to fix this, and then called __SMLAD afterwards - two instructions for two multiplies. This is also what CMSIS-NN's most optimized depthwise convolution code does, as sketched below.
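For context, here is a minimal sketch of that two-instruction pattern, written with CMSIS-style intrinsics. The helper name, the `stride` parameter, and the pre-packed weight pair are illustrative assumptions, not the code TVM actually generates:

```c
#include <stdint.h>
#include "cmsis_compiler.h"  /* assumed CMSIS-Core header providing __PKHBT and __SMLAD */

/* Multiply-accumulate two int16 input values that sit `stride` elements apart
 * (as they do in an NHWC tensor) against a pre-packed pair of int16 weights.
 * Illustrative helper only. */
static inline int32_t mac2_pkhbt_smlad(const int16_t *x, int stride,
                                       uint32_t w_pair, int32_t acc)
{
    /* PKHBT: pack x[0] into the bottom halfword and x[stride] into the top. */
    uint32_t x_pair = __PKHBT(x[0], x[stride], 16);

    /* SMLAD: acc += lo(x_pair) * lo(w_pair) + hi(x_pair) * hi(w_pair).
     * Two multiplies, but two instructions total once the PKHBT is counted. */
    return (int32_t)__SMLAD(x_pair, w_pair, acc);
}
```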
However, there is a lesser-known non-DSP instruction SMLAxy that is present on all Cortex-M cores (see docs). It reads just one 16-bit half of an int32 register while performing a multiply-accumulate, which lets us skip the PKHBT instruction entirely. Doing the multiplies this way is just as fast, while being simpler and more versatile.
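Here is the same two-multiply update done with SMLAxy, sketched with GCC inline assembly; the helper name and data layout are again assumptions, not the exact code this PR emits:

```c
#include <stdint.h>

/* Same computation as above, but with two SMLAxy instructions instead of
 * PKHBT + SMLAD. Each SMLAxy reads a 16-bit half of a register directly,
 * so no packing step is needed. Illustrative sketch only. */
static inline int32_t mac2_smlaxy(const int16_t *x, int stride,
                                  uint32_t w_pair, int32_t acc)
{
    int32_t x0 = x[0];       /* input value paired with the bottom weight */
    int32_t x1 = x[stride];  /* input value paired with the top weight    */

    /* SMLABB: acc += bottom(x0) * bottom(w_pair) */
    __asm__ ("smlabb %0, %1, %2, %3"
             : "=r"(acc) : "r"(x0), "r"(w_pair), "r"(acc));

    /* SMLABT: acc += bottom(x1) * top(w_pair) */
    __asm__ ("smlabt %0, %1, %2, %3"
             : "=r"(acc) : "r"(x1), "r"(w_pair), "r"(acc));

    return acc;
}
```

Both versions spend two instructions on two multiplies, which is why SMLAD has no advantage here; the SMLAxy version simply drops the packing requirement.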
This PR changes the Cortex-M depthwise convolution setup to use SMLAxy instead of SMLAD. It also removes the 3x3 kernel restriction and the complicated kernel packing mechanism. The net effect is that the schedule is slightly faster for kernels with an odd number of entries (about 10% faster for 3x3 kernels).
This is still a draft PR, as I need to resolve an issue where topi.reshape introduces redundant instructions.
cc @Mousius @ekalda @leandron
Why doesn't SMLAD improve performance?
As explained above, it is impossible to use SMLAD to speed up depthwise convolution for input tensors in NHWC format: nowhere in the input tensor is the relevant data already packed correctly, so at least one extra instruction is needed to fix the layout. That extra instruction cancels out all of SMLAD's benefit compared to the non-DSP instruction SMLAxy.
Note that for the NCHW format, SMLAD would be very helpful. We should look into changing the format in the Relay graph, as this would yield a major performance improvement.
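As a rough illustration of why NCHW would change the picture, the sketch below processes one channel's row of int16 inputs and weights: adjacent values can be loaded as a packed pair and fed straight to SMLAD with no PKHBT. The helper name and loop structure are hypothetical, and a little-endian target is assumed:

```c
#include <stdint.h>
#include <string.h>
#include "cmsis_compiler.h"  /* assumed CMSIS-Core header providing __SMLAD */

/* In NCHW, values at (h, w) and (h, w + 1) of one channel are adjacent int16s,
 * so a single 32-bit load already yields a packed pair for SMLAD.
 * Illustrative only; TVM does not currently emit this. */
static int32_t row_dot_nchw(const int16_t *x_row, const int16_t *k_row,
                            int width, int32_t acc)
{
    for (int w = 0; w + 1 < width; w += 2) {
        uint32_t x_pair, k_pair;
        memcpy(&x_pair, &x_row[w], sizeof(x_pair));  /* two adjacent inputs  */
        memcpy(&k_pair, &k_row[w], sizeof(k_pair));  /* two adjacent weights */
        acc = (int32_t)__SMLAD(x_pair, k_pair, acc); /* two MACs, one instr. */
    }
    return acc;
}
```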
Thanks for the detailed comments @areusch @tkonolige! I've addressed your comments with dee04b10 - please take another look.