
Depthwise 2D convolution

Acly opened this issue 9 months ago

This PR adds kernels for depthwise 2D convolution (CPU only for now).

There is an existing ggml_conv_2d_dw based on im2col + mul_mat, but it has high overhead. That approach makes sense for regular conv2d, where it can benefit from fast GEMM, but depthwise convolution is much simpler, and I think im2col will always slow it down.

Timings (W=256, H=256, C=256)

| Method                 | Layout | Time        |
|------------------------|--------|-------------|
| ggml_conv_2d_dw        | WHCN   | 320 ms ± 25 |
| ggml_depthwise_conv_2d | WHCN   | 25 ms ± 5   |
| ggml_depthwise_conv_2d | CWHN   | 8 ms ± 0.5  |

Timings (W=1024, H=1024, C=3)

| Method                 | Layout | Time        |
|------------------------|--------|-------------|
| ggml_conv_2d_dw        | WHCN   | 54.6 ms ± 5 |
| ggml_depthwise_conv_2d | WHCN   | 8.4 ms ± 2  |
| ggml_depthwise_conv_2d | CWHN   | 5.2 ms ± 1  |

I didn't replace ggml_conv_2d_dw because it supports more backends (and dilation).
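For reference, the per-channel structure a direct depthwise kernel exploits can be sketched in plain C. This is a simplified illustration (stride 1, no padding, WHCN layout), not the actual kernel from this PR:

```c
/* Depthwise 2D convolution, WHCN layout: each channel is convolved with
 * its own KWxKH kernel, independently of all other channels. */
static void depthwise_conv2d_whcn(
        const float *src, const float *ker, float *dst,
        int W, int H, int C, int KW, int KH) {
    const int OW = W - KW + 1, OH = H - KH + 1;
    for (int c = 0; c < C; ++c) {
        const float *s = src + (long)c * W * H;   /* channel's input plane  */
        const float *k = ker + (long)c * KW * KH; /* channel's own kernel   */
        float       *d = dst + (long)c * OW * OH; /* channel's output plane */
        for (int oy = 0; oy < OH; ++oy)
        for (int ox = 0; ox < OW; ++ox) {
            float acc = 0.0f;
            for (int ky = 0; ky < KH; ++ky)
            for (int kx = 0; kx < KW; ++kx)
                acc += s[(oy + ky) * W + (ox + kx)] * k[ky * KW + kx];
            d[oy * OW + ox] = acc;
        }
    }
}
```

No intermediate buffer is needed; each output element is a K²-tap accumulation read directly from the input.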

Memory layout

Having channels/depth innermost (most contiguous) in memory allows for better vectorization. It also improves memory access for im2col in regular 2D convolutions, and can avoid many costly ggml_cont(ggml_permute(...)) calls. Since the default for 2D ops in the API seems to be spatial dimensions first, that default is kept in place, and the opportunity to use the channels-first kernel is detected from the tensor strides. This could also be made more explicit.
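A minimal sketch of why the channels-first layout helps: with C innermost, the kernel's inner loop walks contiguous memory across all channels at one pixel, which a compiler can auto-vectorize. Function and variable names here are illustrative, not the PR's actual code (stride 1, no padding assumed):

```c
#include <string.h>

/* Depthwise 2D convolution, CWHN layout: the inner loop over c is a
 * contiguous, stride-1 fused multiply-add over all channels at once. */
static void depthwise_conv2d_cwhn(
        const float *src, const float *ker, float *dst,
        int W, int H, int C, int KW, int KH) {
    const int OW = W - KW + 1, OH = H - KH + 1;
    for (int oy = 0; oy < OH; ++oy)
    for (int ox = 0; ox < OW; ++ox) {
        float *d = dst + ((long)oy * OW + ox) * C;
        memset(d, 0, (size_t)C * sizeof(float));
        for (int ky = 0; ky < KH; ++ky)
        for (int kx = 0; kx < KW; ++kx) {
            const float *s = src + ((long)(oy + ky) * W + (ox + kx)) * C;
            const float *k = ker + ((long)ky * KW + kx) * C;
            for (int c = 0; c < C; ++c) /* contiguous: vectorizes well */
                d[c] += s[c] * k[c];
        }
    }
}
```

In WHCN the same per-pixel gather touches C planes that are W*H elements apart, so each channel costs a strided load instead of one contiguous sweep.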

Background

I've implemented MobileSAM (a fast SAM variant with TinyViT as the image encoder) here. Runtime was ~2.1 s initially, with depthwise convolution eating a sizeable chunk of that. After changing the memory layout and optimizing conv2d, it now runs in 570 ms (PyTorch: 608 ms, ONNX: 549 ms).

Ryzen 5 5600X (6 cores, AVX2), Windows, OpenBLAS

Acly · Mar 20 '25