
Depthwise 2D convolution

Acly opened this issue 9 months ago

This PR adds kernels for depthwise 2D convolution (CPU only for now).

There is an existing ggml_conv_2d_dw based on im2col + mul_mat, but it has high overhead. That approach makes sense for regular conv2d, where it can benefit from fast GEMM, but depthwise convolution is much simpler, and I think im2col will always slow it down.

Timings (W=256, H=256, C=256)

| Method                 | Layout | Time        |
|------------------------|--------|-------------|
| ggml_conv_2d_dw        | WHCN   | 320 ms ± 25 |
| ggml_depthwise_conv_2d | WHCN   | 25 ms ± 5   |
| ggml_depthwise_conv_2d | CWHN   | 8 ms ± 0.5  |

Timings (W=1024, H=1024, C=3)

| Method                 | Layout | Time        |
|------------------------|--------|-------------|
| ggml_conv_2d_dw        | WHCN   | 54.6 ms ± 5 |
| ggml_depthwise_conv_2d | WHCN   | 8.4 ms ± 2  |
| ggml_depthwise_conv_2d | CWHN   | 5.2 ms ± 1  |

I didn't replace ggml_conv_2d_dw because it supports more backends (and dilation).
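For reference, the per-channel structure a direct depthwise kernel exploits can be sketched in plain C. This is a simplified illustration (stride 1, no padding, WHCN layout), not the actual kernel from this PR:

```c
/* Depthwise 2D convolution, WHCN layout: each channel is convolved with
 * its own KWxKH kernel, independently of all other channels. */
static void depthwise_conv2d_whcn(
        const float *src, const float *ker, float *dst,
        int W, int H, int C, int KW, int KH) {
    const int OW = W - KW + 1, OH = H - KH + 1;
    for (int c = 0; c < C; ++c) {
        const float *s = src + (long)c * W * H;   /* channel's input plane  */
        const float *k = ker + (long)c * KW * KH; /* channel's own kernel   */
        float       *d = dst + (long)c * OW * OH; /* channel's output plane */
        for (int oy = 0; oy < OH; ++oy)
        for (int ox = 0; ox < OW; ++ox) {
            float acc = 0.0f;
            for (int ky = 0; ky < KH; ++ky)
            for (int kx = 0; kx < KW; ++kx)
                acc += s[(oy + ky) * W + (ox + kx)] * k[ky * KW + kx];
            d[oy * OW + ox] = acc;
        }
    }
}
```

No intermediate buffer is needed; each output element is a K²-tap accumulation read directly from the input.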

Memory layout

Having channels/depth innermost (most contiguous) in memory allows for better vectorization. It also improves memory access for im2col in regular 2D convolutions, and can avoid many costly ggml_cont(ggml_permute(...)) calls. Since the default for 2D ops in the API seems to be spatial dimensions first, that default is kept in place, and the opportunity to use the channels-first kernel is detected from the tensor strides. This could also be made more explicit.
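A minimal sketch of why the channels-first layout helps: with C innermost, the kernel's inner loop walks contiguous memory across all channels at one pixel, which a compiler can auto-vectorize. Function and variable names here are illustrative, not the PR's actual code (stride 1, no padding assumed):

```c
#include <string.h>

/* Depthwise 2D convolution, CWHN layout: the inner loop over c is a
 * contiguous, stride-1 fused multiply-add over all channels at once. */
static void depthwise_conv2d_cwhn(
        const float *src, const float *ker, float *dst,
        int W, int H, int C, int KW, int KH) {
    const int OW = W - KW + 1, OH = H - KH + 1;
    for (int oy = 0; oy < OH; ++oy)
    for (int ox = 0; ox < OW; ++ox) {
        float *d = dst + ((long)oy * OW + ox) * C;
        memset(d, 0, (size_t)C * sizeof(float));
        for (int ky = 0; ky < KH; ++ky)
        for (int kx = 0; kx < KW; ++kx) {
            const float *s = src + ((long)(oy + ky) * W + (ox + kx)) * C;
            const float *k = ker + ((long)ky * KW + kx) * C;
            for (int c = 0; c < C; ++c) /* contiguous: vectorizes well */
                d[c] += s[c] * k[c];
        }
    }
}
```

In WHCN the same per-pixel gather touches C planes that are W*H elements apart, so each channel costs a strided load instead of one contiguous sweep.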

Background

I've implemented MobileSAM (a fast SAM variant with TinyViT as the image encoder) here. Runtime was ~2.1 s initially, with depthwise convolution eating a sizeable chunk of that. After changing the memory layout and optimizing conv2d, it now runs in 570 ms (PyTorch: 608 ms, ONNX: 549 ms).

Ryzen 5 5600X (6 cores, AVX2), Windows, OpenBLAS

Acly · Mar 20 '25