CUDA: add CONV_3D operator support
## Summary
Adds CUDA support for `GGML_OP_CONV_3D`, enabling full 3D convolution on NVIDIA GPUs with correct multi-dimensional indexing.
The implementation matches the CPU semantics exactly, including the fused channel dimensions and the `nb[]` byte-stride layout.
## Changes
- Added `conv3d.cu` and `conv3d.cuh` with the CUDA kernel and helpers
- Added a dispatch path in `ggml-cuda.cu` (see the sketch below)
- Updated operator registration in `ggml-cuda.cu`
- Updated `docs/ops.md` and `docs/ops/CUDA.csv` to include CONV_3D
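For context, the CUDA backend routes ops through a switch in `ggml-cuda.cu`; a minimal sketch of the new case is below. The helper name `ggml_cuda_op_conv3d` is an assumption for illustration, not necessarily the exact symbol in this PR.

```cuda
// Sketch only: wiring GGML_OP_CONV_3D into the CUDA backend's op switch in
// ggml-cuda.cu. The helper name ggml_cuda_op_conv3d is assumed for
// illustration; the actual symbol in the PR may differ.
static bool ggml_cuda_compute_forward(ggml_backend_cuda_context & ctx, struct ggml_tensor * dst) {
    switch (dst->op) {
        // ... existing cases ...
        case GGML_OP_CONV_3D:
            ggml_cuda_op_conv3d(ctx, dst); // implemented in conv3d.cu
            break;
        default:
            return false;
    }
    return true;
}
```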
## Implementation
- One CUDA thread per output element (batch × OC × OD × OH × OW)
- Correct fused-dimension addressing:
  - Input: `b * IC + ic`
  - Kernel: `oc * IC + ic`
  - Output: `b * OC + oc`
- Full `nb[]` stride-aware indexing matching the CPU layout (see the kernel sketch after this list)
- Supports F32 input/output and F16/F32 kernel weights
- Fully respects stride, padding, dilation, and the 3D spatial dimensions
- Follows the existing CUDA backend structure and coding conventions
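To make the addressing concrete, here is a minimal, simplified sketch of the one-thread-per-output-element scheme with fused channel planes and byte-stride indexing. It is not the PR's actual kernel: the parameter names, the F32-only weight path, the flat launch geometry, and the assumption of a contiguous innermost dimension (`nb0 == sizeof(float)`) are all simplifications.

```cuda
#include <cstdint>
#include <cstddef>

// Minimal sketch of a CONV_3D kernel: one thread per output element,
// fused channel dimensions, and byte-stride (nb[]) addressing analogous
// to the CPU path. Parameter names are illustrative.
__global__ void conv3d_f32_sketch(
        const float * src, const float * ker, float * dst,
        int IW, int IH, int ID, int IC,
        int KW, int KH, int KD,
        int OW, int OH, int OD, int OC, int N,
        int s0, int s1, int s2,                          // stride
        int p0, int p1, int p2,                          // padding
        int d0, int d1, int d2,                          // dilation
        size_t src_nb1, size_t src_nb2, size_t src_nb3,  // input byte strides
        size_t ker_nb1, size_t ker_nb2, size_t ker_nb3,  // kernel byte strides
        size_t dst_nb1, size_t dst_nb2, size_t dst_nb3)  // output byte strides
{
    const int64_t total = (int64_t) N * OC * OD * OH * OW;
    int64_t t = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= total) {
        return;
    }

    // Decompose the flat thread index into (b, oc, od, oh, ow).
    const int ow = t % OW; t /= OW;
    const int oh = t % OH; t /= OH;
    const int od = t % OD; t /= OD;
    const int oc = t % OC; t /= OC;
    const int b  = t;

    float acc = 0.0f;
    for (int ic = 0; ic < IC; ++ic) {
        // Fused channel dimensions: input plane b*IC + ic, kernel plane oc*IC + ic.
        const char * src_plane = (const char *) src + (int64_t)(b  * IC + ic) * src_nb3;
        const char * ker_plane = (const char *) ker + (int64_t)(oc * IC + ic) * ker_nb3;
        for (int kd = 0; kd < KD; ++kd) {
            const int id = od * s2 - p2 + kd * d2;
            if (id < 0 || id >= ID) continue;
            for (int kh = 0; kh < KH; ++kh) {
                const int ih = oh * s1 - p1 + kh * d1;
                if (ih < 0 || ih >= IH) continue;
                for (int kw = 0; kw < KW; ++kw) {
                    const int iw = ow * s0 - p0 + kw * d0;
                    if (iw < 0 || iw >= IW) continue;
                    // Byte-stride addressing; assumes nb0 == sizeof(float).
                    const float v = *(const float *)(src_plane + id * src_nb2 + ih * src_nb1 + iw * sizeof(float));
                    const float w = *(const float *)(ker_plane + kd * ker_nb2 + kh * ker_nb1 + kw * sizeof(float));
                    acc += v * w;
                }
            }
        }
    }

    // Fused output plane: b*OC + oc.
    char * dst_plane = (char *) dst + (int64_t)(b * OC + oc) * dst_nb3;
    *(float *)(dst_plane + od * dst_nb2 + oh * dst_nb1 + ow * sizeof(float)) = acc;
}
```

A launch would cover `total` threads, e.g. `conv3d_f32_sketch<<<(total + 255) / 256, 256>>>(...)`; the real kernel additionally needs the F16 weight path (e.g. via a type template) rather than this F32-only simplification.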
## Testing
- All CONV_3D backend tests pass for CUDA (F32/F16 kernels, all shapes)
- Numerical parity with CPU across all tested configurations
- No regressions in CUDA backend test suite
- Full backend test suite passes (no global regressions)
## Compatibility
- CUDA backend only
- CPU path unchanged
- No external dependencies added
- Preserves GGML tensor layout conventions
This PR is ready for review. Tagging @CISC and @slaren; your feedback would be greatly appreciated whenever you have the chance. Thanks for your work on maintaining and improving the CUDA backend!
Unfortunately, the reason no backends support CONV_3D is that `ggml_conv_3d` lowers to the IM2COL_3D op instead, so CONV_3D is effectively an unused op.
There is also https://github.com/ggml-org/llama.cpp/pull/16948. You can use the conv3d test program from that PR to compare performance.
Thanks for the review and the clarification!
If it makes sense for the project, I can follow up with a small PR that adds optional graph support for GGML_OP_CONV_3D. The idea would be:
- keep the existing IM2COL_3D + MUL_MAT lowering as the default path
- allow backends that explicitly report Conv3D support to receive a native Conv3D node (see the sketch below)
- measure the performance impact on CUDA (memory footprint, bandwidth, end-to-end time)
If the benchmarks show clear benefits for Conv3D on supported backends, then enabling the native path in more scenarios could be considered. Otherwise, the fallback path remains unchanged.
This keeps behavior stable while opening the door for backend-level optimizations, without committing the project to any change in graph lowering.
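For concreteness, a minimal sketch of the opt-in idea follows, built on the `supports_op` hook the backend already exposes. The function name follows the existing pattern in `ggml-cuda.cu`, but the exact type checks (taken from this PR's F32 data / F16-or-F32 weights support) and the graph-side wiring are assumptions, not a settled design.

```cuda
// Sketch of the opt-in: the backend advertises native CONV_3D through its
// existing supports_op hook, and graph lowering emits a native node only
// when this returns true. Type checks below are assumptions based on the
// kernel in this PR.
static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_CONV_3D:
            // src[0] = kernel weights, src[1] = input data.
            return (op->src[0]->type == GGML_TYPE_F16 || op->src[0]->type == GGML_TYPE_F32) &&
                    op->src[1]->type == GGML_TYPE_F32 &&
                    op->type         == GGML_TYPE_F32;
        // ... existing cases ...
        default:
            return false;
    }
}
```

With that in place, `ggml_conv_3d` could keep emitting the IM2COL_3D + MUL_MAT expansion by default and produce a single native CONV_3D node only where the capability check passes.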
Let me know if this direction sounds reasonable — happy to iterate.