CUDA: add CONV_3D operator support
## Summary
Adds CUDA support for `GGML_OP_CONV_3D`, enabling full 3D convolution on NVIDIA GPUs with correct multi-dimensional indexing.
The implementation matches the CPU semantics exactly, including the fused channel dimensions and the `nb[]` byte-stride layout.
## Changes
- Added `conv3d.cu` and `conv3d.cuh` with the CUDA kernel and helpers
- Added a dispatch path in `ggml-cuda.cu` (see the sketch below)
- Updated operator registration in `ggml-cuda.cu`
- Updated `docs/ops.md` and `docs/ops/CUDA.csv` to include CONV_3D
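For context, the CUDA backend routes ops through a switch in `ggml-cuda.cu`; a minimal sketch of the new case is below. The helper name `ggml_cuda_op_conv3d` is an assumption for illustration, not necessarily the exact symbol in this PR.

```cuda
// Sketch only: wiring GGML_OP_CONV_3D into the CUDA backend's op switch in
// ggml-cuda.cu. The helper name ggml_cuda_op_conv3d is assumed for
// illustration; the actual symbol in the PR may differ.
static bool ggml_cuda_compute_forward(ggml_backend_cuda_context & ctx, struct ggml_tensor * dst) {
    switch (dst->op) {
        // ... existing cases ...
        case GGML_OP_CONV_3D:
            ggml_cuda_op_conv3d(ctx, dst); // implemented in conv3d.cu
            break;
        default:
            return false;
    }
    return true;
}
```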
## Implementation
- One CUDA thread per output element (batch × OC × OD × OH × OW)
- Correct fused-dimension addressing:
  - Input: `b * IC + ic`
  - Kernel: `oc * IC + ic`
  - Output: `b * OC + oc`
- Full `nb[]` stride-aware indexing matching the CPU layout (see the kernel sketch after this list)
- Supports F32 input/output and F16/F32 kernel weights
- Fully respects stride, padding, dilation, and the 3D spatial dimensions
- Follows the existing CUDA backend structure and coding conventions
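To make the addressing concrete, here is a minimal, simplified sketch of the one-thread-per-output-element scheme with fused channel planes and byte-stride indexing. It is not the PR's actual kernel: the parameter names, the F32-only weight path, the flat launch geometry, and the assumption of a contiguous innermost dimension (`nb0 == sizeof(float)`) are all simplifications.

```cuda
#include <cstdint>
#include <cstddef>

// Minimal sketch of a CONV_3D kernel: one thread per output element,
// fused channel dimensions, and byte-stride (nb[]) addressing analogous
// to the CPU path. Parameter names are illustrative.
__global__ void conv3d_f32_sketch(
        const float * src, const float * ker, float * dst,
        int IW, int IH, int ID, int IC,
        int KW, int KH, int KD,
        int OW, int OH, int OD, int OC, int N,
        int s0, int s1, int s2,                          // stride
        int p0, int p1, int p2,                          // padding
        int d0, int d1, int d2,                          // dilation
        size_t src_nb1, size_t src_nb2, size_t src_nb3,  // input byte strides
        size_t ker_nb1, size_t ker_nb2, size_t ker_nb3,  // kernel byte strides
        size_t dst_nb1, size_t dst_nb2, size_t dst_nb3)  // output byte strides
{
    const int64_t total = (int64_t) N * OC * OD * OH * OW;
    int64_t t = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= total) {
        return;
    }

    // Decompose the flat thread index into (b, oc, od, oh, ow).
    const int ow = t % OW; t /= OW;
    const int oh = t % OH; t /= OH;
    const int od = t % OD; t /= OD;
    const int oc = t % OC; t /= OC;
    const int b  = t;

    float acc = 0.0f;
    for (int ic = 0; ic < IC; ++ic) {
        // Fused channel dimensions: input plane b*IC + ic, kernel plane oc*IC + ic.
        const char * src_plane = (const char *) src + (int64_t)(b  * IC + ic) * src_nb3;
        const char * ker_plane = (const char *) ker + (int64_t)(oc * IC + ic) * ker_nb3;
        for (int kd = 0; kd < KD; ++kd) {
            const int id = od * s2 - p2 + kd * d2;
            if (id < 0 || id >= ID) continue;
            for (int kh = 0; kh < KH; ++kh) {
                const int ih = oh * s1 - p1 + kh * d1;
                if (ih < 0 || ih >= IH) continue;
                for (int kw = 0; kw < KW; ++kw) {
                    const int iw = ow * s0 - p0 + kw * d0;
                    if (iw < 0 || iw >= IW) continue;
                    // Byte-stride addressing; assumes nb0 == sizeof(float).
                    const float v = *(const float *)(src_plane + id * src_nb2 + ih * src_nb1 + iw * sizeof(float));
                    const float w = *(const float *)(ker_plane + kd * ker_nb2 + kh * ker_nb1 + kw * sizeof(float));
                    acc += v * w;
                }
            }
        }
    }

    // Fused output plane: b*OC + oc.
    char * dst_plane = (char *) dst + (int64_t)(b * OC + oc) * dst_nb3;
    *(float *)(dst_plane + od * dst_nb2 + oh * dst_nb1 + ow * sizeof(float)) = acc;
}
```

A launch would cover `total` threads, e.g. `conv3d_f32_sketch<<<(total + 255) / 256, 256>>>(...)`; the real kernel additionally needs the F16 weight path (e.g. via a type template) rather than this F32-only simplification.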
## Testing
- All CONV_3D backend tests pass for CUDA (F32/F16 kernels, all shapes)
- Numerical parity with CPU across all tested configurations
- No regressions in CUDA backend test suite
- Full backend test suite passes (no global regressions)
## Compatibility
- CUDA backend only
- CPU path unchanged
- No external dependencies added
- Preserves GGML tensor layout conventions
This PR is ready for review. Tagging @CISC and @slaren; your feedback would be greatly appreciated whenever you have the chance. Thanks for your work on maintaining and improving the CUDA backend!
Unfortunately, the reason no backends support CONV_3D is that `ggml_conv_3d` lowers to the IM2COL_3D op instead, so CONV_3D is effectively an unused op.
There is also https://github.com/ggml-org/llama.cpp/pull/16948. You can use the conv3d test program from that PR to compare performance.
Thanks for the review and the clarification!
If it makes sense for the project, I can follow up with a small PR that adds optional graph support for GGML_OP_CONV_3D. The idea would be:
- keep the existing IM2COL_3D + MUL_MAT lowering as the default path
- allow backends that explicitly report Conv3D support to receive a native Conv3D node (see the sketch below)
- measure the performance impact on CUDA (memory footprint, bandwidth, end-to-end time)
If the benchmarks show clear benefits for Conv3D on supported backends, then enabling the native path in more scenarios could be considered. Otherwise, the fallback path remains unchanged.
This keeps behavior stable while opening the door for backend-level optimizations, without committing the project to any change in graph lowering.
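For concreteness, a minimal sketch of the opt-in idea follows, built on the `supports_op` hook the backend already exposes. The function name follows the existing pattern in `ggml-cuda.cu`, but the exact type checks (taken from this PR's F32 data / F16-or-F32 weights support) and the graph-side wiring are assumptions, not a settled design.

```cuda
// Sketch of the opt-in: the backend advertises native CONV_3D through its
// existing supports_op hook, and graph lowering emits a native node only
// when this returns true. Type checks below are assumptions based on the
// kernel in this PR.
static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_CONV_3D:
            // src[0] = kernel weights, src[1] = input data.
            return (op->src[0]->type == GGML_TYPE_F16 || op->src[0]->type == GGML_TYPE_F32) &&
                    op->src[1]->type == GGML_TYPE_F32 &&
                    op->type         == GGML_TYPE_F32;
        // ... existing cases ...
        default:
            return false;
    }
}
```

With that in place, `ggml_conv_3d` could keep emitting the IM2COL_3D + MUL_MAT expansion by default and produce a single native CONV_3D node only where the capability check passes.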
Let me know if this direction sounds reasonable — happy to iterate.