
CUDA & CPU: support F32 kernel type for `CONV_TRANSPOSE_2D`

Open AgainstEntropy opened this issue 1 month ago • 4 comments

Also updated the test case in test-backend-ops. But since the F32 kernel type is not yet supported on the CPU backend, only `GGML_TYPE_F16` is kept for now; `GGML_TYPE_F32` is commented out and can be re-enabled in the future.

AgainstEntropy avatar Nov 08 '25 02:11 AgainstEntropy

Does this PR make a difference to anything? From what I understand, the kernel value is upcast to float before any accumulation (and accumulation is in f32 anyway). So unless there are kernels around that don't fit into f16, I don't see a benefit to supporting this, especially when we don't yet support f16 inputs (which, incidentally, might be more relevant than f32 kernels, since we could potentially do half2 multiplications).
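To make the upcast-then-accumulate pattern concrete, here is a minimal CUDA sketch (names and shapes are illustrative, not the actual ggml kernel):

```cuda
#include <cuda_fp16.h>

// Minimal sketch of the pattern described above, not the actual ggml kernel:
// each f16 weight is upcast to f32 per element, and the accumulator is f32
// regardless of the kernel's storage type. `out` must be zero-initialized.
__global__ void dot_f16_weights_f32_acc(const half * kernel, const float * input,
                                        float * out, int n) {
    float acc = 0.0f;  // accumulation happens in f32
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        acc += __half2float(kernel[i]) * input[i];  // upcast weight before multiply
    }
    atomicAdd(out, acc);  // reduce per-thread partial sums
}
```

The half2 remark refers to the case where the inputs are also f16: with both operands in half precision, pairs of values could be multiplied together in a single half2 operation.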

So the motivations of this PR are:

  1. Currently, `ggml_backend_cuda_device_supports_op` always returns `true` for `GGML_OP_CONV_TRANSPOSE_2D` without checking the kernel type, which can cause crashes when the op is actually computed. This PR fixes that mismatch (a minimal sketch of the intended check follows this list).

    https://github.com/ggml-org/llama.cpp/blob/8e878f0cb4c893de23455dd0a6bfbbb21bcaad89/ggml/src/ggml-cuda/ggml-cuda.cu#L4061-L4064

  2. Some recent models are natively BF16, and using an F16 kernel can lead to overflows, since F16 tops out at 65504 while BF16 shares F32's exponent range. F32 is safe here and can readily be used for precision verification.
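A minimal sketch of the check from point 1, assuming ggml's convention that `src[0]` holds the kernel; the standalone helper name is hypothetical (the actual change edits the switch inside `ggml_backend_cuda_device_supports_op` linked above):

```cuda
#include "ggml.h"

// Hypothetical helper illustrating the fix: only report support for kernel
// types the CUDA CONV_TRANSPOSE_2D path actually implements, rather than
// returning true unconditionally.
static bool cuda_supports_conv_transpose_2d(const struct ggml_tensor * op) {
    const enum ggml_type kernel_type = op->src[0]->type;  // src0 is the kernel
    return kernel_type == GGML_TYPE_F16 || kernel_type == GGML_TYPE_F32;
}
```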

AgainstEntropy avatar Nov 12 '25 18:11 AgainstEntropy

> So the motivations of this PR are:
>
> 1. Currently, `ggml_backend_cuda_device_supports_op` always returns `true` for `GGML_OP_CONV_TRANSPOSE_2D` without checking the kernel type, which can cause crashes when the op is actually computed. This PR fixes that mismatch.

That's because it matches the CPU capabilities exactly.

> 2. Some recent models are natively BF16, and using an F16 kernel can lead to overflows. F32 is safe here and can readily be used for precision verification.

That would be a problem in the conversion to GGUF, not necessarily a problem to be solved here.

am17an avatar Nov 13 '25 00:11 am17an

You should add the CPU version for the f32 kernel too; that way this PR makes more sense.

am17an avatar Nov 13 '25 00:11 am17an

Hi @am17an, thanks for reviewing this PR.

Here’s what has been updated:

  1. Simplified the CUDA kernel dispatch logic.
  2. Renamed `type_kernel` to `kernel_type`.
  3. Introduced a templated `ggml_compute_forward_conv_2d_transpose_impl` to reduce duplication (sketched below).
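For reference, a rough single-channel sketch of the templating idea in item 3, under the assumption that only the kernel-element-to-f32 conversion differs between the two paths; the real `ggml_compute_forward_conv_2d_transpose_impl` additionally handles channels, ggml tensor strides, and threading:

```cuda
#include "ggml.h"  // ggml_fp16_t; GGML_FP16_TO_FP32 may live in ggml's internal headers

// Overloads so the templated body below is agnostic to the kernel storage type.
static inline float kernel_to_f32(float v)       { return v; }
static inline float kernel_to_f32(ggml_fp16_t v) { return GGML_FP16_TO_FP32(v); }

// Naive single-channel transposed conv: scatter each input pixel into dst.
// dst is assumed zero-initialized with size ((iw-1)*sw + kw) * ((ih-1)*sh + kh).
template <typename kernel_t>
static void conv_transpose_2d_impl(const kernel_t * kernel, const float * src, float * dst,
                                   int kw, int kh, int sw, int sh, int iw, int ih) {
    const int ow = (iw - 1) * sw + kw;  // output width for stride (sw, sh)
    for (int iy = 0; iy < ih; ++iy) {
        for (int ix = 0; ix < iw; ++ix) {
            const float v = src[iy * iw + ix];
            for (int ky = 0; ky < kh; ++ky) {
                for (int kx = 0; kx < kw; ++kx) {
                    // accumulate in f32 regardless of the kernel storage type
                    dst[(iy * sh + ky) * ow + (ix * sw + kx)] +=
                        v * kernel_to_f32(kernel[ky * kw + kx]);
                }
            }
        }
    }
}
```

Keeping the accumulator in f32 for both instantiations means the f16 and f32 paths differ only in how a weight is loaded, which is what makes the single templated body possible.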

Please let me know if there’s anything else you’d like changed.

AgainstEntropy avatar Nov 14 '25 21:11 AgainstEntropy