
cuda : ggml_mul_mat assert for padded src1

Open · ggerganov opened this issue 1 year ago · 6 comments

Currently, the padded matrix multiplications in whisper.cpp are silently failing with CUDA:

https://github.com/ggerganov/ggml/blob/dbd02958fa4f46898f68ca29c27ddcdc58a06f98/examples/whisper/whisper.cpp#L224-L230

The reason is that the to_fp16_cuda and to_fp32_cuda calls assume no padding of the data. We can either assert that the data is not padded, or over-allocate a buffer accounting for the padding. The latter produces correct results, but is sub-optimal.
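For illustration, a minimal sketch of the assert option, placed where the CUDA backend converts src1 (the exact placement is an assumption; `GGML_ASSERT` and `ggml_is_contiguous` are existing ggml helpers):

```cpp
// hypothetical guard before to_fp16_cuda / to_fp32_cuda convert src1:
// a padded view is non-contiguous, so fail loudly instead of silently
// reading the padding bytes as if they were data
GGML_ASSERT(ggml_is_contiguous(src1) &&
        "mul_mat: padded (non-contiguous) src1 is not supported by the CUDA backend");
```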

Drafting this PR to brainstorm some potential solutions

ggerganov · Dec 29 '23 08:12

Is there any benefit to using ggml_mul_mat_pad with CUDA (and especially with cuBLAS)? This seems to me like it should be an implementation detail of the matrix multiplication algorithm.

slaren · Dec 29 '23 10:12

Can't test ggml_mul_mat_pad with CUDA because it would need some extra changes to to_fp16_cuda. Likely there will be no benefit in this case because cuBLAS probably makes the necessary optimizations internally. For Metal it helps because the mat-mat kernel requires dim 0 to be a multiple of 32.
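For reference, the idea behind the linked `ggml_mul_mat_pad` is to split dim 0 into a pad-aligned prefix and a remainder and sum the two products. A simplified 2D sketch (the whisper.cpp helper works on 3D views and also gates the trick on a minimum size, so details differ):

```cpp
#include "ggml.h"

// split the reduction dim (dim 0) into a pad-aligned prefix and a remainder;
// since ggml_mul_mat reduces over dim 0, adding the partial products is exact
static ggml_tensor * mul_mat_pad(ggml_context * ctx, ggml_tensor * x, ggml_tensor * y, int pad = 32) {
    if (x->ne[0] % pad == 0 || x->ne[0] < pad) {
        return ggml_mul_mat(ctx, x, y); // aligned (or too small to bother)
    }
    const int64_t n0 = (x->ne[0]/pad)*pad; // pad-aligned prefix of dim 0

    // the views keep the parent strides, so all four tensors below are
    // non-contiguous, which is exactly what trips up to_fp16_cuda
    ggml_tensor * x_0 = ggml_view_2d(ctx, x, n0,            x->ne[1], x->nb[1], 0);
    ggml_tensor * x_1 = ggml_view_2d(ctx, x, x->ne[0] - n0, x->ne[1], x->nb[1], n0*x->nb[0]);
    ggml_tensor * y_0 = ggml_view_2d(ctx, y, n0,            y->ne[1], y->nb[1], 0);
    ggml_tensor * y_1 = ggml_view_2d(ctx, y, y->ne[0] - n0, y->ne[1], y->nb[1], n0*y->nb[0]);

    // the aligned prefix can use the fast mat-mat kernel; the small remainder is added back
    return ggml_add(ctx, ggml_mul_mat(ctx, x_0, y_0), ggml_mul_mat(ctx, x_1, y_1));
}
```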

But regardless, the issue remains if users try to perform matrix multiplications with padded views.
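To make that failure mode concrete, a minimal sketch of such a padded view hitting ggml_mul_mat (shapes arbitrary; `build_padded_mul_mat` is a hypothetical helper):

```cpp
#include "ggml.h"

// given an initialized context, build a mat-mul whose src1 is a padded view
static ggml_tensor * build_padded_mul_mat(ggml_context * ctx) {
    ggml_tensor * a    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 60, 64);
    ggml_tensor * b    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 32);
    // keep 60 of the 64 elements of dim 0 but inherit the parent row stride:
    // every row of the view now carries 4 elements of padding
    ggml_tensor * b_nc = ggml_view_2d(ctx, b, 60, 32, b->nb[1], 0);
    // on the CUDA backend, to_fp16_cuda / to_fp32_cuda treat b_nc as contiguous
    // and read the padding bytes as data -> silently wrong results
    return ggml_mul_mat(ctx, a, b_nc); // a: src0, b_nc: padded src1
}
```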

ggerganov · Dec 29 '23 11:12

I think that in general there is very poor support for any kind of non-contiguous tensors in ggml-cuda. We should definitely add more checks and improve support, but I am afraid that this is just the tip of the iceberg. We could extend test-backend-ops to automatically run tests with non-contiguous tensors for all the operations, so that we have a better overall picture of what needs to be fixed.
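For illustration, such a case could follow the existing `test_case` pattern in `tests/test-backend-ops.cpp`; this fragment is only a sketch of the shape (the real harness also wants `vars()`, parameterized types, etc., and the names here are assumptions):

```cpp
// hypothetical MUL_MAT test with a non-contiguous src1, meant to slot into
// tests/test-backend-ops.cpp next to the existing mat-mul cases
struct test_mul_mat_nc : public test_case {
    ggml_tensor * build_graph(ggml_context * ctx) override {
        ggml_tensor * a    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 60, 64);
        ggml_tensor * b    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 32);
        // view 60 of 64 elements of dim 0 while keeping the parent stride -> non-contiguous
        ggml_tensor * b_nc = ggml_view_2d(ctx, b, 60, 32, b->nb[1], 0);
        return ggml_mul_mat(ctx, a, b_nc);
    }
};
```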

slaren · Dec 29 '23 11:12

At the moment I am working on moving llama.cpp fully to ggml-backend with ggml_backend_sched for partial offloading, and bringing everything up to feature parity with the old CUDA API. After that, we can remove a lot of code, globals, and generally do a lot of refactoring of the CUDA backend.

slaren · Dec 29 '23 11:12

Sounds great - no rush on this padding problem

ggerganov · Dec 29 '23 12:12

I think the Metal backend could benefit from expanded matrix multiplication support, significantly improving performance in some cases, such as the one described in this comment.

FSSRepo · Dec 29 '23 16:12