cuda : ggml_mul_mat assert for padded src1
Currently, the padded matrix multiplications in whisper.cpp are silently failing with CUDA:
https://github.com/ggerganov/ggml/blob/dbd02958fa4f46898f68ca29c27ddcdc58a06f98/examples/whisper/whisper.cpp#L224-L230
The reason is that the `to_fp16_cuda` and `to_fp32_cuda` calls assume the data is not padded. We can either assert that the data is not padded, or over-allocate a buffer to account for the padding. The latter produces correct results, but is sub-optimal.
Drafting this PR to brainstorm some potential solutions.