ggml : extend ggml_mul_mat to support non-F32 input for parameter `b`
Currently, we always pass `b` to `ggml_mul_mat` as F32 and internally quantize it depending on the type of `a`. There is no option to pass an already quantized `b`.
The primary goal of this task is to add such an option. For more info, see: https://github.com/ggerganov/llama.cpp/pull/2615#issuecomment-1680270900
The primary focus will be `ggml_mul_mat`, but we can also think about a more general approach for the rest of the operators. For example, `ggml_mul` currently also works with only F32 input, which prevents having 1D F16 norm tensors. This is not a huge drawback since these tensors are usually small, but it would be nice to support F16 as well.
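Supporting F16 operands in an op like `ggml_mul` mostly comes down to converting between F32 and F16 around the arithmetic (ggml has its own `ggml_fp32_to_fp16` / `ggml_fp16_to_fp32` helpers for this). As a rough, self-contained illustration of that conversion, here is a simplified bit-level roundtrip that handles normal values only (no NaN/Inf/subnormal care, truncating rather than rounding the mantissa) and is not ggml's actual implementation:

```c
#include <stdint.h>
#include <string.h>

// F32 -> F16, simplified: normal values only, mantissa truncated.
static uint16_t f32_to_f16(float f) {
    uint32_t x; memcpy(&x, &f, sizeof x);
    const uint32_t sign = (x >> 16) & 0x8000u;
    const int32_t  exp  = (int32_t)((x >> 23) & 0xFF) - 127 + 15;
    const uint32_t mant = (x >> 13) & 0x3FFu;
    if (exp <= 0)  return (uint16_t) sign;            // flush tiny values to zero
    if (exp >= 31) return (uint16_t)(sign | 0x7C00u); // overflow -> infinity
    return (uint16_t)(sign | ((uint32_t)exp << 10) | mant);
}

// F16 -> F32, simplified in the same way.
static float f16_to_f32(uint16_t h) {
    const uint32_t sign = ((uint32_t)h & 0x8000u) << 16;
    const int32_t  exp  = (h >> 10) & 0x1F;
    const uint32_t mant = (uint32_t)h & 0x3FFu;
    uint32_t x = sign;
    if (exp != 0) {
        x |= ((uint32_t)(exp - 15 + 127) << 23) | (mant << 13);
    }
    float f; memcpy(&f, &x, sizeof f);
    return f;
}
```

An F16-aware `ggml_mul` kernel would apply `f16_to_f32` per element, multiply in F32, and store back with `f32_to_f16` (or use native F16 arithmetic where the hardware provides it).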
Additionally, we can extend `ggml` with parameters that control the implicit quantizations, i.e. disable / enable / change types, etc. This is a secondary objective, and it is not 100% clear how it would work from an API point of view.
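Purely as a hypothetical, non-runnable sketch of what such parameters could look like (every name below is invented here and nothing like it exists in ggml; this is one possible direction, not a proposal from the issue):

```c
// Hypothetical API sketch -- all names invented for illustration.
enum ggml_implicit_quant {
    GGML_IMPLICIT_QUANT_DEFAULT, // current behavior: pick based on the type of `a`
    GGML_IMPLICIT_QUANT_NONE,    // disable: keep `b` in F32
    GGML_IMPLICIT_QUANT_FORCED,  // force a specific quantization type
};

// Per-context setting consulted by ops that quantize operands implicitly.
void ggml_set_implicit_quant(struct ggml_context * ctx,
                             enum ggml_implicit_quant mode,
                             enum ggml_type           forced_type);
```

A per-context setting is only one option; a per-op parameter or a per-tensor flag would be alternatives, each with different API trade-offs.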