ggml : extend ggml_mul_mat to support non-F32 input for parameter `b`
Currently, we always pass b to ggml_mul_mat as F32 and internally quantize it depending on the type of a. There is no option to pass an already quantized b.
The primary goal of this task is to add such an option. For more info, see: https://github.com/ggerganov/llama.cpp/pull/2615#issuecomment-1680270900
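To make the goal concrete, here is a hedged sketch of the intended usage (the pre-quantized path is an assumption about the new option, not current behavior, and the helper name is made up):

```c
#include "ggml.h"

// Convert b once to the vec_dot type that the Q4_0 kernel expects
// (Q8_0) and pass it directly, instead of ggml_mul_mat re-quantizing
// the F32 b internally on every graph evaluation.
static struct ggml_tensor * mul_mat_prequantized(struct ggml_context * ctx,
                                                 struct ggml_tensor  * a,   // e.g. Q4_0 weights
                                                 struct ggml_tensor  * b) { // F32 activations
    struct ggml_tensor * b_q = ggml_new_tensor(ctx, GGML_TYPE_Q8_0, b->n_dims, b->ne);
    b_q = ggml_cpy(ctx, b, b_q); // one-time conversion

    // with the proposed option this would be accepted; today b is expected to be F32
    return ggml_mul_mat(ctx, a, b_q);
}
```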
The primary focus will be ggml_mul_mat, but we can also think about a more general approach for the rest of the operators. For example, ggml_mul currently also works with just F32 input, which prevents us from having 1D F16 norm tensors. This is not a huge drawback since these tensors are usually small, but it would be nice to also support F16.
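As an illustration of the ggml_mul limitation, here is a hedged sketch of what F16 support would enable (the helper and shapes are made up for the example):

```c
#include "ggml.h"

// Apply a 1D F16 norm weight to F32 activations directly, without
// storing or converting the weight as F32.
static struct ggml_tensor * apply_norm_weight(struct ggml_context * ctx,
                                              struct ggml_tensor  * cur) { // F32 activations
    struct ggml_tensor * norm = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, cur->ne[0]);
    return ggml_mul(ctx, cur, norm); // today ggml_mul requires F32 operands
}
```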
Additionally, we can extend ggml with parameters that control the implicit quantizations, i.e. disable / enable / change types, etc. This is a secondary objective, and I'm not 100% sure how it would work from an API POV.
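Just to make the API question concrete, one possible shape could be a per-context setter; the declaration below is purely hypothetical, nothing like it exists in ggml today:

```c
#include <stdbool.h>
#include "ggml.h"

// Hypothetical control for the implicit conversion of b in ggml_mul_mat:
// enable/disable it and optionally override the target type (e.g. with
// GGML_TYPE_COUNT meaning "use the default vec_dot_type of a").
void ggml_set_implicit_quantization(
        struct ggml_context * ctx,
        bool                  enable,
        enum   ggml_type      override_type);
```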
I guess this is the reason why 1D tensors need to be stored as F32 in the model files.
This also seems to be needed for PR https://github.com/ggerganov/llama.cpp/pull/2632
> I guess this is the reason why 1D tensors need to be stored as F32 in the model files.
Not exactly - it is a similar problem, but it is related to ggml_mul, not ggml_mul_mat.
I've expanded the issue description to mention this and we can try to fix this within the scope of this task.
> Additionally, we can extend ggml with parameters that control the implicit quantizations, i.e. disable / enable / change types, etc. This is a secondary objective, and I'm not 100% sure how it would work from an API POV.
We could add a flag to ggml_context to control whether to automatically quantize b in ggml_mul_mat. Essentially, we would add this to ggml_mul_mat:
```c
if (ctx->quantize_matmul && b->type != type_traits[a->type].vec_dot_type) {
    // convert b to the type that the vec_dot kernel of a expects
    struct ggml_tensor * tmp_b = ggml_new_tensor(ctx, type_traits[a->type].vec_dot_type, b->n_dims, b->ne);
    b = ggml_cpy(ctx, b, tmp_b);
}
```
The disadvantage is that it may add a bit of overhead by increasing the size of the graphs, but it shouldn't be too bad. To avoid this, we could set a flag in op_params instead, and let each backend handle the conversion, but that doesn't solve the memory management issue.
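For reference, a hedged sketch of that op_params variant (the parameter index and flag value are assumptions for illustration):

```c
#include "ggml.h"

// Mark the mul_mat node itself and let each backend convert b when it
// executes the op, instead of adding a conversion node to the graph.
static struct ggml_tensor * mul_mat_mark_quantize_b(struct ggml_context * ctx,
                                                    struct ggml_tensor  * a,
                                                    struct ggml_tensor  * b) {
    struct ggml_tensor * result = ggml_mul_mat(ctx, a, b);
    result->op_params[0] = 1; // hypothetical flag: backend converts b to vec_dot_type
    return result;
}
```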
Somewhat related to this, it would be good to have a ggml_convert(src, new_type) or similar that is equivalent to ggml_cpy(src, ggml_new_tensor(new_type, 4, src->ne)), but only creates one tensor. Currently, converting a tensor to a different type requires adding two tensors to the context, and this may become a non-negligible overhead if most matrix multiplications require one or two conversions. It would also make for a slightly more intuitive API, since currently it is not obvious that ggml_cpy can be used to convert a tensor to a different type.
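A hedged sketch of how such a helper could look (this is the proposal above, not an existing ggml function; it assumes the GGML_OP_CPY compute path only reads src[0] and writes into the node itself):

```c
#include "ggml.h"

// Add a single tensor of the target type and reuse the GGML_OP_CPY
// kernel to fill it from src.
static struct ggml_tensor * ggml_convert(struct ggml_context * ctx,
                                         struct ggml_tensor  * src,
                                         enum   ggml_type      new_type) {
    struct ggml_tensor * result = ggml_new_tensor(ctx, new_type, src->n_dims, src->ne);

    result->op     = GGML_OP_CPY;
    result->src[0] = src;
    // unlike ggml_cpy, there is no separate destination tensor:
    // the result itself is the destination

    return result;
}
```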