Try to use quantized `ggml_mul_mat` in attention layer
The following two matrix multiplication calls still remain in FP16 precision:
- https://github.com/ggerganov/llama.cpp/blob/d40fded93e1a533e969768e1e335c15c61c296ce/llama.cpp#L1135-L1137
- https://github.com/ggerganov/llama.cpp/blob/d40fded93e1a533e969768e1e335c15c61c296ce/llama.cpp#L1158-L1160
I was wondering whether there would be any benefit if we quantized those on-the-fly.
The quantization can be done with an extra `ggml_cpy()` call before the `ggml_mul_mat()` call.
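For illustration, here is a minimal sketch of what this could look like for the first of the two calls. The tensor names `K`, `Q`, and `ctx0` mirror the surrounding llama.cpp attention code; the choice of `GGML_TYPE_Q4_0` and the 2D shape are assumptions for the sake of the example, not a definitive implementation:

```c
// Hypothetical sketch: quantize K on-the-fly before the K*Q mul_mat.
// GGML_TYPE_Q4_0 is just one possible quantization type to try here.
struct ggml_tensor * K_q = ggml_new_tensor_2d(ctx0, GGML_TYPE_Q4_0, K->ne[0], K->ne[1]);

// ggml_cpy() converts between types while copying, so this adds the
// quantization as an extra node in the compute graph.
K_q = ggml_cpy(ctx0, K, K_q);

// use the quantized copy in place of the FP16 tensor:
struct ggml_tensor * KQ = ggml_mul_mat(ctx0, K_q, Q);
```

The second call could be handled the same way, quantizing the other FP16 operand before its `ggml_mul_mat()`.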
See if this speeds up the computation and how it affects perplexity.