
Try to use quantized `ggml_mul_mat` in attention layer

ggerganov opened this issue on Apr 21, 2023 · 0 comments

The following 2 matrix multiplication calls still remain in FP16 precision:

  • https://github.com/ggerganov/llama.cpp/blob/d40fded93e1a533e969768e1e335c15c61c296ce/llama.cpp#L1135-L1137
  • https://github.com/ggerganov/llama.cpp/blob/d40fded93e1a533e969768e1e335c15c61c296ce/llama.cpp#L1158-L1160

I was wondering whether there would be any benefit if we quantized those on-the-fly. The quantization can be done with an extra `ggml_cpy()` call before the `ggml_mul_mat()` call.
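For illustration, a minimal sketch of the idea, assuming the ggml graph-building API (`ggml_new_tensor_2d`, `ggml_cpy`, `ggml_mul_mat`) and hypothetical 2D tensors `K` and `Q` standing in for the attention operands at the linked call sites (the real tensors in llama.cpp are 3D, one slice per head). `ggml_cpy()` converts between tensor types on copy, which is what performs the quantization here:

```c
// sketch only, not the actual llama.cpp code

// before: mat-mul directly on the FP16 tensor
// struct ggml_tensor * KQ = ggml_mul_mat(ctx0, K, Q);

// after: insert a quantizing copy of K ahead of the mat-mul;
// copying FP16 -> Q4_0 quantizes the data on-the-fly
struct ggml_tensor * K_q4 = ggml_new_tensor_2d(ctx0, GGML_TYPE_Q4_0,
                                               K->ne[0], K->ne[1]);
K_q4 = ggml_cpy(ctx0, K, K_q4);

struct ggml_tensor * KQ = ggml_mul_mat(ctx0, K_q4, Q);
```

Note that quantized types impose a block-size constraint: the row size `ne[0]` has to be a multiple of the quantization block size (32 for Q4_0).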

See if this speeds up the computation and how it affects perplexity.
