
llama : quantize attention results

Open · ggerganov opened this issue 1 year ago · 0 comments

ref #1098

Here we re-quantize the F16 intermediate results in the attention layer. This way, all matrix multiplications in the transformer become quantized.

Putting this here just for reference. Haven't played with it to see if the speed actually improves and how much the results are affected. Probably not worth it.
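For illustration, a minimal sketch of the idea using the ggml graph API (this is not the actual patch, and the function/tensor names are made up for the example): the current K rows are copied into a tensor of a quantized type, so the copy performs the quantization and the subsequent K*Q matmul goes through the quantized kernels instead of the F16 path.

```c
#include "ggml.h"

// Illustrative only: builds the graph nodes for a quantized K*Q product.
// Assumes n_embd is a multiple of the Q4_0 block size (32), which holds
// for the usual LLaMA embedding sizes.
static struct ggml_tensor * attn_scores_quantized(
        struct ggml_context * ctx,
        struct ggml_tensor  * kcur,   // current K rows, F32, [n_embd, N]
        struct ggml_tensor  * q) {    // query,          F32, [n_embd, N]

    // destination tensor uses a quantized type instead of F16
    struct ggml_tensor * k_q4 = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0,
                                                   kcur->ne[0], kcur->ne[1]);

    // ggml_cpy quantizes the F32 intermediate result into Q4_0
    struct ggml_tensor * k = ggml_cpy(ctx, kcur, k_q4);

    // quantized src0 x F32 src1: this matmul now runs through the
    // quantized matmul path
    return ggml_mul_mat(ctx, k, q);
}
```

The same pattern would apply to the V cache and the KQV product.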

Also, we can quantize V only when (n_past + N) % QK == 0, so most of the iterations will keep using F16 precision.
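QK here is the quantization block size (32 for Q4_0 in ggml): since the V values accumulate along the context dimension, a quantization block is only complete when the total token count lands exactly on a block boundary. A trivial sketch of the check (names are illustrative):

```c
#include <stdbool.h>

// QK is the quantization block size (32 for Q4_0 in ggml)
#define QK 32

// True only when the accumulated context length ends exactly on a block
// boundary; on all other iterations the V values stay in F16.
static bool can_quantize_v(int n_past, int N) {
    return (n_past + N) % QK == 0;
}
```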

ggerganov · Apr 21 '23 14:04