llama.cpp
llama : quantize attention results
ref #1098
Here we re-quantize the F16 intermediate results in the attention layer, so that all matrix multiplications in the transformer become quantized.
Putting this here just for reference. I haven't measured whether the speed actually improves or how much the results are affected. Probably not worth it.
Also, we can quantize V only when `(n_past + N) % QK == 0`, so most of the iterations will remain in F16 precision.