llama.cpp
ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot
This PR introduces support for SVE (Scalable Vector Extension) kernels for the q3_K_q8_K vector dot on the Arm architecture. Similar SVE support was proposed in PRs https://github.com/ggerganov/llama.cpp/pull/7433 and https://github.com/ggml-org/llama.cpp/pull/11227.
This PR contains the SVE implementation of the vector dot product used for the Q3_K quantization. Accuracy and performance were measured by running a Q3_K-quantized mistral-7b-v01 model on Graviton 3 (Perf 01 XL).
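For context, kernels like this are built around the SVE SDOT instruction. The sketch below is illustrative only and is not the PR's code: the real ggml_vec_dot_q3_K_q8_K kernel additionally unpacks the 3-bit Q3_K weights and applies the per-block scales. It shows the generic SVE int8 dot-product pattern with a predicated tail.

```c
// Minimal sketch of an SVE int8 dot product (NOT the actual PR kernel).
// Compile on an SVE-capable toolchain, e.g. gcc -O2 -march=armv8-a+sve.
#include <arm_sve.h>
#include <stdint.h>

int32_t sve_dot_i8(const int8_t * x, const int8_t * y, int n) {
    svint32_t acc = svdup_n_s32(0);
    const int vl = (int) svcntb();        // int8 lanes per vector, known at runtime
    int i = 0;
    for (; i + vl <= n; i += vl) {
        const svint8_t vx = svld1_s8(svptrue_b8(), x + i);
        const svint8_t vy = svld1_s8(svptrue_b8(), y + i);
        acc = svdot_s32(acc, vx, vy);     // 4-way int8 SDOT into int32 lanes
    }
    // predicated tail: inactive lanes load as zero and contribute nothing
    const svbool_t pg = svwhilelt_b8_s32(i, n);
    const svint8_t vx = svld1_s8(pg, x + i);
    const svint8_t vy = svld1_s8(pg, y + i);
    acc = svdot_s32(acc, vx, vy);
    return svaddv_s32(svptrue_b32(), acc);
}
```

One appeal of this pattern is that it is vector-length agnostic: svcntb() reports the lane count at runtime, so the same binary runs on 256-bit SVE hardware such as Graviton 3 as well as on other vector widths.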
Performance
With this PR, the SVE implementation is roughly 1.02x to 1.15x faster than the NEON implementation, depending on the thread count.
- Decoding throughput (text generation), in tokens per second; higher is better

Threads | NEON (original) | SVE (this PR) | Ratio (SVE/NEON) |
---|---|---|---|
2 | 4.21 | 4.86 | 1.15 |
4 | 8.26 | 9.37 | 1.13 |
8 | 15.90 | 17.49 | 1.10 |
16 | 29.09 | 31.05 | 1.06 |
32 | 42.59 | 43.80 | 1.03 |
48 | 48.36 | 49.41 | 1.02 |
The command used to measure the performance:

./llama-bench -m ${PATH_TO_MODEL} -n 0 -n 16 -p 64 -t 2,4,8,16,32,48

where -m points to the model, -p sets the prompt length, -n the number of generated tokens, and -t the list of thread counts.
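Since the SVE path only runs on SVE-capable hardware such as Graviton 3, it can be worth confirming support before benchmarking. The following standalone check is a sketch, independent of ggml's own feature detection; it reads the Linux/aarch64 hwcaps, where HWCAP_SVE (bit 22) indicates SVE support.

```c
// Sketch: report whether the CPU exposes SVE (Linux/aarch64 only).
#include <stdio.h>
#include <sys/auxv.h>
#ifndef HWCAP_SVE
#define HWCAP_SVE (1UL << 22)   // aarch64 hwcap bit for SVE
#endif

int main(void) {
    const unsigned long hw = getauxval(AT_HWCAP);
    printf("SVE: %s\n", (hw & HWCAP_SVE) ? "yes" : "no");
    return 0;
}
```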
Perplexity
I also verified that the perplexity matches between the NEON and SVE implementations.
NEON (original) | SVE (this PR) |
---|---|
2.9394 +/- 0.35779 | 2.9394 +/- 0.35779 |
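The PR does not quote the exact command, but perplexity in llama.cpp is typically measured along these lines (the evaluation file path here is a placeholder):

./llama-perplexity -m ${PATH_TO_MODEL} -f wikitext-2-raw/wiki.test.raw

Identical perplexity between the two implementations indicates the SVE kernel is numerically equivalent to the NEON one, so the speedup comes with no accuracy cost.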