
ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot


This PR introduces support for SVE (Scalable Vector Extension) kernels for the q3_K_q8_K vector dot product on the Arm architecture. Similar proposals for SVE support were made in PR https://github.com/ggerganov/llama.cpp/pull/7433 and PR https://github.com/ggml-org/llama.cpp/pull/11227.

This PR contains the SVE implementation of the vector dot product used for the Q3_K quantization. Accuracy and performance were measured by running a Q3_K-quantized mistral-7b-v01 model on Graviton 3 (Perf 01 XL).
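For readers unfamiliar with SVE, below is a minimal, hypothetical sketch (not the kernel from this PR) of a predicated int8 dot product written with ACLE intrinsics and the SDOT instruction. The real `ggml_vec_dot_q3_K_q8_K` kernel additionally has to unpack the 3-bit quants, apply the per-sub-block 6-bit scales, and follow the q8_K block layout; the function name `sve_dot_i8` here is made up for illustration. Compile with something like `-march=armv8-a+sve`.

```c
#include <arm_sve.h>
#include <stdint.h>

// Hypothetical sketch: dot product of two int8 arrays of length n.
// SVE is vector-length agnostic, so the loop advances by svcntb()
// (the hardware vector length in bytes) and uses a WHILELT predicate
// to handle the tail without a scalar cleanup loop.
int32_t sve_dot_i8(const int8_t *a, const int8_t *b, int64_t n) {
    svint32_t acc = svdup_n_s32(0);          // int32 accumulator lanes
    int64_t   i   = 0;
    svbool_t  pg  = svwhilelt_b8(i, n);      // active lanes for this step
    while (svptest_any(svptrue_b8(), pg)) {
        svint8_t va = svld1_s8(pg, a + i);   // inactive lanes load as 0
        svint8_t vb = svld1_s8(pg, b + i);
        acc = svdot_s32(acc, va, vb);        // SDOT: 4-way int8 dot into int32
        i  += (int64_t) svcntb();
        pg  = svwhilelt_b8(i, n);
    }
    // Horizontal reduction of the int32 lanes to a scalar.
    return (int32_t) svaddv_s32(svptrue_b32(), acc);
}
```

Because the loop is predicated rather than unrolled to a fixed width, the same binary runs on any SVE vector length (128-bit on Graviton 3 up to 2048-bit), which is the main structural difference from a fixed-width NEON kernel.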

Performance

With this PR, the SVE implementation is approximately 1.02x to 1.15x faster than the NEON implementation, depending on thread count.

  • Decoding Throughput (TPOT), in tokens/s

| Threads | NEON (original) | This PR (SVE) | Ratio |
|--------:|----------------:|--------------:|------:|
| 2       | 4.21            | 4.86          | 1.15  |
| 4       | 8.26            | 9.37          | 1.13  |
| 8       | 15.90           | 17.49         | 1.10  |
| 16      | 29.09           | 31.05         | 1.06  |
| 32      | 42.59           | 43.80         | 1.03  |
| 48      | 48.36           | 49.41         | 1.02  |

The command used to measure performance is:

```
./llama-bench -m ${PATH_TO_MODEL} -n 0 -n 16 -p 64 -t 2,4,8,16,32,48
```

Perplexity

I also verified that perplexity matches between the NEON and SVE implementations.

| NEON (original)     | SVE (this PR)       |
|---------------------|---------------------|
| 2.9394 +/- 0.35779  | 2.9394 +/- 0.35779  |
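The PR text does not show the exact perplexity command; for reference, a typical llama.cpp invocation (the test file and thread count here are assumptions, not from the PR) looks like:

```
./llama-perplexity -m ${PATH_TO_MODEL} -f ${PATH_TO_TEST_TEXT} -t 48
```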
