llama.cpp
ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot
This PR introduces support for SVE (Scalable Vector Extension) kernels for the q3_K_q8_K vector dot on the Arm architecture. Similar SVE support was proposed in PRs https://github.com/ggerganov/llama.cpp/pull/7433 and https://github.com/ggml-org/llama.cpp/pull/11227.
This PR contains the SVE implementation of the vector dot product used for the Q3_K quantization. Accuracy and performance were measured by running a Q3_K-quantized mistral-7b-v01 model on Graviton 3 (Perf 01 XL).
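For context, kernels like this are built around the SVE SDOT instruction. The sketch below is illustrative only and is not the PR's code: the real ggml_vec_dot_q3_K_q8_K kernel additionally unpacks the 3-bit Q3_K weights and applies the per-block scales. It shows the generic SVE int8 dot-product pattern with a predicated tail.

```c
// Minimal sketch of an SVE int8 dot product (NOT the actual PR kernel).
// Compile on an SVE-capable toolchain, e.g. gcc -O2 -march=armv8-a+sve.
#include <arm_sve.h>
#include <stdint.h>

int32_t sve_dot_i8(const int8_t * x, const int8_t * y, int n) {
    svint32_t acc = svdup_n_s32(0);
    const int vl = (int) svcntb();        // int8 lanes per vector, known at runtime
    int i = 0;
    for (; i + vl <= n; i += vl) {
        const svint8_t vx = svld1_s8(svptrue_b8(), x + i);
        const svint8_t vy = svld1_s8(svptrue_b8(), y + i);
        acc = svdot_s32(acc, vx, vy);     // 4-way int8 SDOT into int32 lanes
    }
    // predicated tail: inactive lanes load as zero and contribute nothing
    const svbool_t pg = svwhilelt_b8_s32(i, n);
    const svint8_t vx = svld1_s8(pg, x + i);
    const svint8_t vy = svld1_s8(pg, y + i);
    acc = svdot_s32(acc, vx, vy);
    return svaddv_s32(svptrue_b32(), acc);
}
```

One appeal of this pattern is that it is vector-length agnostic: svcntb() reports the lane count at runtime, so the same binary runs on 256-bit SVE hardware such as Graviton 3 as well as on other vector widths.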
Performance
With this PR, the SVE implementation is roughly 1.02x to 1.15x faster than the NEON implementation, depending on the thread count.
- Decoding throughput (text generation), in tokens per second; higher is better

Threads | NEON (original) | SVE (this PR) | Ratio (SVE/NEON) |
---|---|---|---|
2 | 4.21 | 4.86 | 1.15 |
4 | 8.26 | 9.37 | 1.13 |
8 | 15.90 | 17.49 | 1.10 |
16 | 29.09 | 31.05 | 1.06 |
32 | 42.59 | 43.80 | 1.03 |
48 | 48.36 | 49.41 | 1.02 |
The command used to measure the performance:

./llama-bench -m ${PATH_TO_MODEL} -n 0 -n 16 -p 64 -t 2,4,8,16,32,48

where -m points to the model, -p sets the prompt length, -n the number of generated tokens, and -t the list of thread counts.
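Since the SVE path only runs on SVE-capable hardware such as Graviton 3, it can be worth confirming support before benchmarking. The following standalone check is a sketch, independent of ggml's own feature detection; it reads the Linux/aarch64 hwcaps, where HWCAP_SVE (bit 22) indicates SVE support.

```c
// Sketch: report whether the CPU exposes SVE (Linux/aarch64 only).
#include <stdio.h>
#include <sys/auxv.h>
#ifndef HWCAP_SVE
#define HWCAP_SVE (1UL << 22)   // aarch64 hwcap bit for SVE
#endif

int main(void) {
    const unsigned long hw = getauxval(AT_HWCAP);
    printf("SVE: %s\n", (hw & HWCAP_SVE) ? "yes" : "no");
    return 0;
}
```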
Perplexity
I also verified that the perplexity matches between the NEON and SVE implementations.
NEON (original) | SVE (this PR) |
---|---|
2.9394 +/- 0.35779 | 2.9394 +/- 0.35779 |
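The PR does not quote the exact command, but perplexity in llama.cpp is typically measured along these lines (the evaluation file path here is a placeholder):

./llama-perplexity -m ${PATH_TO_MODEL} -f wikitext-2-raw/wiki.test.raw

Identical perplexity between the two implementations indicates the SVE kernel is numerically equivalent to the NEON one, so the speedup comes with no accuracy cost.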