ggml-cpu: arm64: q4_K repack gemm and gemv implementations (i8mm)
This PR improves the q4_K_q8_K GEMM and GEMV kernels on arm64 using i8mm and vecdot instructions.
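For readers unfamiliar with i8mm, here is a minimal, hedged sketch of the SMMLA building block (`vmmlaq_s32`) these kernels revolve around. It illustrates the instruction only, not the PR's actual kernel code; the file name and data values are made up, and it assumes an i8mm-capable arm64 CPU and toolchain.

```c
// Minimal sketch, assuming an i8mm-capable arm64 CPU (illustration of the
// instruction, not the PR's actual kernel). The SMMLA intrinsic vmmlaq_s32
// multiplies a 2x8 int8 matrix by the transpose of another 2x8 int8 matrix
// and accumulates a 2x2 int32 tile.
// Build (hypothetical file name): cc -O2 -march=armv8.6-a+i8mm i8mm_demo.c -o i8mm_demo
#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Each operand packs two rows of 8 int8 values back-to-back.
    int8_t a_rows[16], b_rows[16];
    for (int i = 0; i < 16; i++) {
        a_rows[i] = (int8_t)(i - 8);
        b_rows[i] = (int8_t)(3 - i);
    }

    int8x16_t va = vld1q_s8(a_rows);
    int8x16_t vb = vld1q_s8(b_rows);

    // acc[i][j] += dot(a_row_i, b_row_j), stored row-major in the int32x4_t
    int32x4_t acc = vmmlaq_s32(vdupq_n_s32(0), va, vb);

    int32_t tile[4];
    vst1q_s32(tile, acc);

    // Scalar reference to show what the instruction computed.
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j++) {
            int32_t ref = 0;
            for (int k = 0; k < 8; k++) ref += a_rows[i*8 + k] * b_rows[j*8 + k];
            printf("tile[%d][%d] = %d (ref %d)\n", i, j, tile[i*2 + j], ref);
        }
    }
    return 0;
}
```

In the actual kernels, the repacked q4_K blocks are laid out so the unpacked int8 data can be fed to `vmmlaq_s32` (GEMM) and `vdotq_s32` (GEMV) in this fashion.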
Tested on an Apple M4 Max:
REPACK vs NO REPACK
| model | backend | threads | test | REPACK t/s | NO REPACK t/s | speedup |
|---|---|---|---|---|---|---|
| lfm2 1.2B Q4_K | CPU | 8 | pp256 | 683.08 ± 1.56 | 408.04 ± 0.78 | 1.67 |
| lfm2 1.2B Q4_K | CPU | 8 | tg128 | 235.38 ± 0.81 | 214.06 ± 1.33 | 1.10 |
| lfm2 700M Q4_K | CPU | 8 | pp256 | 1070.35 ± 0.72 | 645.21 ± 15.57 | 1.66 |
| lfm2 700M Q4_K | CPU | 8 | tg128 | 335.03 ± 0.86 | 311.28 ± 3.89 | 1.08 |
| llama 8B Q4_K | CPU | 8 | pp256 | 98.37 ± 1.57 | 60.62 ± 0.20 | 1.62 |
| llama 8B Q4_K | CPU | 8 | tg128 | 42.25 ± 0.52 | 38.33 ± 0.38 | 1.10 |
| qwen3 8B Q4_K | CPU | 8 | pp256 | 92.10 ± 0.35 | 60.11 ± 0.21 | 1.53 |
| qwen3 8B Q4_K | CPU | 8 | tg128 | 40.60 ± 0.35 | 37.22 ± 0.42 | 1.09 |
REPACK: 8a2fd9344 (7070) NO REPACK: 45c6ef730 (7058)
Perplexity
| model | REPACK PPL | NO REPACK PPL |
|---|---|---|
| LFM2 700M Q4_K_M | 20.2207 ± 0.86775 | 20.2207 ± 0.86775 |
| Qwen3 8B 128K Q4_K_M | 10.8862 ± 0.46691 | 10.8862 ± 0.46691 |
| LFM2 1.2B Q4_K_M | 15.5833 ± 0.62558 | 15.5833 ± 0.62558 |
As for test-backend-ops, I've checked the layer tensor outputs manually, comparing REPACK vs master, since https://github.com/ggml-org/llama.cpp/pull/16182 is still ongoing.
@ggerganov is there something else needed from my side, or are we waiting for another review?
There seems to be a bug somewhere. Here is repro on M4 Max:
```
../scripts/get-wikitext-2.sh
make -j && ./bin/llama-perplexity -hf LiquidAI/LFM2-2.6B-GGUF:Q4_K_M -f ./wikitext-2-raw/wiki.test.raw -dev none
...
# PPL sky-rockets:
0.01.007.961 I perplexity: calculating perplexity over 581 chunks, n_ctx=512, batch_size=2048, n_seq=4
0.05.476.977 I perplexity: 4.47 seconds per pass - ETA 10.82 minutes
[1]6.8941,[2]1485.3563,[3]8468.4132,[4]21269.3291,[5]4800.3655,[6]9365.2385,[7]15453.2190,[8]22744.0153,^C
```
I was able to replicate the PPL skyrocketing with the generic implementation as well:
```
# ggml_gemm_q4_K_8x8_q8_K_generic
perplexity: 34.48 seconds per pass - ETA 1.43 minutes
[1]9.6770,[2]1762.7802,[3]9505.4348,[4]22802.6452,[5]5311.2750,[6]10333.9703,[7]16582.8044,[8]23315.3388,[9]11093.7993,[10]14942.7293,

# ggml_gemm_q4_K_8x8_q8_K
perplexity: 2.71 seconds per pass - ETA 0.10 minutes
[1]9.7353,[2]1764.9156,[3]9519.3014,[4]22839.7651,[5]5320.7637,[6]10348.6530,[7]16591.6868,[8]23311.9378
```
I'll try to figure out what is going on.
Edit:
```
# Q4_0 Model
perplexity: 1.84 seconds per pass - ETA 0.07 minutes
[1]9.9763,[2]1820.5697,[3]9757.8288,[4]23501.0590,[5]5479.2732,[6]10610.0991,[7]17050.2390,[8]23943.4191,[9]11327.5779,[10]15263.4054,
```
Also happens with Q4_0 repack. Interestingly, it happens from the second chunk onwards. I'll try to run on an AVX machine and see if it's something totally unrelated to the GEMMs themselves.
I also compared the tensor outputs of all mul_mat ops for a couple of llama-eval-callback runs, and the results were practically identical, apart from a 0.0001 deviation here and there.
What I don't understand is how I was able to run the PPL with LFM correctly before; I may have messed up GGML_CPU_REPACK in the build, sorry about that.
Hm yes - Q4_0 with LFM is indeed also problematic. However Q4_0 with llama 3.1 8B is good. So this means there is a bug that occurs only for certain shapes.
I've opened #17030 for the fix.
> Hm yes - Q4_0 with LFM is indeed also problematic. However Q4_0 with llama 3.1 8B is good. So this means there is a bug that occurs only for certain shapes.
As you pointed out, LFM2 had some MUL_MAT layers with a (6144, 256, 2, 1) tensor, where only the first 6144*256 elements were multiplied.
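A hedged, simplified illustration of that failure mode (hypothetical code, not llama.cpp internals; the dimensions are small stand-ins for the (6144, 256, 2, 1) shape): when a mul_mat src has ne2 > 1, the 2D multiply has to run once per i2 slice, and skipping that outer loop computes only the first ne0*ne1 elements.

```c
// Hypothetical sketch of the bug described above, not llama.cpp code.
// NE0/NE1/NE2 are small stand-ins for LFM2's (6144, 256, 2, 1) tensor.
#include <stdio.h>

enum { NE0 = 8, NE1 = 4, NE2 = 2 };

int main(void) {
    float w[NE2][NE1][NE0], x[NE2][NE0];
    float y_ref[NE2][NE1] = {{0}}, y_bug[NE2][NE1] = {{0}};

    for (int i2 = 0; i2 < NE2; i2++) {
        for (int r = 0; r < NE1; r++)
            for (int c = 0; c < NE0; c++)
                w[i2][r][c] = (float)(r + c + i2);
        for (int c = 0; c < NE0; c++)
            x[i2][c] = (float)(c - i2);
    }

    // Correct: iterate over every slice along ne2.
    for (int i2 = 0; i2 < NE2; i2++)
        for (int r = 0; r < NE1; r++)
            for (int c = 0; c < NE0; c++)
                y_ref[i2][r] += w[i2][r][c] * x[i2][c];

    // Buggy: the outer i2 loop is missing, so only slice 0 is computed.
    for (int r = 0; r < NE1; r++)
        for (int c = 0; c < NE0; c++)
            y_bug[0][r] += w[0][r][c] * x[0][c];

    for (int i2 = 0; i2 < NE2; i2++)
        printf("slice %d: ref %.1f vs bug %.1f\n", i2, y_ref[i2][0], y_bug[i2][0]);
    return 0;
}
```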
@ggerganov https://github.com/ggml-org/llama.cpp/pull/17241 fixed the perplexity issues, so this PR is ready for review again (it's rebased on top of master).
@ggerganov sorry for pinging again! I don't have merge rights. Could you please merge it?
It's pending review by @slaren
Ah, sorry for the misunderstanding! I got another PR merged with a single review and didn't realize both approvals were needed. Thanks!