
ggml-cpu: arm64: q4_K repack gemm and gemv implementations (i8mm)

Alcpz opened this pull request 2 months ago · 6 comments

This PR improves the q4_K x q8_K GEMM and GEMV paths on arm64 using i8mm and vecdot instructions.
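
For context on the instruction family being used: the SMMLA instruction behind i8mm (exposed as `vmmlaq_s32` in `arm_neon.h`) multiplies a 2x8 int8 tile by another 2x8 int8 tile (transposed) and accumulates into a 2x2 int32 tile, so each output lane absorbs eight int8 products per instruction. Below is a minimal illustrative sketch of that building block, assuming compilation with `-march=armv8.2-a+i8mm`; it is not the kernel added by this PR, and the tile layout and names are placeholders.

```c
#include <arm_neon.h>

// Sketch only: accumulate a 2x2 int32 tile C += A * B^T, where A and B are
// 2xK int8 matrices stored as consecutive 2x8 sub-tiles (16 bytes each).
static inline int32x4_t mmla_2x2(const int8_t *a, const int8_t *b, int k_len) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int k = 0; k < k_len; k += 8) {
        int8x16_t va = vld1q_s8(a + 2*k);  // 2x8 tile of q8_K activations
        int8x16_t vb = vld1q_s8(b + 2*k);  // 2x8 tile of unpacked q4_K weights
        acc = vmmlaq_s32(acc, va, vb);     // SMMLA: 8 int8 MACs per output lane
    }
    // Lanes hold {C00, C01, C10, C11}; a real q4_K x q8_K kernel then applies
    // the per-block scales and mins before accumulating into float.
    return acc;
}
```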

Tested on an Apple M4 Max:

REPACK vs NO REPACK

| model | backend | threads | test | REPACK t/s | NO REPACK t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- |
| lfm2 1.2B Q4_K | CPU | 8 | pp256 | 683.08 ± 1.56 | 408.04 ± 0.78 | 1.67 |
| lfm2 1.2B Q4_K | CPU | 8 | tg128 | 235.38 ± 0.81 | 214.06 ± 1.33 | 1.10 |
| lfm2 700M Q4_K | CPU | 8 | pp256 | 1070.35 ± 0.72 | 645.21 ± 15.57 | 1.66 |
| lfm2 700M Q4_K | CPU | 8 | tg128 | 335.03 ± 0.86 | 311.28 ± 3.89 | 1.08 |
| llama 8B Q4_K | CPU | 8 | pp256 | 98.37 ± 1.57 | 60.62 ± 0.20 | 1.62 |
| llama 8B Q4_K | CPU | 8 | tg128 | 42.25 ± 0.52 | 38.33 ± 0.38 | 1.10 |
| qwen3 8B Q4_K | CPU | 8 | pp256 | 92.10 ± 0.35 | 60.11 ± 0.21 | 1.53 |
| qwen3 8B Q4_K | CPU | 8 | tg128 | 40.60 ± 0.35 | 37.22 ± 0.42 | 1.09 |

REPACK: 8a2fd9344 (7070) NO REPACK: 45c6ef730 (7058)

Perplexity

| model | REPACK PPL | NO REPACK PPL |
| --- | --- | --- |
| LFM2 700M Q4_K_M | 20.2207 ± 0.86775 | 20.2207 ± 0.86775 |
| Qwen3 8B 128K Q4_K_M | 10.8862 ± 0.46691 | 10.8862 ± 0.46691 |
| LFM2 1.2B Q4_K_M | 15.5833 ± 0.62558 | 15.5833 ± 0.62558 |

As for test-backend-ops, I've manually checked the layer tensor outputs, comparing REPACK against master, since https://github.com/ggml-org/llama.cpp/pull/16182 is still ongoing.

Alcpz · Oct 23 '25 12:10

@ggerganov is there something else needed from my side, or are we waiting for another review?

Alcpz · Oct 31 '25 11:10

There seems to be a bug somewhere. Here is repro on M4 Max:

../scripts/get-wikitext-2.sh
make -j && ./bin/llama-perplexity -hf LiquidAI/LFM2-2.6B-GGUF:Q4_K_M -f ./wikitext-2-raw/wiki.test.raw -dev none

...

# PPL sky-rockets:
0.01.007.961 I perplexity: calculating perplexity over 581 chunks, n_ctx=512, batch_size=2048, n_seq=4
0.05.476.977 I perplexity: 4.47 seconds per pass - ETA 10.82 minutes
[1]6.8941,[2]1485.3563,[3]8468.4132,[4]21269.3291,[5]4800.3655,[6]9365.2385,[7]15453.2190,[8]22744.0153,^C

ggerganov · Oct 31 '25 12:10

I was able to replicate the PPL skyrocketing with the generic implementation as well:

# ggml_gemm_q4_K_8x8_q8_K_generic
perplexity: 34.48 seconds per pass - ETA 1.43 minutes
[1]9.6770,[2]1762.7802,[3]9505.4348,[4]22802.6452,[5]5311.2750,[6]10333.9703,[7]16582.8044,[8]23315.3388,[9]11093.7993,[10]14942.7293,

# ggml_gemm_q4_K_8x8_q8_K
perplexity: 2.71 seconds per pass - ETA 0.10 minutes
[1]9.7353,[2]1764.9156,[3]9519.3014,[4]22839.7651,[5]5320.7637,[6]10348.6530,[7]16591.6868,[8]23311.9378

I'll try to figure out what is going on.

Edit:

# Q4_0 Model
perplexity: 1.84 seconds per pass - ETA 0.07 minutes
[1]9.9763,[2]1820.5697,[3]9757.8288,[4]23501.0590,[5]5479.2732,[6]10610.0991,[7]17050.2390,[8]23943.4191,[9]11327.5779,[10]15263.4054,

Also happens with Q4_0 repack. Interestingly, it happens from the second chunk onwards. I'll try to run on an AVX machine and see if it's something totally unrelated to the GEMMs themselves.

I also compared the tensor outputs of all mul_mats for a couple of llama-eval-callback runs and the results were practically identical, apart from a ~0.0001 deviation here and there.
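
For reference, that check boils down to an element-wise comparison of the dumped tensors against a small tolerance. A hedged sketch of the idea (the helper below is hypothetical, not part of llama.cpp or the tooling used here):

```c
#include <math.h>
#include <stddef.h>
#include <stdio.h>

// Hypothetical helper: report the max absolute difference between a REPACK
// run and a master run of the same tensor, and pass/fail against a tolerance.
static int compare_f32(const float *repack, const float *master, size_t n, float tol) {
    float max_diff = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = fabsf(repack[i] - master[i]);
        if (d > max_diff) max_diff = d;
    }
    printf("max abs diff: %g\n", max_diff);
    return max_diff <= tol;  // e.g. tol = 1e-4f for the deviations mentioned above
}
```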

What I don't understand is how I was able to run the PPL with LFM correctly earlier; I may have messed up GGML_CPU_REPACK in the build, sorry about that.

Alcpz · Oct 31 '25 13:10

Hm yes - Q4_0 with LFM is indeed also problematic. However Q4_0 with llama 3.1 8B is good. So this means there is a bug that occurs only for certain shapes.

ggerganov · Oct 31 '25 14:10

I've opened #17030 for the fix.

> Hm yes - Q4_0 with LFM is indeed also problematic. However Q4_0 with llama 3.1 8B is good. So this means there is a bug that occurs only for certain shapes.

As you pointed out, LFM2 had some MAT_MUL layers with a (6144, 256, 2, 1) tensor, where only the first 6144*256 elements were multiplied.
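
To make the shape issue concrete, here is a minimal sketch using ggml's ne/nb naming (illustrative only, not the actual repack code): a mul_mat path has to walk every ne[2]/ne[3] plane of the source, otherwise a (6144, 256, 2, 1) tensor only gets its first 6144x256 plane multiplied.

```c
#include <stddef.h>
#include <stdint.h>

// Illustrative only: ne = elements per dimension, nb = strides in bytes,
// gemm_2d_plane stands in for a 2D GEMM kernel over one 6144x256 slice.
static void mul_mat_all_planes(const int64_t ne[4], const size_t nb[4],
                               const char *data,
                               void (*gemm_2d_plane)(const char *plane)) {
    for (int64_t i3 = 0; i3 < ne[3]; i3++) {
        for (int64_t i2 = 0; i2 < ne[2]; i2++) {
            // A kernel that stops after i2 == 0 multiplies only the first
            // 6144*256 elements, which is the failure mode described above.
            gemm_2d_plane(data + i3*nb[3] + i2*nb[2]);
        }
    }
}
```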

Alcpz · Nov 05 '25 17:11

@ggerganov https://github.com/ggml-org/llama.cpp/pull/17241 fixed the perplexity issues, so this PR is again ready for review (it's rebased on top of master).

Alcpz · Nov 18 '25 11:11

@ggerganov sorry for pinging again! I don't have merge rights. Could you please merge it?

Alcpz · Nov 20 '25 12:11

It's pending review by @slaren

ggerganov · Nov 20 '25 12:11

Ah, sorry for the misunderstanding! I got another PR merged with a single review and didn't realize both approvals were needed here. Thanks!

Alcpz · Nov 20 '25 16:11