ggml-cpu: arm64: q4_K repack gemm and gemv implementations (i8mm)
This PR improves the q4_K_q8_K GEMM and GEMV kernels on arm64 using i8mm and vecdot instructions.
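For readers unfamiliar with i8mm, here is a minimal, hedged sketch of the SMMLA building block (`vmmlaq_s32`) these kernels revolve around. It illustrates the instruction only, not the PR's actual kernel code; the file name and data values are made up, and it assumes an i8mm-capable arm64 CPU and toolchain.

```c
// Minimal sketch, assuming an i8mm-capable arm64 CPU (illustration of the
// instruction, not the PR's actual kernel). The SMMLA intrinsic vmmlaq_s32
// multiplies a 2x8 int8 matrix by the transpose of another 2x8 int8 matrix
// and accumulates a 2x2 int32 tile.
// Build (hypothetical file name): cc -O2 -march=armv8.6-a+i8mm i8mm_demo.c -o i8mm_demo
#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Each operand packs two rows of 8 int8 values back-to-back.
    int8_t a_rows[16], b_rows[16];
    for (int i = 0; i < 16; i++) {
        a_rows[i] = (int8_t)(i - 8);
        b_rows[i] = (int8_t)(3 - i);
    }

    int8x16_t va = vld1q_s8(a_rows);
    int8x16_t vb = vld1q_s8(b_rows);

    // acc[i][j] += dot(a_row_i, b_row_j), stored row-major in the int32x4_t
    int32x4_t acc = vmmlaq_s32(vdupq_n_s32(0), va, vb);

    int32_t tile[4];
    vst1q_s32(tile, acc);

    // Scalar reference to show what the instruction computed.
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j++) {
            int32_t ref = 0;
            for (int k = 0; k < 8; k++) ref += a_rows[i*8 + k] * b_rows[j*8 + k];
            printf("tile[%d][%d] = %d (ref %d)\n", i, j, tile[i*2 + j], ref);
        }
    }
    return 0;
}
```

In the actual kernels, the repacked q4_K blocks are laid out so the unpacked int8 data can be fed to `vmmlaq_s32` (GEMM) and `vdotq_s32` (GEMV) in this fashion.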
Tested on an Apple M4 Max:
REPACK vs NO REPACK
| model | backend | threads | test | REPACK t/s | NO REPACK t/s | speedup |
|---|---|---|---|---|---|---|
| lfm2 1.2B Q4_K | CPU | 8 | pp256 | 683.08 ± 1.56 | 408.04 ± 0.78 | 1.67 |
| lfm2 1.2B Q4_K | CPU | 8 | tg128 | 235.38 ± 0.81 | 214.06 ± 1.33 | 1.10 |
| lfm2 700M Q4_K | CPU | 8 | pp256 | 1070.35 ± 0.72 | 645.21 ± 15.57 | 1.66 |
| lfm2 700M Q4_K | CPU | 8 | tg128 | 335.03 ± 0.86 | 311.28 ± 3.89 | 1.08 |
| llama 8B Q4_K | CPU | 8 | pp256 | 98.37 ± 1.57 | 60.62 ± 0.20 | 1.62 |
| llama 8B Q4_K | CPU | 8 | tg128 | 42.25 ± 0.52 | 38.33 ± 0.38 | 1.10 |
| qwen3 8B Q4_K | CPU | 8 | pp256 | 92.10 ± 0.35 | 60.11 ± 0.21 | 1.53 |
| qwen3 8B Q4_K | CPU | 8 | tg128 | 40.60 ± 0.35 | 37.22 ± 0.42 | 1.09 |
REPACK: 8a2fd9344 (7070) NO REPACK: 45c6ef730 (7058)
Perplexity
| model | REPACK PPL | NO REPACK PPL |
|---|---|---|
| LFM2 700M Q4_K_M | 20.2207 ± 0.86775 | 20.2207 ± 0.86775 |
| Qwen3 8B 128K Q4_K_M | 10.8862 ± 0.46691 | 10.8862 ± 0.46691 |
| LFM2 1.2B Q4_K_M | 15.5833 ± 0.62558 | 15.5833 ± 0.62558 |
As for test-backend-ops, I've checked the layer tensor outputs manually, comparing REPACK vs master, since https://github.com/ggml-org/llama.cpp/pull/16182 is still ongoing.
@ggerganov is there something else needed from my side, or are we waiting for another review?
There seems to be a bug somewhere. Here is repro on M4 Max:
```
../scripts/get-wikitext-2.sh
make -j && ./bin/llama-perplexity -hf LiquidAI/LFM2-2.6B-GGUF:Q4_K_M -f ./wikitext-2-raw/wiki.test.raw -dev none
...
# PPL sky-rockets:
0.01.007.961 I perplexity: calculating perplexity over 581 chunks, n_ctx=512, batch_size=2048, n_seq=4
0.05.476.977 I perplexity: 4.47 seconds per pass - ETA 10.82 minutes
[1]6.8941,[2]1485.3563,[3]8468.4132,[4]21269.3291,[5]4800.3655,[6]9365.2385,[7]15453.2190,[8]22744.0153,^C
```
I was able to replicate the PPL skyrocketing with the generic implementation as well:
```
# ggml_gemm_q4_K_8x8_q8_K_generic
perplexity: 34.48 seconds per pass - ETA 1.43 minutes
[1]9.6770,[2]1762.7802,[3]9505.4348,[4]22802.6452,[5]5311.2750,[6]10333.9703,[7]16582.8044,[8]23315.3388,[9]11093.7993,[10]14942.7293,

# ggml_gemm_q4_K_8x8_q8_K
perplexity: 2.71 seconds per pass - ETA 0.10 minutes
[1]9.7353,[2]1764.9156,[3]9519.3014,[4]22839.7651,[5]5320.7637,[6]10348.6530,[7]16591.6868,[8]23311.9378
```
I'll try to figure out what is going on.
Edit:
```
# Q4_0 Model
perplexity: 1.84 seconds per pass - ETA 0.07 minutes
[1]9.9763,[2]1820.5697,[3]9757.8288,[4]23501.0590,[5]5479.2732,[6]10610.0991,[7]17050.2390,[8]23943.4191,[9]11327.5779,[10]15263.4054,
```
Also happens with Q4_0 repack. Interestingly, it happens from the second chunk onwards. I'll try to run on an AVX machine and see if it's something totally unrelated to the GEMMs themselves.
I also compared the tensor outputs of all mul_mat ops for a couple of llama-eval-callback runs, and the results were practically identical, apart from a 0.0001 deviation here and there.
What I don't understand is how I was able to run the PPL with LFM correctly before; I may have messed up GGML_CPU_REPACK in the build, sorry about that.
Hm yes - Q4_0 with LFM is indeed also problematic. However Q4_0 with llama 3.1 8B is good. So this means there is a bug that occurs only for certain shapes.
I've opened #17030 for the fix.
> Hm yes - Q4_0 with LFM is indeed also problematic. However Q4_0 with llama 3.1 8B is good. So this means there is a bug that occurs only for certain shapes.
As you pointed out, LFM2 had some MUL_MAT layers with a (6144, 256, 2, 1) tensor, where only the first 6144*256 elements were multiplied.
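A hedged, simplified illustration of that failure mode (hypothetical code, not llama.cpp internals; the dimensions are small stand-ins for the (6144, 256, 2, 1) shape): when a mul_mat src has ne2 > 1, the 2D multiply has to run once per i2 slice, and skipping that outer loop computes only the first ne0*ne1 elements.

```c
// Hypothetical sketch of the bug described above, not llama.cpp code.
// NE0/NE1/NE2 are small stand-ins for LFM2's (6144, 256, 2, 1) tensor.
#include <stdio.h>

enum { NE0 = 8, NE1 = 4, NE2 = 2 };

int main(void) {
    float w[NE2][NE1][NE0], x[NE2][NE0];
    float y_ref[NE2][NE1] = {{0}}, y_bug[NE2][NE1] = {{0}};

    for (int i2 = 0; i2 < NE2; i2++) {
        for (int r = 0; r < NE1; r++)
            for (int c = 0; c < NE0; c++)
                w[i2][r][c] = (float)(r + c + i2);
        for (int c = 0; c < NE0; c++)
            x[i2][c] = (float)(c - i2);
    }

    // Correct: iterate over every slice along ne2.
    for (int i2 = 0; i2 < NE2; i2++)
        for (int r = 0; r < NE1; r++)
            for (int c = 0; c < NE0; c++)
                y_ref[i2][r] += w[i2][r][c] * x[i2][c];

    // Buggy: the outer i2 loop is missing, so only slice 0 is computed.
    for (int r = 0; r < NE1; r++)
        for (int c = 0; c < NE0; c++)
            y_bug[0][r] += w[0][r][c] * x[0][c];

    for (int i2 = 0; i2 < NE2; i2++)
        printf("slice %d: ref %.1f vs bug %.1f\n", i2, y_ref[i2][0], y_bug[i2][0]);
    return 0;
}
```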
@ggerganov https://github.com/ggml-org/llama.cpp/pull/17241 fixed the perplexity issues, so this PR is ready for review again (it's rebased on top of master).
@ggerganov sorry for pinging again! I don't have merge rights. Could you please merge it?
It's pending review by @slaren
Ah, sorry for the misunderstanding! I got another PR merged with a single review and didn't realize both approvals were needed. Thanks!