[Perf] [CPU] eliminate redundant memory access in group query attention
Modern LLMs (Llama 3, Qwen 2.5, etc.) usually use group query attention (GQA), which significantly reduces the memory used by the KV cache. With group query attention, the query rows of neighboring query heads share the K/V rows of the same KV head, so we can reorder the loop as follows to improve the spatial locality of memory accesses:
# Python-style pseudocode: load each K/V row once per KV head (group)
# and reuse it for all query heads in that group
for group_id in range(0, group_num):
    for seq_id in range(0, seq_length):
        k = load_k(group_id, seq_id)
        v = load_v(group_id, seq_id)
        for head_id in range(group_id * n_gqa, group_id * n_gqa + n_gqa):
            q = load_q(head_id, seq_id)
            compute(q, k, v)
However, the original implementation of the CPU flash attention kernel did not take this into account; this PR improves it.
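For contrast, here is a sketch of the per-head loop order described above as the original behavior (my simplified reading, in the same pseudocode style; load_k, load_v, load_q and compute are the same placeholders as above). Each query head walks the K/V rows of its group independently, so every K/V row is fetched n_gqa times instead of once:

# Python-style pseudocode of the unreordered loop (simplified sketch)
for head_id in range(0, head_num):
    group_id = head_id // n_gqa              # KV head shared by this query head
    for seq_id in range(0, seq_length):
        k = load_k(group_id, seq_id)         # these loads repeat for each of the
        v = load_v(group_id, seq_id)         # n_gqa query heads in the group
        q = load_q(head_id, seq_id)
        compute(q, k, v)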
This is my test command:
./build/bin/llama-cli -t 4 -fa --ctx-size 8192 -m models/Qwen2.5-Coder-7B-Instruct-Q2_K.gguf -f convert_lora_to_gguf.py
The master branch result:
llama_perf_sampler_print: sampling time = 45.59 ms / 4647 runs ( 0.01 ms per token, 101939.19 tokens per second)
llama_perf_context_print: load time = 687.54 ms
llama_perf_context_print: prompt eval time = 588053.13 ms / 4412 tokens ( 133.28 ms per token, 7.50 tokens per second)
llama_perf_context_print: eval time = 71929.76 ms / 234 runs ( 307.39 ms per token, 3.25 tokens per second)
llama_perf_context_print: total time = 660956.03 ms / 4646 tokens
Interrupted by user
With the optimization, the result is:
llama_perf_sampler_print: sampling time = 56.22 ms / 4717 runs ( 0.01 ms per token, 83901.03 tokens per second)
llama_perf_context_print: load time = 870.17 ms
llama_perf_context_print: prompt eval time = 574061.97 ms / 4415 tokens ( 130.03 ms per token, 7.69 tokens per second)
llama_perf_context_print: eval time = 71333.37 ms / 301 runs ( 236.99 ms per token, 4.22 tokens per second)
llama_perf_context_print: total time = 646281.74 ms / 4716 tokens
Interrupted by user
We can see a slight speedup in prefill and a clear speedup in decode (3.25 → 4.22 tokens per second, roughly 30% higher throughput)!
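As a rough back-of-the-envelope check on why decode benefits, the sketch below estimates the K/V cache traffic per decoded token under each loop order. The dimensions used (28 query heads, 4 KV heads, head size 128, F16 K/V cache, 8192 cached positions) are illustrative assumptions, not values read from the logs above:

# Hypothetical GQA dimensions, for illustration only
n_head     = 28                     # query heads
n_kv_head  = 4                      # KV heads
n_gqa      = n_head // n_kv_head    # query heads per KV head (7)
head_size  = 128
elem_bytes = 2                      # F16 K/V cache
ctx        = 8192                   # cached positions

kv_row_bytes = 2 * head_size * elem_bytes           # one K row plus one V row

per_head_order  = n_head    * ctx * kv_row_bytes    # K/V re-read for every query head
per_group_order = n_kv_head * ctx * kv_row_bytes    # K/V read once per group

print(per_head_order // 2**20, "MiB vs", per_group_order // 2**20, "MiB per token")
# -> 112 MiB vs 16 MiB: a factor of n_gqa (7x) less K/V data touched per token

This only bounds the attention part of decode; the feed-forward weights still account for most of the total memory traffic, which is consistent with the end-to-end gain being much smaller than 7x.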
Further work:
- flash decoding: in this PR, when n_kv_head is smaller than the thread count and there is only one concurrent request, this CPU kernel cannot use all the threads. We can solve this by using flash decoding (see the sketch after this list).
- load balancing between threads: in causal attention, the amount of computation differs between rows, but the current implementation doesn't take that into consideration, which slows down multi-threaded long-context prefill.
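A minimal sketch of the flash-decoding idea from the first bullet, assuming a single query row and plain Python lists: split the cached KV positions into chunks, let each thread compute a partial result with its own running max and normalizer, then merge the partials into the exact attention output. All function names here are placeholders for illustration, not existing llama.cpp code:

import math

def partial_attn(q, K, V, start, end):
    # one chunk of KV positions -> (running max, normalizer, unnormalized weighted V sum)
    m, s, acc = -math.inf, 0.0, [0.0] * len(V[0])
    for t in range(start, end):
        x = sum(qi * ki for qi, ki in zip(q, K[t]))   # dot(q, k_t); scaling omitted
        m_new = max(m, x)
        scale = math.exp(m - m_new)                   # rescale old accumulators
        w = math.exp(x - m_new)
        s = s * scale + w
        acc = [a * scale + w * v for a, v in zip(acc, V[t])]
        m = m_new
    return m, s, acc

def merge(parts):
    # combine per-chunk partials into softmax(q.K^T).V over the full sequence
    m = max(p[0] for p in parts)
    s, acc = 0.0, [0.0] * len(parts[0][2])
    for pm, ps, pacc in parts:
        scale = math.exp(pm - m)
        s += ps * scale
        acc = [a + pa * scale for a, pa in zip(acc, pacc)]
    return [a / s for a in acc]

# e.g. 4 workers, 8192 cached positions, 2048 positions per chunk:
# parts = [partial_attn(q, K, V, c, c + 2048) for c in range(0, 8192, 2048)]
# out   = merge(parts)

Each chunk can run on a different thread even when there is only one KV head and a single request, which is exactly the case the first bullet describes.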
My test environment:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
CPU family: 6
Model: 142
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 12
BogoMIPS: 4607.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves md_clear flush_l1d arch_capabilities
Virtualization features:
Hypervisor vendor: Microsoft
Virtualization type: full
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 8 MiB (1 instance)
Vulnerabilities:
Itlb multihit: KVM: Mitigation: VMX unsupported
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Srbds: Mitigation; TSX disabled
Tsx async abort: Not affected
Just tested this out of curiosity: Qwen 3 degrades in quality (ignores /no_think, for example), Mistral Small 2 outputs empty characters. Does it break compatibility with older models? Windows 10, i7-8700, CPU backend only.
Hi, thanks for your reply! It does not break compatibility with older models in theory, but there might be a small bug in my implementation. In my test it works with Qwen 2.5 7B. Can you tell me the size of the Qwen 3 model you used to test? I will test both Qwen 3 and Mistral to debug.
I've tested both 8b and 4b models in Q6, both worked correctly without this PR. Mistral Small 2 is in Q5_K_L, works correctly on main too.
Thanks, I have reproduced the same problem. I will try to fix it.
I have fixed the bug. Are there any scripts to format the code locally? This PR cannot pass the code lint check right now.
Thank you! I've already deleted Qwen models, unfortunately, but Mistral Small 2 generates text correctly now. I'll test it a bit more with other models, but so far it seems to be fixed.
On an i7-8700 with Mistral Small 3 (the 24B one, Q4_K_M) I get 2.08 t/s with this PR vs 1.97 t/s on current main.
Hm, I opened your PR in my editor and saw this:
I ran editorconfig-checker, removed the trailing whitespace, and ran it again; the error is gone. The line that was flagged was 7058 when running the checker locally instead of 7045 in CI:
Here's a patch that fixes it, in case you can't find the line:
From 3c7b2ed48acfcb5a9c06846ed0b548b3e48707af Mon Sep 17 00:00:00 2001
From: Excigma <[email protected]>
Date: Mon, 12 May 2025 15:46:03 +1200
Subject: [PATCH] style: remove trailing whitespace
---
ggml/src/ggml-cpu/ops.cpp | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index 250b6abc..a1481d9e 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -7055,7 +7055,7 @@ static void ggml_compute_forward_flash_attn_ext_f16(
const float * pq = (const float *) ((char *) q->data + (iq1*nbq1 + (iq2 + i_gqa)*nbq2 + iq3*nbq3));
q_to_vec_dot(pq, Q_q[i_gqa], DK);
-
+
const uint32_t h = iq2 + i_gqa;
slope[i_gqa] = (max_bias > 0.0f) ? h < n_head_log2 ? powf(m0, h + 1) : powf(m1, 2*(h - n_head_log2) + 1) : 1.0f;
}
--
2.49.0
Thanks for your reply! It's true that line 7045 has a problem; I will fix it.
@slaren Hi, would you mind reviewing this PR when you have time?
I’ve verified the changes pass all tests and followed the contribution guidelines. Happy to address any feedback!
Thanks for your time! 🙏
I tried this, but the performance I see on my system is not always better, and in some cases it is worse than master. The problem with testing with llama-cli is that it is very susceptible to random variations, and the results are not always repeatable. Instead, you can use the llama-bench and test-backend-ops tools to test the performance. It would be useful if you can provide results with these tools that show a clear improvement on your system.
You can try these commands to get started with these tools:
llama-bench -m <model.gguf> -fa 1 -p 128 -n 32 -d 0,128
test-backend-ops -b CPU -o FLASH_ATTN_EXT perf
Thanks for your patient guidance! I will try these commands.

