[Perf] [CPU] eliminate redundant memory access in group query attention
Modern LLMs (Llama 3, Qwen 2.5, etc.) usually use group query attention (GQA), which significantly reduces the memory used by the KV cache. With group query attention, the query rows of neighboring query heads share the K/V rows of the same KV head, so we can reorder the loop as follows to improve the spatial locality of memory accesses:
# Python-style pseudocode: load each K/V row once per KV head (group)
# and reuse it for all query heads in that group
for group_id in range(0, group_num):
    for seq_id in range(0, seq_length):
        k = load_k(group_id, seq_id)
        v = load_v(group_id, seq_id)
        for head_id in range(group_id * n_gqa, group_id * n_gqa + n_gqa):
            q = load_q(head_id, seq_id)
            compute(q, k, v)
However, the original implementation of the CPU flash attention kernel did not take this into account; this PR improves it.
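For contrast, here is a sketch of the per-head loop order described above as the original behavior (my simplified reading, in the same pseudocode style; load_k, load_v, load_q and compute are the same placeholders as above). Each query head walks the K/V rows of its group independently, so every K/V row is fetched n_gqa times instead of once:

# Python-style pseudocode of the unreordered loop (simplified sketch)
for head_id in range(0, head_num):
    group_id = head_id // n_gqa              # KV head shared by this query head
    for seq_id in range(0, seq_length):
        k = load_k(group_id, seq_id)         # these loads repeat for each of the
        v = load_v(group_id, seq_id)         # n_gqa query heads in the group
        q = load_q(head_id, seq_id)
        compute(q, k, v)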
This is my test command:
./build/bin/llama-cli -t 4 -fa --ctx-size 8192 -m models/Qwen2.5-Coder-7B-Instruct-Q2_K.gguf -f convert_lora_to_gguf.py
The master branch result:
llama_perf_sampler_print: sampling time = 45.59 ms / 4647 runs ( 0.01 ms per token, 101939.19 tokens per second)
llama_perf_context_print: load time = 687.54 ms
llama_perf_context_print: prompt eval time = 588053.13 ms / 4412 tokens ( 133.28 ms per token, 7.50 tokens per second)
llama_perf_context_print: eval time = 71929.76 ms / 234 runs ( 307.39 ms per token, 3.25 tokens per second)
llama_perf_context_print: total time = 660956.03 ms / 4646 tokens
Interrupted by user
With the optimization, the result is:
llama_perf_sampler_print: sampling time = 56.22 ms / 4717 runs ( 0.01 ms per token, 83901.03 tokens per second)
llama_perf_context_print: load time = 870.17 ms
llama_perf_context_print: prompt eval time = 574061.97 ms / 4415 tokens ( 130.03 ms per token, 7.69 tokens per second)
llama_perf_context_print: eval time = 71333.37 ms / 301 runs ( 236.99 ms per token, 4.22 tokens per second)
llama_perf_context_print: total time = 646281.74 ms / 4716 tokens
Interrupted by user
We can see a slight speedup in prefill and a clear speedup in decode (3.25 → 4.22 tokens per second, roughly 30% higher throughput)!
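As a rough back-of-the-envelope check on why decode benefits, the sketch below estimates the K/V cache traffic per decoded token under each loop order. The dimensions used (28 query heads, 4 KV heads, head size 128, F16 K/V cache, 8192 cached positions) are illustrative assumptions, not values read from the logs above:

# Hypothetical GQA dimensions, for illustration only
n_head     = 28                     # query heads
n_kv_head  = 4                      # KV heads
n_gqa      = n_head // n_kv_head    # query heads per KV head (7)
head_size  = 128
elem_bytes = 2                      # F16 K/V cache
ctx        = 8192                   # cached positions

kv_row_bytes = 2 * head_size * elem_bytes           # one K row plus one V row

per_head_order  = n_head    * ctx * kv_row_bytes    # K/V re-read for every query head
per_group_order = n_kv_head * ctx * kv_row_bytes    # K/V read once per group

print(per_head_order // 2**20, "MiB vs", per_group_order // 2**20, "MiB per token")
# -> 112 MiB vs 16 MiB: a factor of n_gqa (7x) less K/V data touched per token

This only bounds the attention part of decode; the feed-forward weights still account for most of the total memory traffic, which is consistent with the end-to-end gain being much smaller than 7x.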
Further work:
- flash decoding: in this PR, when n_kv_head is smaller than the thread count and there is only one concurrent request, this CPU kernel cannot use all the threads. We can solve this by using flash decoding (see the sketch after this list).
- load balancing between threads: in causal attention, the amount of computation differs between rows, but the current implementation doesn't take that into consideration, which slows down multi-threaded long-context prefill.
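A minimal sketch of the flash-decoding idea from the first bullet, assuming a single query row and plain Python lists: split the cached KV positions into chunks, let each thread compute a partial result with its own running max and normalizer, then merge the partials into the exact attention output. All function names here are placeholders for illustration, not existing llama.cpp code:

import math

def partial_attn(q, K, V, start, end):
    # one chunk of KV positions -> (running max, normalizer, unnormalized weighted V sum)
    m, s, acc = -math.inf, 0.0, [0.0] * len(V[0])
    for t in range(start, end):
        x = sum(qi * ki for qi, ki in zip(q, K[t]))   # dot(q, k_t); scaling omitted
        m_new = max(m, x)
        scale = math.exp(m - m_new)                   # rescale old accumulators
        w = math.exp(x - m_new)
        s = s * scale + w
        acc = [a * scale + w * v for a, v in zip(acc, V[t])]
        m = m_new
    return m, s, acc

def merge(parts):
    # combine per-chunk partials into softmax(q.K^T).V over the full sequence
    m = max(p[0] for p in parts)
    s, acc = 0.0, [0.0] * len(parts[0][2])
    for pm, ps, pacc in parts:
        scale = math.exp(pm - m)
        s += ps * scale
        acc = [a + pa * scale for a, pa in zip(acc, pacc)]
    return [a / s for a in acc]

# e.g. 4 workers, 8192 cached positions, 2048 positions per chunk:
# parts = [partial_attn(q, K, V, c, c + 2048) for c in range(0, 8192, 2048)]
# out   = merge(parts)

Each chunk can run on a different thread even when there is only one KV head and a single request, which is exactly the case the first bullet describes.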
My test environment:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
CPU family: 6
Model: 142
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 12
BogoMIPS: 4607.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves md_clear flush_l1d arch_capabilities
Virtualization features:
Hypervisor vendor: Microsoft
Virtualization type: full
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 8 MiB (1 instance)
Vulnerabilities:
Itlb multihit: KVM: Mitigation: VMX unsupported
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Srbds: Mitigation; TSX disabled
Tsx async abort: Not affected
Just tested this out of curiosity: Qwen 3 degrades in quality (ignores /no_think, for example), Mistral Small 2 outputs empty characters. Does it break compatibility with older models? Windows 10, i7-8700, CPU backend only.
Hi, thanks for your reply! It does not break compatibility with older models in theory, but there might be a small bug in my implementation. In my test it works with Qwen 2.5 7B. Can you tell me the size of the Qwen 3 model you used to test? I will test both Qwen 3 and Mistral to debug.
I've tested both 8b and 4b models in Q6, both worked correctly without this PR. Mistral Small 2 is in Q5_K_L, works correctly on main too.
Thanks, I have reproduced the same problem. I will try to fix it.
I have fixed the bug. Are there any scripts to format the code locally? This PR cannot pass the code lint check right now.
Thank you! I've already deleted Qwen models, unfortunately, but Mistral Small 2 generates text correctly now. I'll test it a bit more with other models, but so far it seems to be fixed.
On an i7-8700 with Mistral Small 3 (the 24B one, Q4_K_M) I get 2.08 t/s with this PR vs 1.97 t/s on current main.
Hm, I opened your PR in my editor and saw this:
I ran editorconfig-checker, removed the trailing whitespace, and ran it again; the error is gone. The line that was flagged was 7058 when running the checker locally instead of 7045 in CI:
Here's a patch that fixes it, in case you can't find the line:
From 3c7b2ed48acfcb5a9c06846ed0b548b3e48707af Mon Sep 17 00:00:00 2001
From: Excigma <[email protected]>
Date: Mon, 12 May 2025 15:46:03 +1200
Subject: [PATCH] style: remove trailing whitespace
---
ggml/src/ggml-cpu/ops.cpp | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index 250b6abc..a1481d9e 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -7055,7 +7055,7 @@ static void ggml_compute_forward_flash_attn_ext_f16(
const float * pq = (const float *) ((char *) q->data + (iq1*nbq1 + (iq2 + i_gqa)*nbq2 + iq3*nbq3));
q_to_vec_dot(pq, Q_q[i_gqa], DK);
-
+
const uint32_t h = iq2 + i_gqa;
slope[i_gqa] = (max_bias > 0.0f) ? h < n_head_log2 ? powf(m0, h + 1) : powf(m1, 2*(h - n_head_log2) + 1) : 1.0f;
}
--
2.49.0
Thanks for your reply! It's true that line 7045 has a problem; I will fix it.
@slaren Hi, would you mind reviewing this PR when you have time?
I’ve verified the changes pass all tests and followed the contribution guidelines. Happy to address any feedback!
Thanks for your time! 🙏
I tried this, but the performance I see on my system is not always better, and in some cases it is worse than master. The problem with testing with llama-cli is that it is very susceptible to random variations, and the results are not always repeatable. Instead, you can use the llama-bench and test-backend-ops tools to test the performance. It would be useful if you can provide results with these tools that show a clear improvement on your system.
You can try these commands to get started with these tools:
llama-bench -m <model.gguf> -fa 1 -p 128 -n 32 -d 0,128
test-backend-ops -b CPU -o FLASH_ATTN_EXT perf
Thanks for your patient guidance! I will try these commands.

