Misc. bug: Flash attention on Vulkan
Name and Version
$ ./build/bin/llama-cli --version
version: 4941 (ba932dfb) built with cc (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7) for x86_64-redhat-linux
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
Other (Please specify in the next section)
Command line
./build/bin/llama-bench -ngl 99 -m models/Qwen2.5-Coder-14B-Instruct-Q4_K_L.gguf -fa 0,1
./build/bin/llama-bench -ngl 99 -m models/Qwen2.5-Coder-1.5B-Instruct-Q8_0.gguf -fa 0,1
Problem description & steps to reproduce
It seems that some flash attention (FA) operations are not yet handled by the Vulkan backend and fall back to the CPU. I can't find any open issue about this, only several closed ones, so maybe it's only certain model architectures, or just my GPU, that can't use it yet?
Using Mesa RADV 25.0.1 on Fedora Linux 41. GPU: AMD Radeon RX 6700 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | ROCm | 99 | 0 | pp512 | 408.32 ± 0.13 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | ROCm | 99 | 0 | tg128 | 27.97 ± 0.01 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | ROCm | 99 | 1 | pp512 | 352.63 ± 3.31 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | ROCm | 99 | 1 | tg128 | 26.74 ± 0.01 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | Vulkan | 99 | 0 | pp512 | 256.50 ± 0.14 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | Vulkan | 99 | 0 | tg128 | 34.37 ± 0.09 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | Vulkan | 99 | 1 | pp512 | 100.24 ± 1.31 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | Vulkan | 99 | 1 | tg128 | 22.07 ± 0.03 |
build: ba932dfb (4941)
The performance loss with FA enabled seems to be more pronounced with smaller models:
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | ROCm | 99 | 0 | pp512 | 4682.96 ± 2.85 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | ROCm | 99 | 0 | tg128 | 113.17 ± 0.03 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | ROCm | 99 | 1 | pp512 | 3458.94 ± 3.00 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | ROCm | 99 | 1 | tg128 | 100.59 ± 0.22 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | Vulkan | 99 | 0 | pp512 | 2778.95 ± 4.85 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | Vulkan | 99 | 0 | tg128 | 140.45 ± 0.39 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | Vulkan | 99 | 1 | pp512 | 685.72 ± 5.62 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | Vulkan | 99 | 1 | tg128 | 42.96 ± 0.83 |
build: ba932dfb (4941)
First Bad Commit
No response
Flash attention in the Vulkan backend is currently only implemented for Nvidia GPUs, only for legacy quants, and only on a Vulkan beta driver. Everywhere else it falls back to the CPU.
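For context on what "falls back to CPU" means here: the ggml scheduler asks each backend whether it supports a given op, and anything the Vulkan backend declines is run on the CPU backend instead. Below is a minimal, hypothetical C++ sketch of that kind of capability check; the names are simplified and this is not the actual llama.cpp code, just the principle.

```cpp
// Hypothetical sketch of a backend capability check. The real llama.cpp Vulkan
// backend is far more involved, but the fallback principle is the same: ops the
// backend reports as unsupported get scheduled on the CPU backend.
#include <cstdio>

enum op_type { OP_MUL_MAT, OP_FLASH_ATTN_EXT, OP_SOFT_MAX };

struct device_caps {
    bool has_coopmat;      // VK_KHR_cooperative_matrix
    bool has_coopmat2_nv;  // VK_NV_cooperative_matrix2 (Nvidia beta driver)
};

// Returns true if this (hypothetical) Vulkan backend can run the op on the GPU.
static bool vk_supports_op(const device_caps & caps, op_type op) {
    switch (op) {
        case OP_FLASH_ATTN_EXT:
            // At the time of this issue, only the coopmat2 path existed.
            return caps.has_coopmat2_nv;
        default:
            return true;
    }
}

int main() {
    device_caps radv_rdna2 = { /*has_coopmat=*/true, /*has_coopmat2_nv=*/false };
    op_type graph[] = { OP_MUL_MAT, OP_FLASH_ATTN_EXT, OP_SOFT_MAX };
    for (op_type op : graph) {
        // Unsupported ops land on the CPU backend, which is why -fa 1 can end
        // up slower than -fa 0 on this GPU.
        std::printf("op %d -> %s\n", (int) op,
                    vk_supports_op(radv_rdna2, op) ? "Vulkan" : "CPU fallback");
    }
    return 0;
}
```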
OK, that's both bad and good to hear.
Is some work on this ongoing in private repos? I'm neither pushing anyone nor in a hurry; I'm fine with "don't expect it this year" or any other realistic answer.
The problem is that the efficient Flash Attention algorithm is based on tensor core (MMA) acceleration, access to which is very limited in the VK_KHR_cooperative_matrix extension. The VK_NV_cooperative_matrix2 extension solves most of these issues, that's why there's already an implementation for it, but you can only use that extension with Nvidia and currently only with a beta driver.
It's possible to implement it (see for example https://github.com/etasnadi/VulkanCooperativeMatrixAttention), but it's complicated. I want to give it a shot eventually, but there are a lot of more important things on my TODO list.
I saw some charts on VK_NV_cooperative_matrix2 in a Phoronix article; the prompt processing gains from it are incredible. Hopefully an eventual VK_KHR_* extension will emerge based on it.
My understanding is that this requires a specific hardware unit that (as far as consumer AMD GPUs go) is RDNA3+ only, so even if that lands, I won't be able to use it with my RX 6700 XT.
That gives me all the answers I was looking for, thanks a lot for responding! Feel free to close this issue or repurpose it as a generic feature request placeholder to prevent possible further duplicates.
You are mostly right, but it's also possible to implement the algorithm without any tensor/matrix core acceleration, as was done in this project with CUDA. I forgot to mention that. You would benefit from that, but it's also a complicated implementation that I can't give any estimate on. As usual, if someone else wants to look into it, I'm happy to assist.
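To illustrate what a non-MMA path would look like: here is a minimal, hypothetical C++ sketch of the streaming-softmax (FlashAttention-style) loop for a single query row, using only scalar math. It is not the llama.cpp CUDA or Vulkan implementation, just the core recurrence a shader without cooperative matrices would have to express; real kernels tile this over blocks of keys/values and process many queries in parallel.

```cpp
// Hypothetical scalar sketch of the FlashAttention-style streaming softmax for
// one query vector. No tensor/matrix cores required: only dot products and an
// online softmax (running max + running sum) so the full score row is never
// materialized.
#include <cmath>
#include <vector>

// q: [d], K: [n][d], V: [n][d], out: [d]
void flash_attn_row(const std::vector<float> & q,
                    const std::vector<std::vector<float>> & K,
                    const std::vector<std::vector<float>> & V,
                    std::vector<float> & out, float scale) {
    const size_t d = q.size();
    const size_t n = K.size();
    float m = -INFINITY;             // running max of the scores
    float l = 0.0f;                  // running sum of exp(score - m)
    std::vector<float> acc(d, 0.0f); // running weighted sum of V rows

    for (size_t i = 0; i < n; ++i) {
        // score_i = scale * dot(q, K[i]) -- these dot products are exactly what
        // cooperative matrices would batch into tile matmuls.
        float s = 0.0f;
        for (size_t k = 0; k < d; ++k) s += q[k] * K[i][k];
        s *= scale;

        // Online softmax update: rescale previous accumulators if a new max appears.
        const float m_new = std::max(m, s);
        const float corr  = std::exp(m - m_new); // exp(-inf) == 0 on the first step
        const float p     = std::exp(s - m_new);
        for (size_t k = 0; k < d; ++k) acc[k] = acc[k] * corr + p * V[i][k];
        l = l * corr + p;
        m = m_new;
    }

    out.resize(d);
    for (size_t k = 0; k < d; ++k) out[k] = acc[k] / l;
}
```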
This issue was closed because it has been inactive for 14 days since being marked as stale.
This was fixed in #13324!