Misc. bug: Flash attention on Vulkan
Name and Version
$ ./build/bin/llama-cli --version
version: 4941 (ba932dfb) built with cc (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7) for x86_64-redhat-linux
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
Other (Please specify in the next section)
Command line
./build/bin/llama-bench -ngl 99 -m models/Qwen2.5-Coder-14B-Instruct-Q4_K_L.gguf -fa 0,1
./build/bin/llama-bench -ngl 99 -m models/Qwen2.5-Coder-1.5B-Instruct-Q8_0.gguf -fa 0,1
Problem description & steps to reproduce
It seems that some flash attention (FA) operations are not yet handled by the Vulkan backend and fall back to the CPU. I can't find any open issue about this, only several closed ones, so maybe it's only certain model architectures, or just my GPU, that can't use it yet?
Using Mesa RADV 25.0.1 on Fedora Linux 41. GPU: AMD Radeon RX 6700 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | ROCm | 99 | 0 | pp512 | 408.32 ± 0.13 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | ROCm | 99 | 0 | tg128 | 27.97 ± 0.01 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | ROCm | 99 | 1 | pp512 | 352.63 ± 3.31 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | ROCm | 99 | 1 | tg128 | 26.74 ± 0.01 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | Vulkan | 99 | 0 | pp512 | 256.50 ± 0.14 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | Vulkan | 99 | 0 | tg128 | 34.37 ± 0.09 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | Vulkan | 99 | 1 | pp512 | 100.24 ± 1.31 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | Vulkan | 99 | 1 | tg128 | 22.07 ± 0.03 |
build: ba932dfb (4941)
The performance loss with FA enabled seems to be more pronounced with smaller models:
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | ROCm | 99 | 0 | pp512 | 4682.96 ± 2.85 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | ROCm | 99 | 0 | tg128 | 113.17 ± 0.03 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | ROCm | 99 | 1 | pp512 | 3458.94 ± 3.00 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | ROCm | 99 | 1 | tg128 | 100.59 ± 0.22 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | Vulkan | 99 | 0 | pp512 | 2778.95 ± 4.85 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | Vulkan | 99 | 0 | tg128 | 140.45 ± 0.39 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | Vulkan | 99 | 1 | pp512 | 685.72 ± 5.62 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | Vulkan | 99 | 1 | tg128 | 42.96 ± 0.83 |
build: ba932dfb (4941)
First Bad Commit
No response
Flash attention in the Vulkan backend is currently only implemented for Nvidia GPUs, only for legacy quants, and only on a Vulkan beta driver. Everywhere else it falls back to the CPU.
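For context on what "falls back to CPU" means here: the ggml scheduler asks each backend whether it supports a given op, and anything the Vulkan backend declines is run on the CPU backend instead. Below is a minimal, hypothetical C++ sketch of that kind of capability check; the names are simplified and this is not the actual llama.cpp code, just the principle.

```cpp
// Hypothetical sketch of a backend capability check. The real llama.cpp Vulkan
// backend is far more involved, but the fallback principle is the same: ops the
// backend reports as unsupported get scheduled on the CPU backend.
#include <cstdio>

enum op_type { OP_MUL_MAT, OP_FLASH_ATTN_EXT, OP_SOFT_MAX };

struct device_caps {
    bool has_coopmat;      // VK_KHR_cooperative_matrix
    bool has_coopmat2_nv;  // VK_NV_cooperative_matrix2 (Nvidia beta driver)
};

// Returns true if this (hypothetical) Vulkan backend can run the op on the GPU.
static bool vk_supports_op(const device_caps & caps, op_type op) {
    switch (op) {
        case OP_FLASH_ATTN_EXT:
            // At the time of this issue, only the coopmat2 path existed.
            return caps.has_coopmat2_nv;
        default:
            return true;
    }
}

int main() {
    device_caps radv_rdna2 = { /*has_coopmat=*/true, /*has_coopmat2_nv=*/false };
    op_type graph[] = { OP_MUL_MAT, OP_FLASH_ATTN_EXT, OP_SOFT_MAX };
    for (op_type op : graph) {
        // Unsupported ops land on the CPU backend, which is why -fa 1 can end
        // up slower than -fa 0 on this GPU.
        std::printf("op %d -> %s\n", (int) op,
                    vk_supports_op(radv_rdna2, op) ? "Vulkan" : "CPU fallback");
    }
    return 0;
}
```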
OK, that's both bad and good to hear.
Is some work on this ongoing in private repos? I'm neither pushing anyone nor in a hurry; I'm fine with "don't expect it this year" or any other realistic answer.
The problem is that the efficient Flash Attention algorithm is based on tensor core (MMA) acceleration, access to which is very limited in the VK_KHR_cooperative_matrix extension. The VK_NV_cooperative_matrix2 extension solves most of these issues, that's why there's already an implementation for it, but you can only use that extension with Nvidia and currently only with a beta driver.
It's possible to implement it (see for example https://github.com/etasnadi/VulkanCooperativeMatrixAttention), but it's complicated. I want to give it a shot eventually, but there are a lot of more important things on my TODO list.
I saw some charts on VK_NV_cooperative_matrix2 in a Phoronix article; the prompt processing gains from it are incredible. Hopefully an eventual VK_KHR_* extension will emerge based on it.
My understanding is that this requires a specific hardware unit that (as far as consumer AMD GPUs go) is RDNA3+ only, so even if that lands, I won't be able to use it with my RX 6700 XT.
That gives me all the answers I was looking for, thanks a lot for responding! Feel free to close this issue or repurpose it as a generic feature request placeholder to prevent possible further duplicates.
You are mostly right, but it's also possible to implement the algorithm without any tensor/matrix core acceleration, as was done in this project with CUDA. I forgot to mention that. You would benefit from that, but it's also a complicated implementation that I can't give any estimate on. As usual, if someone else wants to look into it, I'm happy to assist.
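To illustrate what a non-MMA path would look like: here is a minimal, hypothetical C++ sketch of the streaming-softmax (FlashAttention-style) loop for a single query row, using only scalar math. It is not the llama.cpp CUDA or Vulkan implementation, just the core recurrence a shader without cooperative matrices would have to express; real kernels tile this over blocks of keys/values and process many queries in parallel.

```cpp
// Hypothetical scalar sketch of the FlashAttention-style streaming softmax for
// one query vector. No tensor/matrix cores required: only dot products and an
// online softmax (running max + running sum) so the full score row is never
// materialized.
#include <cmath>
#include <vector>

// q: [d], K: [n][d], V: [n][d], out: [d]
void flash_attn_row(const std::vector<float> & q,
                    const std::vector<std::vector<float>> & K,
                    const std::vector<std::vector<float>> & V,
                    std::vector<float> & out, float scale) {
    const size_t d = q.size();
    const size_t n = K.size();
    float m = -INFINITY;             // running max of the scores
    float l = 0.0f;                  // running sum of exp(score - m)
    std::vector<float> acc(d, 0.0f); // running weighted sum of V rows

    for (size_t i = 0; i < n; ++i) {
        // score_i = scale * dot(q, K[i]) -- these dot products are exactly what
        // cooperative matrices would batch into tile matmuls.
        float s = 0.0f;
        for (size_t k = 0; k < d; ++k) s += q[k] * K[i][k];
        s *= scale;

        // Online softmax update: rescale previous accumulators if a new max appears.
        const float m_new = std::max(m, s);
        const float corr  = std::exp(m - m_new); // exp(-inf) == 0 on the first step
        const float p     = std::exp(s - m_new);
        for (size_t k = 0; k < d; ++k) acc[k] = acc[k] * corr + p * V[i][k];
        l = l * corr + p;
        m = m_new;
    }

    out.resize(d);
    for (size_t k = 0; k < d; ++k) out[k] = acc[k] / l;
}
```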
This issue was closed because it has been inactive for 14 days since being marked as stale.
This was fixed in #13324!