Misc. bug: Flash attention on Vulkan

Nindaleth opened this issue 9 months ago · 6 comments

Name and Version

$ ./build/bin/llama-cli --version
version: 4941 (ba932dfb)
built with cc (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7) for x86_64-redhat-linux

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

Other (Please specify in the next section)

Command line

./build/bin/llama-bench -ngl 99 -m models/Qwen2.5-Coder-14B-Instruct-Q4_K_L.gguf -fa 0,1
./build/bin/llama-bench -ngl 99 -m models/Qwen2.5-Coder-1.5B-Instruct-Q8_0.gguf -fa 0,1

Problem description & steps to reproduce

It seems that some flash attention (FA) operations are not yet handled by the Vulkan backend and fall back to the CPU. But I can't find any open issue on this, only several closed ones, so maybe it's just some model architectures, or just my GPU, that can't do it yet?

Using Mesa RADV 25.0.1 on Fedora Linux 41. AMD Radeon RX 6700 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --- | ---: |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | ROCm | 99 | 0 | pp512 | 408.32 ± 0.13 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | ROCm | 99 | 0 | tg128 | 27.97 ± 0.01 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | ROCm | 99 | 1 | pp512 | 352.63 ± 3.31 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | ROCm | 99 | 1 | tg128 | 26.74 ± 0.01 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | Vulkan | 99 | 0 | pp512 | 256.50 ± 0.14 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | Vulkan | 99 | 0 | tg128 | 34.37 ± 0.09 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | Vulkan | 99 | 1 | pp512 | 100.24 ± 1.31 |
| qwen2 14B Q4_K - Medium | 8.90 GiB | 14.77 B | Vulkan | 99 | 1 | tg128 | 22.07 ± 0.03 |

build: ba932dfb (4941)

It seems the performance loss is more pronounced with smaller models:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --- | ---: |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | ROCm | 99 | 0 | pp512 | 4682.96 ± 2.85 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | ROCm | 99 | 0 | tg128 | 113.17 ± 0.03 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | ROCm | 99 | 1 | pp512 | 3458.94 ± 3.00 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | ROCm | 99 | 1 | tg128 | 100.59 ± 0.22 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | Vulkan | 99 | 0 | pp512 | 2778.95 ± 4.85 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | Vulkan | 99 | 0 | tg128 | 140.45 ± 0.39 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | Vulkan | 99 | 1 | pp512 | 685.72 ± 5.62 |
| qwen2 1.5B Q8_0 | 1.53 GiB | 1.54 B | Vulkan | 99 | 1 | tg128 | 42.96 ± 0.83 |

build: ba932dfb (4941)

First Bad Commit

No response

Relevant log output


Nindaleth · Mar 23 '25

Flash attention in Vulkan is currently only implemented for Nvidia GPUs, only for legacy quants, and only on a Vulkan beta driver. Everywhere else it falls back to the CPU.
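
For context, ggml decides per operation whether a given backend can run it and routes unsupported operations to the CPU backend instead; that is why `-fa 1` still works on Vulkan here, just slowly, since the flash attention op runs on the CPU while everything else stays on the GPU. A minimal illustrative sketch of that decision, using hypothetical stand-in names rather than the actual ggml scheduler code, could look like this:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; the real logic lives in
// ggml's backend scheduler and each backend's supports_op callback.
struct Op { std::string name; };

static bool vulkan_supports_op(const Op & op) {
    // In this scenario the Vulkan backend reports FLASH_ATTN_EXT as
    // unsupported (outside the NV coopmat2 path), so it gets reassigned.
    return op.name != "FLASH_ATTN_EXT";
}

int main() {
    std::vector<Op> graph = { {"MUL_MAT"}, {"SOFT_MAX"}, {"FLASH_ATTN_EXT"}, {"ADD"} };

    for (const Op & op : graph) {
        const char * backend = vulkan_supports_op(op) ? "Vulkan" : "CPU (fallback)";
        std::printf("%-16s -> %s\n", op.name.c_str(), backend);
    }
    return 0;
}
```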

0cc4m · Mar 24 '25

OK, that's both bad and good to hear.

Is some work on this ongoing in private repos? I'm neither pushing anyone nor in a hurry; I'm good with "don't expect it this year" or any other realistic answer.

Nindaleth · Mar 24 '25

The problem is that the efficient Flash Attention algorithm relies on tensor core (MMA) acceleration, and access to that is very limited through the VK_KHR_cooperative_matrix extension. The VK_NV_cooperative_matrix2 extension solves most of these issues, which is why there's already an implementation for it, but that extension is only available on Nvidia and currently only with a beta driver.

It's possible to implement it (see for example https://github.com/etasnadi/VulkanCooperativeMatrixAttention), but it's complicated. I want to give it a shot eventually, but there are lots of more important things on my TODO list.
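
If anyone wants to check what their driver exposes, below is a minimal sketch (assuming a recent Vulkan SDK whose headers already include VK_KHR_cooperative_matrix) that lists the cooperative-matrix shapes and component types each device advertises. Whether any configurations show up at all depends on the GPU and driver.

```cpp
// Sketch: enumerate VK_KHR_cooperative_matrix configurations per device.
// Build (assumption: Vulkan SDK installed): g++ -std=c++17 coopmat_query.cpp -lvulkan
#include <vulkan/vulkan.h>

#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    VkApplicationInfo app{VK_STRUCTURE_TYPE_APPLICATION_INFO};
    app.apiVersion = VK_API_VERSION_1_3;

    VkInstanceCreateInfo ici{VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO};
    ici.pApplicationInfo = &app;

    VkInstance instance;
    if (vkCreateInstance(&ici, nullptr, &instance) != VK_SUCCESS) {
        std::fprintf(stderr, "failed to create Vulkan instance\n");
        return 1;
    }

    uint32_t dev_count = 0;
    vkEnumeratePhysicalDevices(instance, &dev_count, nullptr);
    std::vector<VkPhysicalDevice> devices(dev_count);
    vkEnumeratePhysicalDevices(instance, &dev_count, devices.data());

    auto get_props = (PFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR)
        vkGetInstanceProcAddr(instance, "vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR");

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(dev, &props);
        std::printf("%s:\n", props.deviceName);

        // The query is only valid if the device exposes the extension.
        uint32_t ext_count = 0;
        vkEnumerateDeviceExtensionProperties(dev, nullptr, &ext_count, nullptr);
        std::vector<VkExtensionProperties> exts(ext_count);
        vkEnumerateDeviceExtensionProperties(dev, nullptr, &ext_count, exts.data());
        bool has_coopmat = false;
        for (const VkExtensionProperties & e : exts) {
            if (std::strcmp(e.extensionName, VK_KHR_COOPERATIVE_MATRIX_EXTENSION_NAME) == 0) {
                has_coopmat = true;
            }
        }
        if (!has_coopmat || get_props == nullptr) {
            std::printf("  VK_KHR_cooperative_matrix not supported\n");
            continue;
        }

        uint32_t count = 0;
        get_props(dev, &count, nullptr);
        std::vector<VkCooperativeMatrixPropertiesKHR> cm(count);
        for (VkCooperativeMatrixPropertiesKHR & p : cm) {
            p.sType = VK_STRUCTURE_TYPE_COOPERATIVE_MATRIX_PROPERTIES_KHR;
        }
        get_props(dev, &count, cm.data());

        // Component types and scope are printed as raw enum values for brevity.
        for (const VkCooperativeMatrixPropertiesKHR & p : cm) {
            std::printf("  M=%u N=%u K=%u A=%d B=%d C=%d Result=%d scope=%d\n",
                        p.MSize, p.NSize, p.KSize,
                        (int) p.AType, (int) p.BType, (int) p.CType,
                        (int) p.ResultType, (int) p.scope);
        }
    }

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```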

0cc4m · Mar 24 '25

I saw some charts on VK_NV_cooperative_matrix2 in a Phoronix article; the prompt processing gains from it are incredible. Hopefully an eventual VK_KHR_* extension will emerge based on this.

My understanding is that this requires a specific hardware unit that (as far as AMD GPUs for common mortals go) is RDNA3+ only. So even if that lands, I won't be able to use it with my RX 6700 XT.

That gives me all the answers I was looking for, thanks a lot for responding! Feel free to close this issue or repurpose it as a generic feature request placeholder to prevent possible further duplicates.

Nindaleth · Mar 24 '25

You are mostly right, but I forgot to mention that it's also possible to implement the algorithm without any tensor/matrix core acceleration, as was done in this project with CUDA. You would benefit from that, but it's also a complicated implementation that I can't give any estimate on. As usual, if someone else wants to look into it, I'm happy to assist.
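
For reference, the trick that makes this feasible without matrix cores is the online-softmax accumulation over K/V tiles: keep a running maximum and a running denominator, and rescale the partial output whenever the maximum grows, so the full score row is never materialized. Below is a minimal single-query CPU sketch of that idea (illustrative only, not the llama.cpp CUDA or Vulkan implementation):

```cpp
// Single-query flash-attention sketch: process K/V in tiles, keeping a
// running max (m), running denominator (l) and an unnormalized accumulator.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

static std::vector<float> flash_attn_1q(const std::vector<float> & q,
                                        const std::vector<float> & K,
                                        const std::vector<float> & V,
                                        int N, int d, int tile) {
    const float scale = 1.0f / std::sqrt((float) d);
    float m = -INFINITY, l = 0.0f;
    std::vector<float> acc(d, 0.0f);

    for (int t0 = 0; t0 < N; t0 += tile) {
        const int t1 = std::min(t0 + tile, N);

        // Scores for this tile and the tile-local maximum.
        std::vector<float> s(t1 - t0);
        float m_tile = -INFINITY;
        for (int j = t0; j < t1; ++j) {
            float dot = 0.0f;
            for (int i = 0; i < d; ++i) dot += q[i] * K[j*d + i];
            s[j - t0] = dot * scale;
            m_tile = std::max(m_tile, s[j - t0]);
        }

        // Rescale the previous partial results to the new running maximum.
        const float m_new = std::max(m, m_tile);
        const float corr  = std::exp(m - m_new);
        for (int i = 0; i < d; ++i) acc[i] *= corr;
        l *= corr;

        // Accumulate this tile's contribution.
        for (int j = t0; j < t1; ++j) {
            const float p = std::exp(s[j - t0] - m_new);
            l += p;
            for (int i = 0; i < d; ++i) acc[i] += p * V[j*d + i];
        }
        m = m_new;
    }

    for (int i = 0; i < d; ++i) acc[i] /= l;
    return acc;
}

int main() {
    const int N = 256, d = 64, tile = 32;
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    std::vector<float> q(d), K(N*d), V(N*d);
    for (float & x : q) x = dist(rng);
    for (float & x : K) x = dist(rng);
    for (float & x : V) x = dist(rng);

    std::vector<float> out = flash_attn_1q(q, K, V, N, d, tile);
    std::printf("out[0..3] = %f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```

The same structure maps onto a GPU kernel by assigning tiles to workgroups and keeping the accumulator in registers or shared memory.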

0cc4m · Mar 24 '25

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] · May 09 '25

This was fixed in #13324!

Nindaleth · May 10 '25