
Misc. bug: llama-cli (Vulkan backend) outputs gibberish with old Vulkan SDK

Open · franklei01 opened this issue 8 months ago · 5 comments

Name and Version

```
./build/bin/llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA RTX A2000 12GB (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: none
version: 5162 (2016f07b)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
```

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

```
./build/bin/llama-bench -m /path/to/model -ngl 99
```

Problem description & steps to reproduce

I am building llama.cpp with the Vulkan backend. The Vulkan SDK I am using is a relatively old one (1.3.239.0) that does not define the VK_KHR_cooperative_matrix macro, but the extension is supported by the underlying driver, so there is a mismatch between what was compiled in and what the driver reports. In this case, llama-cli outputs gibberish when answering any question. This can be worked around by setting GGML_VK_DISABLE_COOPMAT explicitly, but in my understanding, applying the following patch would be more robust:

```diff
                 pipeline_robustness = true;
             } else if (strcmp("VK_EXT_subgroup_size_control", properties.extensionName) == 0) {
                 device->subgroup_size_control = true;
+#if defined(VK_KHR_cooperative_matrix)
             } else if (strcmp("VK_KHR_cooperative_matrix", properties.extensionName) == 0 &&
                        !getenv("GGML_VK_DISABLE_COOPMAT")) {
                 device->coopmat_support = true;
                 device->coopmat_m = 0;
                 device->coopmat_n = 0;
                 device->coopmat_k = 0;
+#endif
             } else if (strcmp("VK_NV_cooperative_matrix2", properties.extensionName) == 0 &&
                        !getenv("GGML_VK_DISABLE_COOPMAT2")) {
                 coopmat2_support = true;
```
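
To illustrate the mismatch (a minimal hypothetical sketch, not the actual ggml-vulkan code): the `#if defined(VK_KHR_cooperative_matrix)` guard is a compile-time check against the Vulkan headers, while `properties.extensionName` comes from the driver at run time, so the two can disagree:

```cpp
// Minimal sketch of compile-time (header) vs. run-time (driver) extension support.
// Hypothetical helper, not part of llama.cpp; needs only the Vulkan headers.
#include <vulkan/vulkan.h>
#include <cstdio>
#include <cstring>
#include <vector>

static bool driver_reports_coopmat(VkPhysicalDevice dev) {
    uint32_t count = 0;
    vkEnumerateDeviceExtensionProperties(dev, nullptr, &count, nullptr);
    std::vector<VkExtensionProperties> exts(count);
    vkEnumerateDeviceExtensionProperties(dev, nullptr, &count, exts.data());
    for (const auto & e : exts) {
        // String literal on purpose: the VK_KHR_COOPERATIVE_MATRIX_EXTENSION_NAME
        // macro does not exist in headers as old as 1.3.239.0.
        if (strcmp(e.extensionName, "VK_KHR_cooperative_matrix") == 0) {
            return true;
        }
    }
    return false;
}

static void check_coopmat_mismatch(VkPhysicalDevice dev) {
#if defined(VK_KHR_cooperative_matrix)
    bool headers_know = true;   // SDK headers declare the extension
#else
    bool headers_know = false;  // old SDK: shaders cannot be built against it
#endif
    if (driver_reports_coopmat(dev) && !headers_know) {
        printf("mismatch: driver advertises coopmat, but this build cannot use it correctly\n");
    }
}
```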

Also, comparing llama-bench results, prefill performance for the gibberish version is much faster than for the non-gibberish one. Sorry, I am a newbie here, so I'm just wondering: are the GPU workloads issued by the two versions entirely different, or does this indicate a potential performance optimization opportunity?

First Bad Commit

No response

Relevant log output


franklei01 · Apr 21 '25

I'm seeing the same issue on my side.

@franklei01 Do you mean that setting GGML_VK_DISABLE_COOPMAT=1 in the environment before running llama-cli makes the Vulkan output correct?

andyt9527 · Apr 22 '25

> I'm seeing the same issue on my side.
>
> @franklei01 Do you mean that setting GGML_VK_DISABLE_COOPMAT=1 in the environment before running llama-cli makes the Vulkan output correct?

Yes, or apply the patch above.
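
(For reference, ggml-vulkan only checks whether the variable is present, via the `getenv` call shown in the patch, so launching like `GGML_VK_DISABLE_COOPMAT=1 ./build/bin/llama-cli -m /path/to/model -ngl 99` is enough; the model path here is a placeholder.)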

franklei01 · Apr 23 '25

> I'm seeing the same issue on my side. @franklei01 Do you mean that setting GGML_VK_DISABLE_COOPMAT=1 in the environment before running llama-cli makes the Vulkan output correct?
>
> Yes, or apply the patch above.

But locally, when I tested llama-bench with this variable set, prefill performance dropped badly:

andyt9527 · Apr 23 '25

```
root@localhost:~/build-vulkan/bin# taskset -c 0,5,6,7,8,9,10,11 ./llama-bench -m Qwen2.5-3B-Instruct-Q4_0.gguf -pg 128,128 -t 8
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none

| model         |     size | params | backend | ngl | threads |        test |           t/s |
| qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | Vulkan  |  99 |       8 |       pp512 | 246.43 ± 0.26 |
| qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | Vulkan  |  99 |       8 |       tg128 |  15.31 ± 0.11 |
| qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | Vulkan  |  99 |       8 | pp128+tg128 |  28.44 ± 0.05 |

build: 2c3f8b85 (5002)

root@localhost:~/build-vulkan/bin# export GGML_VK_DISABLE_COOPMAT=1
root@localhost:~/build-vulkan/bin# taskset -c 0,5,6,7,8,9,10,11 ./llama-bench -m Qwen2.5-3B-Instruct-Q4_0.gguf -pg 128,128 -t 8
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none

| model         |     size | params | backend | ngl | threads |        test |           t/s |
| qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | Vulkan  |  99 |       8 |       pp512 |   2.36 ± 0.00 |
| qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | Vulkan  |  99 |       8 |       tg128 |  15.04 ± 0.52 |
| qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | Vulkan  |  99 |       8 | pp128+tg128 |   3.63 ± 0.00 |

build: 2c3f8b85 (5002)
```

andyt9527 · Apr 23 '25

In this case, when the mat-mul shaders are built, some specialization constants (TM/TN/TK) are set to 0, so most of the compute logic in the mat-mul shader is skipped. I think that's why the performance looks so good (but the results are wrong).
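
As a rough illustration (a hypothetical analogue, not the actual shader source): when the tile sizes come through as 0, the tiling loops run zero iterations, so each dispatch finishes almost immediately while computing nothing:

```cpp
// Hypothetical host-side analogue of a mat-mul tile loop whose tile sizes
// (specialization constants TM/TN/TK) were accidentally specialized to 0.
constexpr unsigned TM = 0, TN = 0, TK = 0; // normally nonzero tile dimensions

void matmul_tile(float * /*out*/, const float * /*a*/, const float * /*b*/) {
    for (unsigned m = 0; m < TM; ++m) {         // 0 iterations: body never runs
        for (unsigned n = 0; n < TN; ++n) {     // unreachable
            for (unsigned k = 0; k < TK; ++k) {
                // multiply-accumulate would happen here
            }
        }
    }
    // The "work" completes instantly, which matches the ~100x faster pp512
    // numbers above, while the output tensor is never actually computed.
}
```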

You could try enabling VK_KHR_cooperative_matrix with the latest Vulkan headers and glslc to improve performance.
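
One way to sanity-check the toolchain (these are assumptions about your setup, not something from this thread): grep the SDK's vulkan_core.h for VK_KHR_cooperative_matrix, and confirm the glslc used to compile the shaders accepts the GL_KHR_cooperative_matrix GLSL extension; recent SDKs bundle a capable glslc.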

franklei01 · Apr 24 '25

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] · Jun 08 '25