Misc. bug: llama-cli (Vulkan backend) outputs gibberish with an old Vulkan SDK
Name and Version
```
$ ./build/bin/llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA RTX A2000 12GB (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: none
version: 5162 (2016f07b)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
```
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
```
./build/bin/llama-bench -m /path/to/model -ngl 99
```
Problem description & steps to reproduce
I am building llama.cpp with the Vulkan backend. The Vulkan SDK I am using is a relatively old one (1.3.239.0) that does not define the `VK_KHR_cooperative_matrix` macro, but the extension is supported by the underlying driver, so there is a compile-time/runtime mismatch. In this case llama-cli outputs gibberish when answering any question. This can be worked around by setting `GGML_VK_DISABLE_COOPMAT` explicitly, but in my understanding, applying the following patch should be more robust:
```diff
             pipeline_robustness = true;
         } else if (strcmp("VK_EXT_subgroup_size_control", properties.extensionName) == 0) {
             device->subgroup_size_control = true;
+#if defined(VK_KHR_cooperative_matrix)
         } else if (strcmp("VK_KHR_cooperative_matrix", properties.extensionName) == 0 &&
                    !getenv("GGML_VK_DISABLE_COOPMAT")) {
             device->coopmat_support = true;
             device->coopmat_m = 0;
             device->coopmat_n = 0;
             device->coopmat_k = 0;
+#endif
         } else if (strcmp("VK_NV_cooperative_matrix2", properties.extensionName) == 0 &&
                    !getenv("GGML_VK_DISABLE_COOPMAT2")) {
             coopmat2_support = true;
```
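For reference, a minimal standalone sketch (my own illustration, not ggml code) of how the override behaves per the diff above: the backend only checks whether `GGML_VK_DISABLE_COOPMAT` is present in the environment, so any value disables cooperative-matrix support.

```cpp
// Standalone illustration of the env-var override used in the diff
// above. getenv() only tests for presence: exporting
// GGML_VK_DISABLE_COOPMAT with any value (even an empty string)
// disables cooperative-matrix support.
#include <cstdio>
#include <cstdlib>

int main() {
    bool coopmat_support = true; // assume the driver advertises the extension
    if (std::getenv("GGML_VK_DISABLE_COOPMAT") != nullptr) {
        coopmat_support = false; // user override wins over driver detection
    }
    std::printf("coopmat_support = %s\n", coopmat_support ? "true" : "false");
    return 0;
}
```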
Also, comparing llama-bench results, prefill performance for the gibberish build is much faster than for the correct one. Sorry, I am a newbie here, so I am just wondering: are the GPU workloads issued by the two builds completely different? Or does this point to a potential performance-optimization opportunity?
First Bad Commit
No response
Relevant log output
Getting the same issue on my side.

@franklei01 Do you mean that setting the env variable below to 1 before running llama-cli makes the Vulkan output correct? `GGML_VK_DISABLE_COOPMAT`
Yes, or apply the patch above.
But locally, when I tested llama-bench with this env variable set, prefill performance dropped badly:
```
root@localhost:~/build-vulkan/bin# taskset -c 0,5,6,7,8,9,10,11 ./llama-bench -m Qwen2.5-3B-Instruct-Q4_0.gguf -pg 128,128 -t 8
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none
```
| model | size | params | backend | ngl | threads | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | Vulkan | 99 | 8 | pp512 | 246.43 ± 0.26 |
| qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | Vulkan | 99 | 8 | tg128 | 15.31 ± 0.11 |
| qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | Vulkan | 99 | 8 | pp128+tg128 | 28.44 ± 0.05 |
```
build: 2c3f8b85 (5002)

root@localhost:~/build-vulkan/bin# export GGML_VK_DISABLE_COOPMAT=1
root@localhost:~/build-vulkan/bin# taskset -c 0,5,6,7,8,9,10,11 ./llama-bench -m Qwen2.5-3B-Instruct-Q4_0.gguf -pg 128,128 -t 8
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none
```
| model | size | params | backend | ngl | threads | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | Vulkan | 99 | 8 | pp512 | 2.36 ± 0.00 |
| qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | Vulkan | 99 | 8 | tg128 | 15.04 ± 0.52 |
| qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | Vulkan | 99 | 8 | pp128+tg128 | 3.63 ± 0.00 |
```
build: 2c3f8b85 (5002)
```
In this case, when the mat-mul shaders are built, some specialization constants (TM/TN/TK) are set to 0, so most of the compute logic in the mat-mul shaders is skipped. I think that is why the performance looks so good (but the results are wrong).
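A minimal C++ analogue (my illustration, not the actual GLSL shader) of why zero tile sizes make the kernel fast but wrong: when the loop bound comes from a specialization constant that is 0, the loop body never executes, so the kernel finishes almost instantly while the output is never computed.

```cpp
// TK stands in for a tile-size specialization constant. With TK == 0
// the loop has zero trips: the accumulator is never updated, so the
// "kernel" finishes almost instantly but produces a wrong result.
#include <cstdio>

float dot_tiled(const float* a, const float* b, int TK) {
    float acc = 0.0f;
    for (int k = 0; k < TK; ++k) {
        acc += a[k] * b[k];
    }
    return acc;
}

int main() {
    const float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float b[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    std::printf("TK=4 -> %.1f (correct)\n",        dot_tiled(a, b, 4)); // 10.0
    std::printf("TK=0 -> %.1f (fast but wrong)\n", dot_tiled(a, b, 0)); // 0.0
    return 0;
}
```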
You could try enabling `VK_KHR_cooperative_matrix` with the latest Vulkan headers and glslc to improve performance.
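If you want to check whether the installed headers are new enough before rebuilding, a quick compile-time probe (just a sketch, not part of llama.cpp) is to test for the extension macro:

```cpp
// Compile-time probe: fails to compile if the installed Vulkan
// headers predate VK_KHR_cooperative_matrix (the 1.3.239.0 SDK from
// this report is one such case). If it compiles, ggml can be built
// with coopmat support enabled.
#include <vulkan/vulkan.h>

#ifndef VK_KHR_cooperative_matrix
#error "Vulkan headers too old: VK_KHR_cooperative_matrix not defined; update the Vulkan SDK"
#endif

int main() { return 0; }
```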
This issue was closed because it has been inactive for 14 days since being marked as stale.