Misc. bug: Vulkan output is gibberish

Open mimi89999 opened this issue 1 month ago • 15 comments

Name and Version

$ ./llama-cli --version
load_backend: loaded RPC backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-cpu-icelake.so
version: 6989 (eeee367de)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

./llama-cli -hf ggml-org/SmolLM3-3B-GGUF -p "Hello"

Problem description & steps to reproduce

The output I got is gibberish:

user
Hello
assistant
upported D F;

 D R P23D323D4 PP
D3D3DDD
> 

When using the CPU backend by setting -ngl 0, or simply by removing the Vulkan module, my output is correct.
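
For reference, the CPU-only run that produces correct output is just the same command with offloading disabled (a sketch based on the command above, with no other flags changed):

./llama-cli -hf ggml-org/SmolLM3-3B-GGUF -p "Hello" -ngl 0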

First Bad Commit

No response

Relevant log output

load_backend: loaded RPC backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /home/michel/llm/llama-b6989-bin-ubuntu-vulkan-x64/build/bin/libggml-cpu-icelake.so
* Host huggingface.co:443 was resolved.
* IPv6: 2600:9000:2436:1200:17:b174:6d00:93a1, 2600:9000:2436:1800:17:b174:6d00:93a1, 2600:9000:2436:4e00:17:b174:6d00:93a1, 2600:9000:2436:b400:17:b174:6d00:93a1, 2600:9000:2436:8000:17:b174:6d00:93a1, 2600:9000:2436:ba00:17:b174:6d00:93a1, 2600:9000:2436:c200:17:b174:6d00:93a1, 2600:9000:2436:5800:17:b174:6d00:93a1
* IPv4: 108.138.51.26, 108.138.51.21, 108.138.51.8, 108.138.51.41
*   Trying [2600:9000:2436:1200:17:b174:6d00:93a1]:443...
* ALPN: curl offers h2,http/1.1
* SSL Trust Anchors:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
*   CApath: /etc/ssl/certs
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 / X25519MLKEM768 / RSASSA-PSS
* ALPN: server accepted h2
* Server certificate:
*   subject: CN=huggingface.co
*   start date: Apr 13 00:00:00 2025 GMT
*   expire date: May 12 23:59:59 2026 GMT
*   issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M02
*   Certificate level 0: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 1: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 2: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   subjectAltName: "huggingface.co" matches cert's "huggingface.co"
* SSL certificate verified via OpenSSL.
* Established connection to huggingface.co (2600:9000:2436:1200:17:b174:6d00:93a1 port 443) from 2606:4700:110:886d:e733:73d2:1a78:aed9 port 60996 
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://huggingface.co/v2/ggml-org/SmolLM3-3B-GGUF/manifests/latest
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: huggingface.co]
* [HTTP/2] [1] [:path: /v2/ggml-org/SmolLM3-3B-GGUF/manifests/latest]
* [HTTP/2] [1] [user-agent: llama-cpp]
* [HTTP/2] [1] [accept: application/json]
> GET /v2/ggml-org/SmolLM3-3B-GGUF/manifests/latest HTTP/2
Host: huggingface.co
User-Agent: llama-cpp
Accept: application/json

* Request completely sent off
< HTTP/2 200 
< content-type: application/json; charset=utf-8
< content-length: 976
< date: Sat, 08 Nov 2025 15:37:19 GMT
< etag: W/"3d0-7FgnnKEkDoOon2Kc6uh711e78Lk"
< x-powered-by: huggingface-moon
< x-request-id: Root=1-690f63af-7772639f425bcd6d7e052116
< ratelimit: "pages";r=99;t=221
< ratelimit-policy: "fixed window";"pages";q=100;w=300
< cross-origin-opener-policy: same-origin
< referrer-policy: strict-origin-when-cross-origin
< access-control-max-age: 86400
< access-control-allow-origin: https://huggingface.co
< vary: Origin
< access-control-expose-headers: X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range,X-Linked-Size,X-Linked-ETag,X-Xet-Hash
< x-cache: Miss from cloudfront
< via: 1.1 ca098aee4fd72030e464a2f263541478.cloudfront.net (CloudFront)
< x-amz-cf-pop: WAW51-P2
< x-amz-cf-id: RjEdqRQiQiUA1VhiHfocZqNbeoaA_q5hf8kQzchqFeO1bKbP4zg3pQ==
< 
* Connection #0 to host huggingface.co:443 left intact
common_download_file_single_online: using cached file: /home/michel/.cache/llama.cpp/ggml-org_SmolLM3-3B-GGUF_SmolLM3-Q4_K_M.gguf
build: 6989 (eeee367de) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Intel(R) Iris(R) Xe Graphics (TGL GT2)) (0000:00:02.0) - 6861 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 326 tensors from /home/michel/.cache/llama.cpp/ggml-org_SmolLM3-3B-GGUF_SmolLM3-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = smollm3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 3.1B
llama_model_loader: - kv   3:                            general.license str              = apache-2.0
llama_model_loader: - kv   4:                          general.languages arr[str,8]       = ["en", "fr", "es", "it", "pt", "zh", ...
llama_model_loader: - kv   5:                        smollm3.block_count u32              = 36
llama_model_loader: - kv   6:                     smollm3.context_length u32              = 65536
llama_model_loader: - kv   7:                   smollm3.embedding_length u32              = 2048
llama_model_loader: - kv   8:                smollm3.feed_forward_length u32              = 11008
llama_model_loader: - kv   9:               smollm3.attention.head_count u32              = 16
llama_model_loader: - kv  10:            smollm3.attention.head_count_kv u32              = 4
llama_model_loader: - kv  11:                     smollm3.rope.freq_base f32              = 5000000,000000
llama_model_loader: - kv  12:   smollm3.attention.layer_norm_rms_epsilon f32              = 0,000001
llama_model_loader: - kv  13:                         smollm3.vocab_size u32              = 128256
llama_model_loader: - kv  14:               smollm3.rope.dimension_count u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = smaug-bpe
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 128012
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 128012
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {# ───── defaults ───...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   73 tensors
llama_model_loader: - type q4_K:  216 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1,78 GiB (4,96 BPW) 
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load:   - 128012 ('<|im_end|>')
load: special tokens cache size = 256
load: token to piece cache size = 0,7997 MB
print_info: arch             = smollm3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 65536
print_info: n_embd           = 2048
print_info: n_embd_inp       = 2048
print_info: n_layer          = 36
print_info: n_head           = 16
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0,0e+00
print_info: f_norm_rms_eps   = 1,0e-06
print_info: f_clamp_kqv      = 0,0e+00
print_info: f_max_alibi_bias = 0,0e+00
print_info: f_logit_scale    = 0,0e+00
print_info: f_attn_scale     = 0,0e+00
print_info: n_ff             = 11008
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 5000000,0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 65536
print_info: rope_finetuned   = unknown
print_info: model type       = 3B
print_info: model params     = 3,08 B
print_info: general.name     = n/a
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128012 '<|im_end|>'
print_info: EOT token        = 128012 '<|im_end|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128012 '<|im_end|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: EOG token        = 128012 '<|im_end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   205,49 MiB
load_tensors:      Vulkan0 model buffer size =  1819,10 MiB
..................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 5000000,0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (65536) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host  output buffer size =     0,49 MiB
llama_kv_cache:    Vulkan0 KV buffer size =   288,00 MiB
llama_kv_cache: size =  288,00 MiB (  4096 cells,  36 layers,  1/1 seqs), K (f16):  144,00 MiB, V (f16):  144,00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:    Vulkan0 compute buffer size =   254,50 MiB
llama_context: Vulkan_Host compute buffer size =    12,02 MiB
llama_context: graph nodes  = 1105
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant


system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

main: interactive mode on.
sampler seed: 1112204385
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
	dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, top_n_sigma = -1,000, temp = 0,800
	mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

user
Hello
assistant
upported D F;

 D R P23D323D4 PP
D3D3DDD
> 
llama_perf_sampler_print:    sampling time =       3,15 ms /    35 runs   (    0,09 ms per token, 11097,02 tokens per second)
llama_perf_context_print:        load time =    2552,59 ms
llama_perf_context_print: prompt eval time =     351,46 ms /     9 tokens (   39,05 ms per token,    25,61 tokens per second)
llama_perf_context_print:        eval time =    3254,03 ms /    25 runs   (  130,16 ms per token,     7,68 tokens per second)
llama_perf_context_print:       total time =    4410,05 ms /    34 tokens
llama_perf_context_print:    graphs reused =         25
llama_memory_breakdown_print: | memory breakdown [MiB]                               | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Vulkan0 (Intel(R) Iris(R) Xe Graphics (TGL GT2)) | 11771 = 4421 + (2361 =  1819 +     288 +     254) +        4989 |
llama_memory_breakdown_print: |   - Host                                             |                  217 =   205 +       0 +      12                |
Interrupted by user

mimi89999 avatar Nov 08 '25 15:11 mimi89999

Let me know if it's actually not the same issue.

0cc4m avatar Nov 08 '25 18:11 0cc4m

This is actually a different issue, specific to Intel. I can reproduce it, but it doesn't make much sense: If I add a test that does the same thing as the operation that's failing in the model, the test passes. Not sure what is going on.

0cc4m avatar Nov 08 '25 20:11 0cc4m

What's that test?

mimi89999 avatar Nov 08 '25 20:11 mimi89999

test_cases.emplace_back(new test_mul_mat(GGML_TYPE_Q4_K, GGML_TYPE_F32, 2048, 512, 2048, {1,1}, {1,1}));

You can add that right after the one I added in the PR. But for me, it passes, despite failing inside of the model.
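
For anyone following along, after rebuilding, the case can then be run in isolation with the backend-ops test binary, roughly like this (the path and the -o operation filter are assumptions about a standard llama.cpp build):

./bin/test-backend-ops test -o MUL_MAT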

0cc4m avatar Nov 08 '25 20:11 0cc4m

The test also passes for me: MUL_MAT(type_a=q4_K,type_b=f32,m=2048,n=512,k=2048,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): OK

mimi89999 avatar Nov 08 '25 20:11 mimi89999

I have the same issue with the model bartowski/L3-8B-Stheno-v3.2-GGUF.

mimi89999 avatar Nov 08 '25 21:11 mimi89999

For me GGML_VK_DISABLE_F16 also fixes it, despite not having much of an effect on the shader itself. Maybe this is some kind of obscure driver issue.

0cc4m avatar Nov 08 '25 21:11 0cc4m

To report a bug in Mesa or the kernel, I would need to have a minimal test case. 😦

Setting GGML_VK_DISABLE_F16 also fixed the issue for me.
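
For completeness, the working run is just the original command with that environment variable set (flags taken from the report above):

GGML_VK_DISABLE_F16=1 ./llama-cli -hf ggml-org/SmolLM3-3B-GGUF -p "Hello"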

mimi89999 avatar Nov 08 '25 21:11 mimi89999

VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/lvp_icd.json GGML_VK_VISIBLE_DEVICES=0 ./bin/llama-cli -hf ggml-org/SmolLM3-3B-GGUF -ngl 99

I tried using the llvmpipe driver and the model worked correctly on it. This issue does seem to be specific to that Intel driver.

mimi89999 avatar Nov 08 '25 22:11 mimi89999

I can reproduce this with an Arc A750 with Mesa 25.2.4 on 86fde91e62c3f72ab7ed8a540dc1be049b735477; GGML_VK_DISABLE_F16 also fixed it for me.

TinyServal avatar Nov 09 '25 09:11 TinyServal

I notice that commenting out these two lines also resolves the issue:

CREATE_MMQ(GGML_TYPE_Q4_K, pipeline_dequant_mul_mat_mat_q8_1[GGML_TYPE_Q4_K], matmul_q4_k_q8_1, mmq_wg_denoms, warptile_mmq_int_k, vk_mat_mat_push_constants, 3, , 0);
CREATE_MMQ(GGML_TYPE_Q6_K, pipeline_dequant_mul_mat_mat_q8_1[GGML_TYPE_Q6_K], matmul_q6_k_q8_1, mmq_wg_denoms, warptile_mmq_int_k, vk_mat_mat_push_constants, 3, , 0);

mimi89999 avatar Nov 09 '25 21:11 mimi89999

> For me GGML_VK_DISABLE_F16 also fixes it, despite not having much of an effect on the shader itself. Maybe this is some kind of obscure driver issue.

@0cc4m I see that when setting GGML_VK_DISABLE_F16, the entire section of code containing CREATE_MMQ(GGML_TYPE_Q4_K, pipeline_dequant_mul_mat_mat_q8_1[GGML_TYPE_Q4_K], matmul_q4_k_q8_1, mmq_wg_denoms, warptile_mmq_int_k, vk_mat_mat_push_constants, 3, , 0); is not run.

On llvmpipe device->integer_dot_product is false.

I wonder if those differences could explain why the code runs on some drivers, but fails on others.

mimi89999 avatar Nov 09 '25 23:11 mimi89999

Indeed, I see that CREATE_MMQ(GGML_TYPE_Q4_K, pipeline_dequant_mul_mat_mat_q8_1[GGML_TYPE_Q4_K], matmul_q4_k_q8_1, mmq_wg_denoms, warptile_mmq_int_k, vk_mat_mat_push_constants, 3, , 0); is inside the if (device->fp16) { block. When setting GGML_VK_DISABLE_F16, device->fp16 becomes false.
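
A simplified sketch of the gating being described (not the exact ggml-vulkan.cpp source; argument lists abbreviated, comments reflect my reading of this thread):

if (device->fp16) {
    // These q8_1 MMQ pipelines are only created on the fp16 path.
    // GGML_VK_DISABLE_F16 forces device->fp16 = false, so they are skipped
    // and the output is correct again.
    CREATE_MMQ(GGML_TYPE_Q4_K, pipeline_dequant_mul_mat_mat_q8_1[GGML_TYPE_Q4_K], matmul_q4_k_q8_1, /* ... */);
    CREATE_MMQ(GGML_TYPE_Q6_K, pipeline_dequant_mul_mat_mat_q8_1[GGML_TYPE_Q6_K], matmul_q6_k_q8_1, /* ... */);
}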

mimi89999 avatar Nov 09 '25 23:11 mimi89999

Does an equivalent of apitrace for producing reduced test cases of this kind of problem exist for Vulkan?

IMbackK avatar Nov 11 '25 10:11 IMbackK

Looks like yes: https://github.com/LunarG/gfxreconstruct/blob/dev/USAGE_desktop_Vulkan.md
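
From a quick look at that doc, a capture of the failing run would go through the GFXReconstruct Vulkan layer, roughly like this (untested sketch; the layer and variable names are my reading of the linked guide and should be double-checked against it):

VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_gfxreconstruct GFXRECON_CAPTURE_FILE=llama_vulkan.gfxr ./llama-cli -hf ggml-org/SmolLM3-3B-GGUF -p "Hello"

The resulting .gfxr file could then be replayed with gfxrecon-replay and attached as a reproducer for the driver developers.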

IMbackK avatar Nov 11 '25 10:11 IMbackK