Eval bug: Qwen3-VL-8B freezes on image processing tasks
Name and Version
${llamasvr} -m ${mpath}\Qwen3-VL-8B-Instruct-UD-Q4_K_XL.gguf --mmproj ${mpath}\Qwen3-VL-8B-mmproj-BF16.gguf --no-mmap --ctx-size 16000 --jinja --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.0 --presence-penalty 1.5
Version: b6948
Windows Vulkan precompiled binaries.
Operating systems
Windows
GGML backends
Vulkan
Hardware
Ryzen 890m
Models
Qwen3-VL-8B
Problem description & steps to reproduce
The last working version is roughly b6910; I didn't test every build in between to pinpoint where the breakdown happened.
Qwen3-VL-8B freezes while processing images. Text responses work fine. Image uploads through the built-in web UI or Open WebUI all result in a complete freeze, and sometimes Windows locks up.
First Bad Commit
No response
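To help pin down the first bad commit, one option would be a git bisect between the last known-good and first known-bad release tags mentioned above (b6910 and b6948). This is only a rough, untested sketch and assumes building the Vulkan backend locally rather than using the precompiled binaries:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git bisect start
git bisect bad b6948     # build where the freeze occurs
git bisect good b6910    # last build known to work
# at each step suggested by bisect: rebuild, rerun the server command above
# with a test image, then mark the result
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
git bisect good          # or: git bisect bad, depending on the outcome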
Relevant log output
srv params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 3 | task 1 | processing task
slot update_slots: id 3 | task 1 | new prompt, n_ctx_slot = 16000, n_keep = 0, task.n_tokens = 2044
slot update_slots: id 3 | task 1 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 1 | prompt processing progress, n_tokens = 10, batch.n_tokens = 10, progress = 0.004892
slot update_slots: id 3 | task 1 | n_tokens = 10, memory_seq_rm [10, end)
srv process_chun: processing image...
https://github.com/ggml-org/llama.cpp/pull/16878 https://github.com/ggml-org/llama.cpp/pull/16921 I think the slow speed is due to image preprocessing.
These sound like enhancements. The b6910 release works perfectly fine. Looks like something changed in the last 4 days that caused the freeze. The freeze takes minutes and the machine locks up, so it's not working at all.
This will scale the image dimensions (usually upscaling), which increases the required processing time. On a less powerful device, the total processing time can become significantly longer. You could try reducing the --image-min-tokens / --image-max-tokens values.
You can try --image-min-tokens 8 --image-max-tokens 1024 or --image-min-tokens 8 --image-max-tokens 512.
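Applied to the command from the original report, that would look roughly like this (a sketch only; the model paths and sampling flags are unchanged from the first post):

${llamasvr} -m ${mpath}\Qwen3-VL-8B-Instruct-UD-Q4_K_XL.gguf --mmproj ${mpath}\Qwen3-VL-8B-mmproj-BF16.gguf --no-mmap --ctx-size 16000 --jinja --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.0 --presence-penalty 1.5 --image-min-tokens 8 --image-max-tokens 512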
That makes sense. I will try your suggestion once I'm off work.
I tried limiting the number of tokens, and it works, but it's still far slower. For example, at 512 tokens it's 3 times slower than previous llama.cpp builds.
I saw a change today that allows the Qwen3-VL model to run at full resolution up to 4096 tokens, up from 2048. So I ran an experiment:
- Updated llama.cpp to the latest binaries from 1 hr ago
- Set max tokens to 2048
- With --image-max-tokens 2048, performance tanked and image processing took about a minute
This shows that something else changed that tanked the performance, or maybe something in my parameters is throwing it off, but as you saw above, I don't have anything out of the ordinary.
If a build from 4-5 days ago processes an image at 2048 max tokens in under 5 seconds, but today's build at the same resolution takes ~60 seconds, there's another issue.
I tested with Gemma3-12B and it suffers from the same issue, but to a lesser extent. The mmproj file for Gemma3 is smaller than the Qwen3-VL-8B mmproj file, so that may have something to do with it.
@wqerrewetw I found the PR that broke everything; it's in build b6915.
I also noticed that when I set the max tokens to 512, the recognition quality tanks, and at 1024 it's multiple times slower. Something is not right, at least with the Vulkan backend.
I see the same slowdown/freeze on Ubuntu with the CUDA backend.
I’m surprised there isn’t a larger crowd reporting this behavior already.
I found something odd with GPU utilization on Windows: when the image processing part takes place, it uses the 3D engine, not Compute 0, as shown in the screenshot below:
I noticed that image processing is indeed quite a bit slower, almost two times slower.
I ran into the same problem.
The problem I'm running into is even stranger. The backend is SYCL, computing on the iGPU of an Intel 125H. Text-only Q&A works and inference is normal, but as soon as an image is provided, the program crashes immediately with the following error:
srv process_chun: processing image...
encoding image slice...
Segmentation fault (core dumped)
@spf1983 Could you share the full log from your SYCL backend run, and also provide the command with its parameters?
@NeoZhangJianyu
Startup command:
./build/bin/llama-server -t 8 -c 10240 --host 0.0.0.0 --port 9999 -ngl 0 --batch-size 2048 -fa on --no-warmup \
  -m /data_n002/models/Qwen3-VL-2B-Instruct-GGUF/Qwen3VL-2B-Instruct-Q4_K_M.gguf \
  --mmproj /data_n002/models/Qwen3-VL-2B-Instruct-GGUF/mmproj-Qwen3VL-2B-Instruct-Q8_0.gguf
Execution log:
main: setting n_parallel = 4 and kv_unified = true (add -kvu to disable this)
build: 7031 (655cddd17) with Intel(R) oneAPI DPC++/C++ Compiler 2025.2.0 (2025.2.0.20250605) for x86_64-unknown-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 18
system_info: n_threads = 8 (n_threads_batch = 8) / 18 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family main: HTTP server is listening, hostname: 0.0.0.0, port: 9999, http threads: 17 main: loading model srv load_model: loading model '/data_n002/models/Qwen3-VL-2B-Instruct-GGUF/Qwen3VL-2B-Instruct-Q4_K_M.gguf' get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Graphics) (unknown id) - 14130 MiB free llama_model_loader: loaded meta data with 32 key-value pairs and 310 tensors from /data_n002/models/Qwen3-VL-2B-Instruct-GGUF/Qwen3VL-2B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen3vl llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Qwen3Vl 2b Instruct llama_model_loader: - kv 3: general.finetune str = instruct llama_model_loader: - kv 4: general.basename str = qwen3vl llama_model_loader: - kv 5: general.size_label str = 2B llama_model_loader: - kv 6: general.license str = apache-2.0 llama_model_loader: - kv 7: general.tags arr[str,1] = ["image-text-to-text"] llama_model_loader: - kv 8: qwen3vl.block_count u32 = 28 llama_model_loader: - kv 9: qwen3vl.context_length u32 = 262144 llama_model_loader: - kv 10: qwen3vl.embedding_length u32 = 2048 llama_model_loader: - kv 11: qwen3vl.feed_forward_length u32 = 6144 llama_model_loader: - kv 12: qwen3vl.attention.head_count u32 = 16 llama_model_loader: - kv 13: qwen3vl.attention.head_count_kv u32 = 8 llama_model_loader: - kv 14: qwen3vl.rope.freq_base f32 = 5000000.000000 llama_model_loader: - kv 15: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 16: qwen3vl.attention.key_length u32 = 128 llama_model_loader: - kv 17: qwen3vl.attention.value_length u32 = 128 llama_model_loader: - kv 18: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0] llama_model_loader: - kv 19: qwen3vl.n_deepstack_layers u32 = 3 llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ... llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 151643 llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 
llama_model_loader: - kv 30: general.quantization_version u32 = 2 llama_model_loader: - kv 31: general.file_type u32 = 15 llama_model_loader: - type f32: 113 tensors llama_model_loader: - type q4_K: 168 tensors llama_model_loader: - type q6_K: 29 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q4_K - Medium print_info: file size = 1.03 GiB (5.12 BPW) load: printing all EOG tokens: load: - 151643 ('<|endoftext|>') load: - 151645 ('<|im_end|>') load: - 151662 ('<|fim_pad|>') load: - 151663 ('<|repo_name|>') load: - 151664 ('<|file_sep|>') load: special tokens cache size = 26 load: token to piece cache size = 0.9311 MB print_info: arch = qwen3vl print_info: vocab_only = 0 print_info: n_ctx_train = 262144 print_info: n_embd = 2048 print_info: n_embd_inp = 8192 print_info: n_layer = 28 print_info: n_head = 16 print_info: n_head_kv = 8 print_info: n_rot = 128 print_info: n_swa = 0 print_info: is_swa_any = 0 print_info: n_embd_head_k = 128 print_info: n_embd_head_v = 128 print_info: n_gqa = 2 print_info: n_embd_k_gqa = 1024 print_info: n_embd_v_gqa = 1024 print_info: f_norm_eps = 0.0e+00 print_info: f_norm_rms_eps = 1.0e-06 print_info: f_clamp_kqv = 0.0e+00 print_info: f_max_alibi_bias = 0.0e+00 print_info: f_logit_scale = 0.0e+00 print_info: f_attn_scale = 0.0e+00 print_info: n_ff = 6144 print_info: n_expert = 0 print_info: n_expert_used = 0 print_info: n_expert_groups = 0 print_info: n_group_used = 0 print_info: causal attn = 1 print_info: pooling type = 0 print_info: rope type = 40 print_info: rope scaling = linear print_info: freq_base_train = 5000000.0 print_info: freq_scale_train = 1 print_info: n_ctx_orig_yarn = 262144 print_info: rope_finetuned = unknown print_info: mrope sections = [24, 20, 20, 0] print_info: model type = 1.7B print_info: model params = 1.72 B print_info: general.name = Qwen3Vl 2b Instruct print_info: vocab type = BPE print_info: n_vocab = 151936 print_info: n_merges = 151387 print_info: BOS token = 151643 '<|endoftext|>' print_info: EOS token = 151645 '<|im_end|>' print_info: EOT token = 151645 '<|im_end|>' print_info: PAD token = 151643 '<|endoftext|>' print_info: LF token = 198 'Ċ' print_info: FIM PRE token = 151659 '<|fim_prefix|>' print_info: FIM SUF token = 151661 '<|fim_suffix|>' print_info: FIM MID token = 151660 '<|fim_middle|>' print_info: FIM PAD token = 151662 '<|fim_pad|>' print_info: FIM REP token = 151663 '<|repo_name|>' print_info: FIM SEP token = 151664 '<|file_sep|>' print_info: EOG token = 151643 '<|endoftext|>' print_info: EOG token = 151645 '<|im_end|>' print_info: EOG token = 151662 '<|fim_pad|>' print_info: EOG token = 151663 '<|repo_name|>' print_info: EOG token = 151664 '<|file_sep|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... (mmap = true) get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory load_tensors: offloading 0 repeating layers to GPU load_tensors: offloaded 0/29 layers to GPU load_tensors: CPU_Mapped model buffer size = 1050.43 MiB .............................................................................. 
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 10240
llama_context: n_ctx_seq = 10240
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = true
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (10240) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
Running with Environment Variables:
GGML_SYCL_DEBUG: 0
GGML_SYCL_DISABLE_OPT: 0
GGML_SYCL_DISABLE_GRAPH: 1
GGML_SYCL_DISABLE_DNN: 0
GGML_SYCL_PRIORITIZE_DMMV: 0
Build with Macros:
GGML_SYCL_FORCE_MMQ: no
GGML_SYCL_F16: yes
Found 1 SYCL devices:
| ID | Device Type | Name | Version | Max compute units | Max work group | Max sub group | Global mem size | Driver version |
|----|----------------|----------------|---------|-------------------|----------------|---------------|-----------------|-----------------|
| 0 | [opencl:gpu:0] | Intel Graphics | 3.0 | 112 | 1024 | 32 | 14816M | 24.35.30872.22 |
SYCL Optimization Feature:
| ID | Device Type | Reorder |
|----|----------------|---------|
| 0 | [opencl:gpu:0] | Y |
llama_context: CPU output buffer size = 2.32 MiB
llama_kv_cache: CPU KV buffer size = 1120.00 MiB
llama_kv_cache: size = 1120.00 MiB ( 10240 cells, 28 layers, 4/1 seqs), K (f16): 560.00 MiB, V (f16): 560.00 MiB
llama_context: SYCL0 compute buffer size = 558.02 MiB
llama_context: SYCL_Host compute buffer size = 24.02 MiB
llama_context: graph nodes = 987
llama_context: graph splits = 342 (with bs=512), 1 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 10240
clip_model_loader: model name: Qwen3Vl 2b Instruct
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 316
clip_model_loader: n_kv: 25
clip_model_loader: has vision encoder
clip_ctx: CLIP using SYCL0 backend
load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842
load_hparams: projector: qwen3vl_merger
load_hparams: n_embd: 1024
load_hparams: n_head: 16
load_hparams: n_ff: 4096
load_hparams: n_layer: 24
load_hparams: ffn_op: gelu
load_hparams: projection_dim: 2048
--- vision hparams ---
load_hparams: image_size: 768
load_hparams: patch_size: 16
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: n_merge: 2
load_hparams: n_wa_pattern: 0
load_hparams: image_min_pixels: 8192
load_hparams: image_max_pixels: 4194304
load_hparams: model size: 424.42 MiB
load_hparams: metadata size: 0.11 MiB
alloc_compute_meta: warmup with image size = 1472 x 1472
alloc_compute_meta: SYCL0 compute buffer size = 330.75 MiB
alloc_compute_meta: CPU compute buffer size = 99.19 MiB
alloc_compute_meta: graph splits = 51, nodes = 766
warmup: *****************************************************************
warmup: WARNING: flash attention not supported by SYCL0, memory usage will increase
warmup: op params:
warmup: dst: type = f32, ne = [64 16 8464 1], nb = [4 256 4096 34668544]
warmup: src0: type = f32, ne = [64 8464 16 1], nb = [4 4096 256 34668544]
warmup: src1: type = f16, ne = [64 8464 16 1], nb = [2 128 1083392 17334272]
warmup: src2: type = f16, ne = [64 8464 16 1], nb = [2 128 1083392 17334272]
warmup: please report this on github as an issue
warmup: *****************************************************************
alloc_compute_meta: warmup with image size = 1472 x 1472
ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 4584914944 Bytes of memory on device
ggml_gallocr_reserve_n: failed to allocate SYCL0 buffer of size 9412645120
alloc_compute_meta: CPU compute buffer size = 99.19 MiB
alloc_compute_meta: graph splits = 3, nodes = 814
warmup: flash attention is disabled
warmup: *****************************************************************
warmup: WARNING: the CLIP graph uses unsupported operators by the backend
warmup: the performance will be suboptimal
warmup: list of unsupported ops (backend=SYCL0):
warmup: UPSCALE: type = f32, ne = [92 92 1024 1]
warmup: SOFT_MAX: type = f32, ne = [8464 8464 16 1]
warmup: CONT: type = f32, ne = [8464 64 16 1]
warmup: PERMUTE: type = f32, ne = [64 8464 16 1]
warmup: ROPE: type = f32, ne = [64 16 8464 1]
warmup: VIEW: type = f32, ne = [64 16 8464 1]
warmup: VIEW: type = f32, ne = [64 16 8464 1]
warmup: MUL_MAT: type = f32, ne = [3072 8464 1 1]
warmup: MUL: type = f32, ne = [1024 8464 1 1]
warmup: ADD: type = f32, ne = [1024 8464 1 1]
warmup: MUL_MAT: type = f32, ne = [1024 8464 1 1]
warmup: ADD: type = f32, ne = [4096 8464 1 1]
warmup: ADD: type = f32, ne = [1024 8464 1 1]
warmup: NORM: type = f32, ne = [4096 2116 1 1]
warmup: ADD: type = f32, ne = [1024 8464 1 1]
warmup: CONT: type = f32, ne = [1024 8464 1 1]
warmup: MUL_MAT: type = f32, ne = [64 8464 16 1]
warmup: MUL_MAT: type = f32, ne = [8464 8464 16 1]
warmup: PERMUTE: type = f32, ne = [8464 64 16 1]
warmup: ADD: type = f32, ne = [1024 8464 1 1]
warmup: ROPE: type = f32, ne = [64 16 8464 1]
warmup: VIEW: type = f32, ne = [64 16 8464 1]
warmup: ADD: type = f32, ne = [3072 8464 1 1]
warmup: ADD: type = f32, ne = [1024 8464 1 1]
warmup: NORM: type = f32, ne = [1024 8464 1 1]
warmup: flash attention is disabled
warmup: please report this on github as an issue
warmup: ref: https://github.com/ggml-org/llama.cpp/pull/16837#issuecomment-3461676118
warmup: *****************************************************************
srv load_model: loaded multimodal model, '/data_n002/models/Qwen3-VL-2B-Instruct-GGUF/mmproj-Qwen3VL-2B-Instruct-Q8_0.gguf'
srv init: initializing slots, n_slots = 4
slot init: id 0 | task -1 | new slot, n_ctx = 10240
slot init: id 1 | task -1 | new slot, n_ctx = 10240
slot init: id 2 | task -1 | new slot, n_ctx = 10240
slot init: id 3 | task -1 | new slot, n_ctx = 10240
srv init: prompt cache is enabled, size limit: 8192 MiB
srv init: use --cache-ram 0 to disable the prompt cache
srv init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv init: thinking = 0
main: model loaded
main: chat template, chat_template: {%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{%- if messages[0].content is string %}
{{- messages[0].content }}
{%- else %}
{%- for content in messages[0].content %}
{%- if 'text' in content %}
{{- content.text }}
{%- endif %}
{%- endfor %}
{%- endif %}
{{- '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within