
Eval bug: Qwen3-VL-8B freezes on image processing tasks

Open jhemmond opened this issue 2 weeks ago • 15 comments

Name and Version

${llamasvr} -m ${mpath}\Qwen3-VL-8B-Instruct-UD-Q4_K_XL.gguf --mmproj ${mpath}\Qwen3-VL-8B-mmproj-BF16.gguf --no-mmap --ctx-size 16000 --jinja --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.0 --presence-penalty 1.5

Version: b6948

Windows Vulkan precompiled binaries.

Operating systems

Windows

GGML backends

Vulkan

Hardware

Ryzen 890m

Models

Qwen3-VL-8B

Problem description & steps to reproduce

Last working version is roughly b6910 and earlier. I didn't test all versions to see where the breakdown happened.

Qwen3-VL-8B freezes while processing images. Text responses work fine. Image uploads to the webui or Open-WebUI all result in a complete freeze, and sometimes Windows locks up.

First Bad Commit

No response

Relevant log output

srv  params_from_: Chat format: Hermes 2 Pro

slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1

slot launch_slot_: id  3 | task 1 | processing task

slot update_slots: id  3 | task 1 | new prompt, n_ctx_slot = 16000, n_keep = 0, task.n_tokens = 2044

slot update_slots: id  3 | task 1 | n_tokens = 0, memory_seq_rm [0, end)

slot update_slots: id  3 | task 1 | prompt processing progress, n_tokens = 10, batch.n_tokens = 10, progress = 0.004892

slot update_slots: id  3 | task 1 | n_tokens = 10, memory_seq_rm [10, end)

srv  process_chun: processing image...

jhemmond avatar Nov 04 '25 23:11 jhemmond

https://github.com/ggml-org/llama.cpp/pull/16878 https://github.com/ggml-org/llama.cpp/pull/16921 I think the slow speed is due to image preprocessing.

wqerrewetw avatar Nov 05 '25 12:11 wqerrewetw

#16878 #16921 I think the slow speed is due to image preprocessing.

These sound like enhancements. The b6910 release works perfectly fine. Looks like something changed in the last 4 days that caused the freeze. The freeze takes minutes and the machine locks up, so it's not working at all.

jhemmond avatar Nov 05 '25 14:11 jhemmond

These sound like enhancements. The b6910 release works perfectly fine. Looks like something changed in the last 4 days that caused the freeze. The freeze takes minutes and the machine locks up, so it's not working at all.

This change scales the image dimensions (usually upscaling), which increases the required processing time. If your device is less powerful, the total processing time can become significantly longer. You could try reducing the --image-min/max-tokens values.

wqerrewetw avatar Nov 05 '25 15:11 wqerrewetw

You can try --image-min-tokens 8 --image-max-tokens 1024 or --image-min-tokens 8 --image-max-tokens 512.
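For reference, a rough sketch of how those limits would slot into the original invocation from this report (same model paths and sampling flags as above; the 512 value is just a starting point to test):

${llamasvr} -m ${mpath}\Qwen3-VL-8B-Instruct-UD-Q4_K_XL.gguf --mmproj ${mpath}\Qwen3-VL-8B-mmproj-BF16.gguf --no-mmap --ctx-size 16000 --jinja --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.0 --presence-penalty 1.5 --image-min-tokens 8 --image-max-tokens 512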

wqerrewetw avatar Nov 05 '25 15:11 wqerrewetw

That makes sense. I will try your suggestion once I'm off work.

jhemmond avatar Nov 05 '25 19:11 jhemmond

I tried limiting the number of tokens, and it works but is still far slower. For example, at 512 tokens it's 3 times slower than previous llama.cpp builds.

I saw today a change that allows the Qwen3-VL model to run at full resolution with up to 4096 image tokens, up from 2048. So I ran an experiment:

  • Updated llama.cpp to the latest binaries from 1 hr ago
  • Set max tokens to 2048 --image-max-tokens 2048
  • The performance tanked and image processing took about a minute

This proves that something else was changed that tanked the performance, or maybe something in my parameters is throwing it off, but as you can see above, I don't have anything out of the ordinary.

If a build from 4-5 days ago processes an image at 2048 max tokens in under 5 seconds, but today's build at the same resolution takes ~60 seconds, there's another issue.
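A rough way to put numbers on that comparison is to time the same image request against the old and new builds. A minimal sketch, assuming the server's OpenAI-compatible /v1/chat/completions endpoint on the default port 8080 and a local test.jpg (both are assumptions; adjust to your setup):

# encode a test image and time one multimodal request (bash; base64 -w0 is GNU coreutils)
IMG=$(base64 -w0 test.jpg)
time curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"messages\":[{\"role\":\"user\",\"content\":[{\"type\":\"text\",\"text\":\"Describe this image.\"},{\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/jpeg;base64,${IMG}\"}}]}]}" > /dev/null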

I tested with Gemma3-12B and it suffers the same issue, but to a lesser extent. The mmproj file for Gemma3 is smaller than the Qwen3-VL-8B mmproj file, so that may have something to do with it.

jhemmond avatar Nov 06 '25 01:11 jhemmond

@wqerrewetw I found the PR that broke everything. It's in build b6915:

#16878

I also noticed that when I set the max tokens to 512, the recognition quality tanks, and at 1024 it's multiple times slower; something is not right, at least with the Vulkan backend.
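For anyone who wants to confirm the first bad commit on their own hardware, a bisect between the last good and first bad builds is one way to do it. A minimal sketch, assuming the bNNNN release tags are present in the clone and a working Vulkan build environment (exact cmake options may differ per setup):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git bisect start b6915 b6910          # first bad build, last good build
cmake -B build -DGGML_VULKAN=ON
cmake --build build --target llama-server -j
# re-test image processing with the resulting binary, then mark the commit:
git bisect good    # or: git bisect bad
# repeat rebuild/test/mark until git reports the first bad commit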

jhemmond avatar Nov 06 '25 01:11 jhemmond

I see the same slowdown/freeze using Ubuntu and CUDA backend.

deadprogram avatar Nov 08 '25 16:11 deadprogram

I’m surprised there isn’t a larger crowd reporting this behavior already.

jhemmond avatar Nov 08 '25 16:11 jhemmond

I found something odd with GPU utilization in Windows. I noticed that when the image processing part takes place, it uses the 3D engine, not Compute 0, as shown in the screenshot below:

Image

jhemmond avatar Nov 09 '25 23:11 jhemmond

I did notice that image processing was quite a bit slower, yes, almost two times slower.

aviallon avatar Nov 11 '25 15:11 aviallon

I ran into the same problem.

Gaoeee avatar Nov 12 '25 02:11 Gaoeee

My case is even stranger: the backend is SYCL, running on the iGPU of an Intel 125H. Text-only Q&A works and inference is normal, but as soon as an image is input, the program crashes immediately. The error is as follows:

srv process_chun: processing image... encoding image slice... Segmentation fault (core dumped)

spf1983 avatar Nov 14 '25 03:11 spf1983

@spf1983 Could you share the whole log of your case with the SYCL backend? And please provide the command with its parameters too.

NeoZhangJianyu avatar Nov 14 '25 04:11 NeoZhangJianyu

@NeoZhangJianyu

Launch command: ./build/bin/llama-server -t 8 -c 10240 --host 0.0.0.0 --port 9999 -ngl 0 --batch-size 2048 -fa on --no-warmup
-m /data_n002/models/Qwen3-VL-2B-Instruct-GGUF/Qwen3VL-2B-Instruct-Q4_K_M.gguf
--mmproj /data_n002/models/Qwen3-VL-2B-Instruct-GGUF/mmproj-Qwen3VL-2B-Instruct-Q8_0.gguf

Execution log:

main: setting n_parallel = 4 and kv_unified = true (add -kvu to disable this)
build: 7031 (655cddd17) with Intel(R) oneAPI DPC++/C++ Compiler 2025.2.0 (2025.2.0.20250605) for x86_64-unknown-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 18

system_info: n_threads = 8 (n_threads_batch = 8) / 18 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: binding port with default address family main: HTTP server is listening, hostname: 0.0.0.0, port: 9999, http threads: 17 main: loading model srv load_model: loading model '/data_n002/models/Qwen3-VL-2B-Instruct-GGUF/Qwen3VL-2B-Instruct-Q4_K_M.gguf' get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Graphics) (unknown id) - 14130 MiB free llama_model_loader: loaded meta data with 32 key-value pairs and 310 tensors from /data_n002/models/Qwen3-VL-2B-Instruct-GGUF/Qwen3VL-2B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen3vl llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Qwen3Vl 2b Instruct llama_model_loader: - kv 3: general.finetune str = instruct llama_model_loader: - kv 4: general.basename str = qwen3vl llama_model_loader: - kv 5: general.size_label str = 2B llama_model_loader: - kv 6: general.license str = apache-2.0 llama_model_loader: - kv 7: general.tags arr[str,1] = ["image-text-to-text"] llama_model_loader: - kv 8: qwen3vl.block_count u32 = 28 llama_model_loader: - kv 9: qwen3vl.context_length u32 = 262144 llama_model_loader: - kv 10: qwen3vl.embedding_length u32 = 2048 llama_model_loader: - kv 11: qwen3vl.feed_forward_length u32 = 6144 llama_model_loader: - kv 12: qwen3vl.attention.head_count u32 = 16 llama_model_loader: - kv 13: qwen3vl.attention.head_count_kv u32 = 8 llama_model_loader: - kv 14: qwen3vl.rope.freq_base f32 = 5000000.000000 llama_model_loader: - kv 15: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 16: qwen3vl.attention.key_length u32 = 128 llama_model_loader: - kv 17: qwen3vl.attention.value_length u32 = 128 llama_model_loader: - kv 18: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0] llama_model_loader: - kv 19: qwen3vl.n_deepstack_layers u32 = 3 llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ... llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 151643 llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 
llama_model_loader: - kv 30: general.quantization_version u32 = 2 llama_model_loader: - kv 31: general.file_type u32 = 15 llama_model_loader: - type f32: 113 tensors llama_model_loader: - type q4_K: 168 tensors llama_model_loader: - type q6_K: 29 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q4_K - Medium print_info: file size = 1.03 GiB (5.12 BPW) load: printing all EOG tokens: load: - 151643 ('<|endoftext|>') load: - 151645 ('<|im_end|>') load: - 151662 ('<|fim_pad|>') load: - 151663 ('<|repo_name|>') load: - 151664 ('<|file_sep|>') load: special tokens cache size = 26 load: token to piece cache size = 0.9311 MB print_info: arch = qwen3vl print_info: vocab_only = 0 print_info: n_ctx_train = 262144 print_info: n_embd = 2048 print_info: n_embd_inp = 8192 print_info: n_layer = 28 print_info: n_head = 16 print_info: n_head_kv = 8 print_info: n_rot = 128 print_info: n_swa = 0 print_info: is_swa_any = 0 print_info: n_embd_head_k = 128 print_info: n_embd_head_v = 128 print_info: n_gqa = 2 print_info: n_embd_k_gqa = 1024 print_info: n_embd_v_gqa = 1024 print_info: f_norm_eps = 0.0e+00 print_info: f_norm_rms_eps = 1.0e-06 print_info: f_clamp_kqv = 0.0e+00 print_info: f_max_alibi_bias = 0.0e+00 print_info: f_logit_scale = 0.0e+00 print_info: f_attn_scale = 0.0e+00 print_info: n_ff = 6144 print_info: n_expert = 0 print_info: n_expert_used = 0 print_info: n_expert_groups = 0 print_info: n_group_used = 0 print_info: causal attn = 1 print_info: pooling type = 0 print_info: rope type = 40 print_info: rope scaling = linear print_info: freq_base_train = 5000000.0 print_info: freq_scale_train = 1 print_info: n_ctx_orig_yarn = 262144 print_info: rope_finetuned = unknown print_info: mrope sections = [24, 20, 20, 0] print_info: model type = 1.7B print_info: model params = 1.72 B print_info: general.name = Qwen3Vl 2b Instruct print_info: vocab type = BPE print_info: n_vocab = 151936 print_info: n_merges = 151387 print_info: BOS token = 151643 '<|endoftext|>' print_info: EOS token = 151645 '<|im_end|>' print_info: EOT token = 151645 '<|im_end|>' print_info: PAD token = 151643 '<|endoftext|>' print_info: LF token = 198 'Ċ' print_info: FIM PRE token = 151659 '<|fim_prefix|>' print_info: FIM SUF token = 151661 '<|fim_suffix|>' print_info: FIM MID token = 151660 '<|fim_middle|>' print_info: FIM PAD token = 151662 '<|fim_pad|>' print_info: FIM REP token = 151663 '<|repo_name|>' print_info: FIM SEP token = 151664 '<|file_sep|>' print_info: EOG token = 151643 '<|endoftext|>' print_info: EOG token = 151645 '<|im_end|>' print_info: EOG token = 151662 '<|fim_pad|>' print_info: EOG token = 151663 '<|repo_name|>' print_info: EOG token = 151664 '<|file_sep|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... (mmap = true) get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory load_tensors: offloading 0 repeating layers to GPU load_tensors: offloaded 0/29 layers to GPU load_tensors: CPU_Mapped model buffer size = 1050.43 MiB .............................................................................. 
llama_context: constructing llama_context llama_context: n_seq_max = 4 llama_context: n_ctx = 10240 llama_context: n_ctx_seq = 10240 llama_context: n_batch = 2048 llama_context: n_ubatch = 512 llama_context: causal_attn = 1 llama_context: flash_attn = enabled llama_context: kv_unified = true llama_context: freq_base = 5000000.0 llama_context: freq_scale = 1 llama_context: n_ctx_seq (10240) < n_ctx_train (262144) -- the full capacity of the model will not be utilized Running with Environment Variables: GGML_SYCL_DEBUG: 0 GGML_SYCL_DISABLE_OPT: 0 GGML_SYCL_DISABLE_GRAPH: 1 GGML_SYCL_DISABLE_DNN: 0 GGML_SYCL_PRIORITIZE_DMMV: 0 Build with Macros: GGML_SYCL_FORCE_MMQ: no GGML_SYCL_F16: yes Found 1 SYCL devices: | | | | |Max | |Max |Global | | | | | | |compute|Max work|sub |mem | |

ID | Device Type    | Name           | Version | Max compute units | Max work group | Max sub group | Global mem size | Driver version
 0 | [opencl:gpu:0] | Intel Graphics | 3.0     | 112               | 1024           | 32            | 14816M          | 24.35.30872.22

SYCL Optimization Feature:
ID | Device Type    | Reorder
 0 | [opencl:gpu:0] | Y

llama_context: CPU output buffer size = 2.32 MiB llama_kv_cache: CPU KV buffer size = 1120.00 MiB llama_kv_cache: size = 1120.00 MiB ( 10240 cells, 28 layers, 4/1 seqs), K (f16): 560.00 MiB, V (f16): 560.00 MiB llama_context: SYCL0 compute buffer size = 558.02 MiB llama_context: SYCL_Host compute buffer size = 24.02 MiB llama_context: graph nodes = 987 llama_context: graph splits = 342 (with bs=512), 1 (with bs=1) common_init_from_params: added <|endoftext|> logit bias = -inf common_init_from_params: added <|im_end|> logit bias = -inf common_init_from_params: added <|fim_pad|> logit bias = -inf common_init_from_params: added <|repo_name|> logit bias = -inf common_init_from_params: added <|file_sep|> logit bias = -inf common_init_from_params: setting dry_penalty_last_n to ctx_size = 10240 clip_model_loader: model name: Qwen3Vl 2b Instruct clip_model_loader: description:
clip_model_loader: GGUF version: 3 clip_model_loader: alignment: 32 clip_model_loader: n_tensors: 316 clip_model_loader: n_kv: 25

clip_model_loader: has vision encoder
clip_ctx: CLIP using SYCL0 backend
load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842

load_hparams: projector: qwen3vl_merger
load_hparams: n_embd: 1024
load_hparams: n_head: 16
load_hparams: n_ff: 4096
load_hparams: n_layer: 24
load_hparams: ffn_op: gelu
load_hparams: projection_dim: 2048

--- vision hparams ---
load_hparams: image_size: 768
load_hparams: patch_size: 16
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: n_merge: 2
load_hparams: n_wa_pattern: 0
load_hparams: image_min_pixels: 8192
load_hparams: image_max_pixels: 4194304

load_hparams: model size: 424.42 MiB
load_hparams: metadata size: 0.11 MiB
alloc_compute_meta: warmup with image size = 1472 x 1472
alloc_compute_meta: SYCL0 compute buffer size = 330.75 MiB
alloc_compute_meta: CPU compute buffer size = 99.19 MiB
alloc_compute_meta: graph splits = 51, nodes = 766
warmup: *****************************************************************
warmup: WARNING: flash attention not supported by SYCL0, memory usage will increase
warmup: op params:
warmup: dst: type = f32, ne = [64 16 8464 1], nb = [4 256 4096 34668544]
warmup: src0: type = f32, ne = [64 8464 16 1], nb = [4 4096 256 34668544]
warmup: src1: type = f16, ne = [64 8464 16 1], nb = [2 128 1083392 17334272]
warmup: src2: type = f16, ne = [64 8464 16 1], nb = [2 128 1083392 17334272]
warmup: please report this on github as an issue
warmup: *****************************************************************
alloc_compute_meta: warmup with image size = 1472 x 1472
ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 4584914944 Bytes of memory on device
ggml_gallocr_reserve_n: failed to allocate SYCL0 buffer of size 9412645120
alloc_compute_meta: CPU compute buffer size = 99.19 MiB
alloc_compute_meta: graph splits = 3, nodes = 814
warmup: flash attention is disabled
warmup: *****************************************************************
warmup: WARNING: the CLIP graph uses unsupported operators by the backend
warmup: the performance will be suboptimal
warmup: list of unsupported ops (backend=SYCL0): warmup: UPSCALE: type = f32, ne = [92 92 1024 1] warmup: SOFT_MAX: type = f32, ne = [8464 8464 16 1] warmup: CONT: type = f32, ne = [8464 64 16 1] warmup: PERMUTE: type = f32, ne = [64 8464 16 1] warmup: ROPE: type = f32, ne = [64 16 8464 1] warmup: VIEW: type = f32, ne = [64 16 8464 1] warmup: VIEW: type = f32, ne = [64 16 8464 1] warmup: MUL_MAT: type = f32, ne = [3072 8464 1 1] warmup: MUL: type = f32, ne = [1024 8464 1 1] warmup: ADD: type = f32, ne = [1024 8464 1 1] warmup: MUL_MAT: type = f32, ne = [1024 8464 1 1] warmup: ADD: type = f32, ne = [4096 8464 1 1] warmup: ADD: type = f32, ne = [1024 8464 1 1] warmup: NORM: type = f32, ne = [4096 2116 1 1] warmup: ADD: type = f32, ne = [1024 8464 1 1] warmup: CONT: type = f32, ne = [1024 8464 1 1] warmup: MUL_MAT: type = f32, ne = [64 8464 16 1] warmup: MUL_MAT: type = f32, ne = [8464 8464 16 1] warmup: PERMUTE: type = f32, ne = [8464 64 16 1] warmup: ADD: type = f32, ne = [1024 8464 1 1] warmup: ROPE: type = f32, ne = [64 16 8464 1] warmup: VIEW: type = f32, ne = [64 16 8464 1] warmup: ADD: type = f32, ne = [3072 8464 1 1] warmup: ADD: type = f32, ne = [1024 8464 1 1] warmup: NORM: type = f32, ne = [1024 8464 1 1] warmup: flash attention is disabled warmup: please report this on github as an issue warmup: ref: https://github.com/ggml-org/llama.cpp/pull/16837#issuecomment-3461676118 warmup: ***************************************************************** srv load_model: loaded multimodal model, '/data_n002/models/Qwen3-VL-2B-Instruct-GGUF/mmproj-Qwen3VL-2B-Instruct-Q8_0.gguf' srv init: initializing slots, n_slots = 4 slot init: id 0 | task -1 | new slot, n_ctx = 10240 slot init: id 1 | task -1 | new slot, n_ctx = 10240 slot init: id 2 | task -1 | new slot, n_ctx = 10240 slot init: id 3 | task -1 | new slot, n_ctx = 10240 srv init: prompt cache is enabled, size limit: 8192 MiB srv init: use --cache-ram 0 to disable the prompt cache srv init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391 srv init: thinking = 0 main: model loaded main: chat template, chat_template: {%- if tools %} {{- '<|im_start|>system\n' }} {%- if messages[0].role == 'system' %} {%- if messages[0].content is string %} {{- messages[0].content }} {%- else %} {%- for content in messages[0].content %} {%- if 'text' in content %} {{- content.text }} {%- endif %} {%- endfor %} {%- endif %} {{- '\n\n' }} {%- endif %} {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within XML tags:\n" }} {%- for tool in tools %} {{- "\n" }} {{- tool | tojson }} {%- endfor %} {{- "\n\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": , "arguments": }\n</tool_call><|im_end|>\n" }} {%- else %} {%- if messages[0].role == 'system' %} {{- '<|im_start|>system\n' }} {%- if messages[0].content is string %} {{- messages[0].content }} {%- else %} {%- for content in messages[0].content %} {%- if 'text' in content %} {{- content.text }} {%- endif %} {%- endfor %} {%- endif %} {{- '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- set image_count = namespace(value=0) %} {%- set video_count = namespace(value=0) %} {%- for message in messages %} {%- if message.role == "user" %} {{- '<|im_start|>' + message.role + '\n' }} {%- if message.content is string %} {{- message.content }} {%- else %} {%- for content in message.content %} {%- if content.type == 'image' or 'image' in 
content or 'image_url' in content %} {%- set image_count.value = image_count.value + 1 %} {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%} <|vision_start|><|image_pad|><|vision_end|> {%- elif content.type == 'video' or 'video' in content %} {%- set video_count.value = video_count.value + 1 %} {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%} <|vision_start|><|video_pad|><|vision_end|> {%- elif 'text' in content %} {{- content.text }} {%- endif %} {%- endfor %} {%- endif %} {{- '<|im_end|>\n' }} {%- elif message.role == "assistant" %} {{- '<|im_start|>' + message.role + '\n' }} {%- if message.content is string %} {{- message.content }} {%- else %} {%- for content_item in message.content %} {%- if 'text' in content_item %} {{- content_item.text }} {%- endif %} {%- endfor %} {%- endif %} {%- if message.tool_calls %} {%- for tool_call in message.tool_calls %} {%- if (loop.first and message.content) or (not loop.first) %} {{- '\n' }} {%- endif %} {%- if tool_call.function %} {%- set tool_call = tool_call.function %} {%- endif %} {{- '<tool_call>\n{"name": "' }} {{- tool_call.name }} {{- '", "arguments": ' }} {%- if tool_call.arguments is string %} {{- tool_call.arguments }} {%- else %} {{- tool_call.arguments | tojson }} {%- endif %} {{- '}\n</tool_call>' }} {%- endfor %} {%- endif %} {{- '<|im_end|>\n' }} {%- elif message.role == "tool" %} {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %} {{- '<|im_start|>user' }} {%- endif %} {{- '\n<tool_response>\n' }} {%- if message.content is string %} {{- message.content }} {%- else %} {%- for content in message.content %} {%- if content.type == 'image' or 'image' in content or 'image_url' in content %} {%- set image_count.value = image_count.value + 1 %} {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%} <|vision_start|><|image_pad|><|vision_end|> {%- elif content.type == 'video' or 'video' in content %} {%- set video_count.value = video_count.value + 1 %} {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%} <|vision_start|><|video_pad|><|vision_end|> {%- elif 'text' in content %} {{- content.text }} {%- endif %} {%- endfor %} {%- endif %} {{- '\n</tool_response>' }} {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %} {{- '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- endfor %} {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n' }} {%- endif %} , example_format: '<|im_start|>system You are a helpful assistant<|im_end|> <|im_start|>user Hello<|im_end|> <|im_start|>assistant Hi there<|im_end|> <|im_start|>user How are you?<|im_end|> <|im_start|>assistant ' main: server is listening on http://0.0.0.0:9999 - starting the main loop srv update_slots: all slots are idle srv log_server_r: request: GET / 127.0.0.1 200 srv log_server_r: request: GET / 127.0.0.1 200 srv log_server_r: request: GET /props 10.1.0.25 200 srv log_server_r: request: GET /props 10.1.0.25 200 srv log_server_r: request: GET /props 10.1.0.25 200 srv log_server_r: request: GET /props 10.1.0.25 200 srv params_from_: Chat format: Content-only slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1 slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist slot launch_slot_: id 3 | task 0 | processing task slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 10240, n_keep = 0, task.n_tokens = 280 slot update_slots: id 3 | task 0 | n_tokens = 0, 
memory_seq_rm [0, end) slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 8, batch.n_tokens = 8, progress = 0.028571 slot update_slots: id 3 | task 0 | n_tokens = 8, memory_seq_rm [8, end) srv process_chun: processing image... encoding image slice... Segmentation fault (core dumped)

spf1983 avatar Nov 14 '25 08:11 spf1983