
Eval bug: When offloading to CPU after commit f77c13b using CUDA (multi-GPU), PP performance seems to be reduced by ~75% (CUDA: General GEMV fusion)

Open · Panchovix opened this issue 1 month ago · 37 comments

Name and Version

./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 3: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 4: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
  Device 5: NVIDIA A40, compute capability 8.6, VMM: yes
version: 6906 (0de0a0157)
built with cc (GCC) 15.2.1 20251022 (Red Hat 15.2.1-3) for x86_64-redhat-linux

Operating systems

Linux

GGML backends

CUDA

Hardware

Fedora 42, AMD Ryzen 9 9900X, 192 GB RAM, 2x RTX 5090, 2x RTX 4090, RTX A6000, A40

Models

DeepSeek-V3-0324, DeepSeek-R1-0528, DeepSeek-V3.1, DeepSeek-V3.1-Terminus

Problem description & steps to reproduce

I build llama.cpp with:


cmake -B build \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_BLAS=OFF \
  -DGGML_RPC=ON \
  -DCMAKE_CUDA_ARCHITECTURES="86;89;120" \
  -DGGML_MAX_CONTEXTS=2048

When partially offloading DeepSeek V3 0324 / R1 0528 / V3.1 models to CPU, on commit https://github.com/ggml-org/llama.cpp/commit/5d195f17bc60eacc15cfb929f9403cf29ccdf419, with:

LLAMA_SET_ROWS=1 ./llama-server -m '/Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf' -c 32768 -ngl 999 --no-mmap \
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \
-ot "blk.(8|9|10|11).ffn.=CUDA1" \
-ot "blk.(12|13|14|15).ffn.=CUDA2" \
-ot "blk.(16|17|18|19|20).ffn.=CUDA3" \
-ot "blk.(21|22|23).ffn.=CUDA4" \
-ot "blk.(24|25|26|27).ffn.=CUDA4" \
-ot "blk.(28|29|30|31|32|33|34).ffn.=CUDA5" \
-ot "exps=CPU" \
-fa on -mg 0 -ub 2560 -b 256

When loading, it looks like this:

load_tensors:        CUDA0 model buffer size = 25363.28 MiB
load_tensors:        CUDA1 model buffer size = 19841.07 MiB
load_tensors:        CUDA2 model buffer size = 19842.82 MiB
load_tensors:        CUDA3 model buffer size = 24357.64 MiB
load_tensors:        CUDA4 model buffer size = 34490.44 MiB
load_tensors:        CUDA5 model buffer size = 35639.92 MiB
load_tensors:          CPU model buffer size =   497.11 MiB
load_tensors:    CUDA_Host model buffer size = 122500.00 MiB

As you can see, the CPU model buffer is listed just after the GPU (CUDA) buffers and just before CUDA_Host. This gives me these speeds:

prompt eval time =   17797.43 ms /  4373 tokens (    4.07 ms per token,   245.71 tokens per second)
       eval time =   42683.82 ms /   453 tokens (   94.22 ms per token,    10.61 tokens per second)
      total time =   60481.25 ms /  4826 tokens

A variant of this that I saw while testing, and that also works fine, is:

load_tensors:        CUDA0 model buffer size = 25363.28 MiB
load_tensors:        CUDA1 model buffer size = 19841.07 MiB
load_tensors:        CUDA2 model buffer size = 19842.82 MiB
load_tensors:        CUDA3 model buffer size = 24357.64 MiB
load_tensors:        CUDA4 model buffer size = 34490.44 MiB
load_tensors:        CUDA5 model buffer size = 35639.92 MiB
load_tensors:    CUDA_Host model buffer size = 122500.00 MiB
load_tensors:          CPU model buffer size =   497.11 MiB

Meanwhile, after commit https://github.com/ggml-org/llama.cpp/commit/5d195f17bc60eacc15cfb929f9403cf29ccdf419 (I'm still not sure exactly which later commit causes the issue), it looks like this:

load_tensors:          CPU model buffer size =   497.11 MiB
load_tensors:        CUDA0 model buffer size = 25363.28 MiB
load_tensors:        CUDA1 model buffer size = 19841.07 MiB
load_tensors:        CUDA2 model buffer size = 19842.82 MiB
load_tensors:        CUDA3 model buffer size = 24357.64 MiB
load_tensors:        CUDA4 model buffer size = 34490.44 MiB
load_tensors:        CUDA5 model buffer size = 35639.92 MiB
load_tensors:    CUDA_Host model buffer size = 122500.00 MiB

This gives me these speeds:

prompt eval time =   49380.49 ms /  4373 tokens (   11.29 ms per token,    88.56 tokens per second)
       eval time =   50832.32 ms /   542 tokens (   93.79 ms per token,    10.66 tokens per second)

I have deleted ccache and not used it on each build, to avoid any extra caching issues.

For reference, ik_llama.cpp handles this via https://github.com/ikawrakow/ik_llama.cpp/pull/405, with this explanation:

When part of the tensors are stored in RAM but there are faster back-ends available (GPU), the scheduler needs to decide whether to offload the data for a given op to a faster back-end or to compute the op on the CPU. This is currently done via a simple heuristic where only matrix multiplications (GGML_MUL_MAT and GGML_MUL_MAT_ID) are offloaded if the batch size is larger than some threshold (currently 32). When fmoe is enabled, the fused (ffn_up*X)*unary(ffn_gate*X) op is never uploaded. In contrast, in mainline llama.cpp matrix multiplications are always offloaded when the batch size is >= 32. The result of this is that when the batch size becomes large enough, llama.cpp will outperform ik_llama.cpp in prompt processing speed. As "large enough" depends on many factors (size of tensors that need to be uploaded, speed of the PCI-E bus to the GPU, relative speed of the GPU vs the CPU), it is hard to devise a better offload policy that automatically takes the best decision.

So it seems that, for some reason, some matrix multiplications are now done on the CPU instead of the main CUDA device (CUDA0)?
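
To make the heuristic above concrete, here is a minimal sketch of the kind of check being described (illustrative C++ only, not the actual ggml-backend scheduler code; the function name and threshold constant are assumptions):

// Minimal sketch of the offload heuristic quoted above (illustrative only).
// Assumes ggml.h for the ggml_tensor / GGML_OP_* definitions.
// Expert weights live in host RAM; matrix multiplications are only offloaded
// to a faster (GPU) backend once the batch is large enough to amortize
// uploading the weights over PCIe.
static bool should_offload_matmul(const struct ggml_tensor * op, int n_batch_tokens) {
    const int min_batch = 32; // threshold mentioned in the quoted explanation

    const bool is_matmul = op->op == GGML_OP_MUL_MAT ||
                           op->op == GGML_OP_MUL_MAT_ID;

    return is_matmul && n_batch_tokens >= min_batch;
}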

First Bad Commit

I'm not sure where it started exactly, but https://github.com/ggml-org/llama.cpp/commit/5d195f17bc60eacc15cfb929f9403cf29ccdf419 works fine.

Relevant log output

./llama-server -m '/Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf' -c 32768 --no-mmap -ngl 999 \
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \
-ot "blk.(8|9|10|11).ffn.=CUDA1" \
-ot "blk.(12|13|14|15).ffn.=CUDA2" \
-ot "blk.(16|17|18|19|20).ffn.=CUDA3" \
-ot "blk.(21|22|23).ffn.=CUDA4" \
-ot "blk.(24|25|26|27).ffn.=CUDA4" \
-ot "blk.(28|29|30|31|32|33|34).ffn.=CUDA5" \
-ot "exps=CPU" \
-fa on -mg 0 -ub 2560 -b 2560 --cache-ram 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 3: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 4: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
  Device 5: NVIDIA A40, compute capability 8.6, VMM: yes
build: 6906 (0de0a0157) with cc (GCC) 15.2.1 20251022 (Red Hat 15.2.1-3) for x86_64-redhat-linux
system info: n_threads = 12, n_threads_batch = 12, total_threads = 24

system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CUDA : ARCHS = 860,890,1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 23
main: loading model
srv    load_model: loading model '/Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 31600 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) (0000:02:00.0) - 23686 MiB free
llama_model_load_from_file_impl: using device CUDA2 (NVIDIA GeForce RTX 4090) (0000:17:00.0) - 23675 MiB free
llama_model_load_from_file_impl: using device CUDA3 (NVIDIA GeForce RTX 5090) (0000:03:00.0) - 31600 MiB free
llama_model_load_from_file_impl: using device CUDA4 (NVIDIA RTX A6000) (0000:0d:00.0) - 48268 MiB free
llama_model_load_from_file_impl: using device CUDA5 (NVIDIA A40) (0000:06:00.0) - 48268 MiB free
llama_model_loader: loaded meta data with 64 key-value pairs and 1086 tensors from /Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Deepseek-V3-0324
llama_model_loader: - kv   3:                            general.version str              = V3-0324
llama_model_loader: - kv   4:                           general.basename str              = Deepseek-V3-0324
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 256x20B
llama_model_loader: - kv   7:                            general.license str              = mit
llama_model_loader: - kv   8:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = DeepSeek V3 0324
llama_model_loader: - kv  11:               general.base_model.0.version str              = V3-0324
llama_model_loader: - kv  12:          general.base_model.0.organization str              = Deepseek Ai
llama_model_loader: - kv  13:              general.base_model.0.repo_url str              = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv  14:                               general.tags arr[str,4]       = ["deepseek_v3", "deepseek", "unsloth"...
llama_model_loader: - kv  15:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  16:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv  17:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv  18:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv  19:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv  20:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv  21:          deepseek2.attention.head_count_kv u32              = 1
llama_model_loader: - kv  22:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  23: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  24:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  25:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  26:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  27:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  28:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  29:             deepseek2.attention.key_length u32              = 576
llama_model_loader: - kv  30:           deepseek2.attention.value_length u32              = 512
llama_model_loader: - kv  31:         deepseek2.attention.key_length_mla u32              = 192
llama_model_loader: - kv  32:       deepseek2.attention.value_length_mla u32              = 128
llama_model_loader: - kv  33:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  34:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  35:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  36:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  37:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  38:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  39:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  40:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  41:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  42: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  43: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  44:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  45:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  46:                      tokenizer.ggml.tokens arr[str,129280]  = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv  47:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  48:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv  49:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  50:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  51:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  52:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  53:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  54:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  55:               general.quantization_version u32              = 2
llama_model_loader: - kv  56:                          general.file_type u32              = 12
llama_model_loader: - kv  57:                      quantize.imatrix.file str              = DeepSeek-V3-0324-GGUF/imatrix_unsloth...
llama_model_loader: - kv  58:                   quantize.imatrix.dataset str              = unsloth_calibration_DeepSeek-V3-0324.txt
llama_model_loader: - kv  59:             quantize.imatrix.entries_count i32              = 720
llama_model_loader: - kv  60:              quantize.imatrix.chunks_count i32              = 60
llama_model_loader: - kv  61:                                   split.no u16              = 0
llama_model_loader: - kv  62:                        split.tensors.count i32              = 1086
llama_model_loader: - kv  63:                                split.count u16              = 0
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q8_0:  122 tensors
llama_model_loader: - type q3_K:  173 tensors
llama_model_loader: - type q4_K:  385 tensors
llama_model_loader: - type q5_K:   29 tensors
llama_model_loader: - type q6_K:   16 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q3_K - Medium
print_info: file size   = 275.91 GiB (3.53 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 1 ('<|end▁of▁sentence|>')
load: special tokens cache size = 818
load: token to piece cache size = 0.8223 MB
print_info: arch             = deepseek2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 163840
print_info: n_embd           = 7168
print_info: n_layer          = 61
print_info: n_head           = 128
print_info: n_head_kv        = 1
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 576
print_info: n_embd_head_v    = 512
print_info: n_gqa            = 128
print_info: n_embd_k_gqa     = 576
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 18432
print_info: n_expert         = 256
print_info: n_expert_used    = 8
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = yarn
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: model type       = 671B
print_info: model params     = 671.03 B
print_info: general.name     = Deepseek-V3-0324
print_info: n_layer_dense_lead   = 3
print_info: n_lora_q             = 1536
print_info: n_lora_kv            = 512
print_info: n_embd_head_k_mla    = 192
print_info: n_embd_head_v_mla    = 128
print_info: n_ff_exp             = 2048
print_info: n_expert_shared      = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm  = 1
print_info: expert_gating_func   = sigmoid
print_info: rope_yarn_log_mul    = 0.1000
print_info: vocab type       = BPE
print_info: n_vocab          = 129280
print_info: n_merges         = 127741
print_info: BOS token        = 0 '<|begin▁of▁sentence|>'
print_info: EOS token        = 1 '<|end▁of▁sentence|>'
print_info: EOT token        = 1 '<|end▁of▁sentence|>'
print_info: PAD token        = 2 '<|▁pad▁|>'
print_info: LF token         = 201 'Ċ'
print_info: FIM PRE token    = 128801 '<|fim▁begin|>'
print_info: FIM SUF token    = 128800 '<|fim▁hole|>'
print_info: FIM MID token    = 128802 '<|fim▁end|>'
print_info: EOG token        = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors:          CPU model buffer size =   497.11 MiB
load_tensors:        CUDA0 model buffer size = 25363.28 MiB
load_tensors:        CUDA1 model buffer size = 19841.07 MiB
load_tensors:        CUDA2 model buffer size = 19842.82 MiB
load_tensors:        CUDA3 model buffer size = 24357.64 MiB
load_tensors:        CUDA4 model buffer size = 34490.44 MiB
load_tensors:        CUDA5 model buffer size = 35639.92 MiB
load_tensors:    CUDA_Host model buffer size = 122500.00 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch       = 2560
llama_context: n_ubatch      = 2560
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 0.025
llama_context: n_ctx_per_seq (32768) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.49 MiB
llama_kv_cache:      CUDA0 KV buffer size =   680.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =   476.00 MiB
llama_kv_cache:      CUDA2 KV buffer size =   476.00 MiB
llama_kv_cache:      CUDA3 KV buffer size =   680.00 MiB
llama_kv_cache:      CUDA4 KV buffer size =   952.00 MiB
llama_kv_cache:      CUDA5 KV buffer size =   884.00 MiB
llama_kv_cache: size = 4148.00 MiB ( 32768 cells,  61 layers,  1/1 seqs), K (f16): 2196.00 MiB, V (f16): 1952.00 MiB
llama_context:      CUDA0 compute buffer size =  3628.50 MiB
llama_context:      CUDA1 compute buffer size =  2052.63 MiB
llama_context:      CUDA2 compute buffer size =  1995.05 MiB
llama_context:      CUDA3 compute buffer size =  1995.05 MiB
llama_context:      CUDA4 compute buffer size =  4848.52 MiB
llama_context:      CUDA5 compute buffer size =  4848.53 MiB
llama_context:  CUDA_Host compute buffer size =   390.07 MiB
llama_context: graph nodes  = 4843
llama_context: graph splits = 206 (with bs=2560), 154 (with bs=1)
common_init_from_params: added <|end▁of▁sentence|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 32768
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 32768
srv          init: prompt cache is enabled, size limit: 8192 MiB
srv          init: use `--cache-ram 0` to disable the prompt cache
srv          init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv          init: thinking = 0
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='', is_first_sp=true, is_last_user=false) %}{%- for message in messages %}{%- if message['role'] == 'system' %}{%- if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{%- else %}{% set ns.system_prompt = ns.system_prompt + '

' + message['content'] %}{%- endif %}{%- endif %}{%- endfor %}{{ bos_token }}{{ ns.system_prompt }}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{%- set ns.is_first = false -%}{%- set ns.is_last_user = true -%}{{'<|User|>' + message['content'] + '<|Assistant|>'}}{%- endif %}{%- if message['role'] == 'assistant' and message['tool_calls'] is defined and message['tool_calls'] is not none %}{%- set ns.is_last_user = false -%}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{%- endif %}{%- set ns.is_first = false %}{%- set ns.is_tool = false -%}{%- set ns.is_output_first = true %}{%- for tool in message['tool_calls'] %}{%- if not ns.is_first %}{%- if message['content'] is none %}{{'<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '
' + '```json' + '
' + tool['function']['arguments'] + '
' + '```' + '<|tool▁call▁end|>'}}{%- else %}{{message['content'] + '<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '
' + '```json' + '
' + tool['function']['arguments'] + '
' + '```' + '<|tool▁call▁end|>'}}{%- endif %}{%- set ns.is_first = true -%}{%- else %}{{'
' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '
' + '```json' + '
' + tool['function']['arguments'] + '
' + '```' + '<|tool▁call▁end|>'}}{%- endif %}{%- endfor %}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- if message['role'] == 'assistant' and (message['tool_calls'] is not defined or message['tool_calls'] is none)%}{%- set ns.is_last_user = false -%}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{{content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_last_user = false -%}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'
<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_last_user and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}, example_format: 'You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /v1/models 127.0.0.1 200
check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
common_sampler_types_from_names: unable to match sampler by name 'tfs_z'
common_sampler_types_from_names: unable to match sampler by name 'typical_p'
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 4373
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 2560, batch.n_tokens = 2560, progress = 0.585410
slot update_slots: id  0 | task 0 | n_tokens = 2560, memory_seq_rm [2560, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 4373, batch.n_tokens = 1813, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_tokens = 4373, batch.n_tokens = 1813
slot print_timing: id  0 | task 0 |
prompt eval time =   49380.49 ms /  4373 tokens (   11.29 ms per token,    88.56 tokens per second)
       eval time =   50832.32 ms /   542 tokens (   93.79 ms per token,    10.66 tokens per second)
      total time =  100212.80 ms /  4915 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 4914, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /completion 127.0.0.1 200
srv  log_server_r: request: POST /tokenize 127.0.0.1 200
^Csrv    operator(): operator(): cleaning up before exit...
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free      self    model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32109 = 1426 + ( 29671 =  25363 +     680 +    3628) +        1010 |
llama_memory_breakdown_print: |   - CUDA1 (RTX 4090)   | 24080 =  806 + ( 22369 =  19841 +     476 +    2052) +         905 |
llama_memory_breakdown_print: |   - CUDA2 (RTX 4090)   | 24077 =  851 + ( 22313 =  19842 +     476 +    1995) +         913 |
llama_memory_breakdown_print: |   - CUDA3 (RTX 5090)   | 32109 = 4034 + ( 27032 =  24357 +     680 +    1995) +        1042 |
llama_memory_breakdown_print: |   - CUDA4 (RTX A6000)  | 48539 = 2498 + ( 40290 =  34490 +     952 +    4848) +        5749 |
llama_memory_breakdown_print: |   - CUDA5 (A40)        | 48539 = 1418 + ( 41372 =  35639 +     884 +    4848) +        5748 |
llama_memory_breakdown_print: |   - Host               |                 123387 = 122997 +       0 +     390                |

Panchovix · Nov 01 '25 03:11

For reference, when using commit https://github.com/ggml-org/llama.cpp/commit/5d195f17bc60eacc15cfb929f9403cf29ccdf419, the relevant log output looks like this (and with correct speeds):

./llama-server -m '/Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf' -c 32768 -ngl 999 --no-mmap \
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \
-ot "blk.(8|9|10|11).ffn.=CUDA1" \
-ot "blk.(12|13|14|15).ffn.=CUDA2" \
-ot "blk.(16|17|18|19|20).ffn.=CUDA3" \
-ot "blk.(21|22|23).ffn.=CUDA4" \
-ot "blk.(24|25|26|27).ffn.=CUDA4" \
-ot "blk.(28|29|30|31|32|33|34).ffn.=CUDA5" \
-ot "exps=CPU" \
-fa on -mg 0 -ub 2560 -b 2560
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 3: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 4: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
  Device 5: NVIDIA A40, compute capability 8.6, VMM: yes
build: 6839 (5d195f17b) with cc (GCC) 15.2.1 20251022 (Red Hat 15.2.1-3) for x86_64-redhat-linux
system info: n_threads = 12, n_threads_batch = 12, total_threads = 24

system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CUDA : ARCHS = 860,890,1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 23
main: loading model
srv    load_model: loading model '/Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 31600 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) (0000:02:00.0) - 23686 MiB free
llama_model_load_from_file_impl: using device CUDA2 (NVIDIA GeForce RTX 4090) (0000:17:00.0) - 23675 MiB free
llama_model_load_from_file_impl: using device CUDA3 (NVIDIA GeForce RTX 5090) (0000:03:00.0) - 31600 MiB free
llama_model_load_from_file_impl: using device CUDA4 (NVIDIA RTX A6000) (0000:0d:00.0) - 48268 MiB free
llama_model_load_from_file_impl: using device CUDA5 (NVIDIA A40) (0000:06:00.0) - 48268 MiB free
llama_model_loader: loaded meta data with 64 key-value pairs and 1086 tensors from /Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Deepseek-V3-0324
llama_model_loader: - kv   3:                            general.version str              = V3-0324
llama_model_loader: - kv   4:                           general.basename str              = Deepseek-V3-0324
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 256x20B
llama_model_loader: - kv   7:                            general.license str              = mit
llama_model_loader: - kv   8:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = DeepSeek V3 0324
llama_model_loader: - kv  11:               general.base_model.0.version str              = V3-0324
llama_model_loader: - kv  12:          general.base_model.0.organization str              = Deepseek Ai
llama_model_loader: - kv  13:              general.base_model.0.repo_url str              = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv  14:                               general.tags arr[str,4]       = ["deepseek_v3", "deepseek", "unsloth"...
llama_model_loader: - kv  15:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  16:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv  17:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv  18:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv  19:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv  20:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv  21:          deepseek2.attention.head_count_kv u32              = 1
llama_model_loader: - kv  22:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  23: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  24:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  25:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  26:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  27:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  28:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  29:             deepseek2.attention.key_length u32              = 576
llama_model_loader: - kv  30:           deepseek2.attention.value_length u32              = 512
llama_model_loader: - kv  31:         deepseek2.attention.key_length_mla u32              = 192
llama_model_loader: - kv  32:       deepseek2.attention.value_length_mla u32              = 128
llama_model_loader: - kv  33:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  34:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  35:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  36:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  37:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  38:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  39:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  40:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  41:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  42: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  43: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  44:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  45:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  46:                      tokenizer.ggml.tokens arr[str,129280]  = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv  47:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  48:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv  49:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  50:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  51:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  52:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  53:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  54:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  55:               general.quantization_version u32              = 2
llama_model_loader: - kv  56:                          general.file_type u32              = 12
llama_model_loader: - kv  57:                      quantize.imatrix.file str              = DeepSeek-V3-0324-GGUF/imatrix_unsloth...
llama_model_loader: - kv  58:                   quantize.imatrix.dataset str              = unsloth_calibration_DeepSeek-V3-0324.txt
llama_model_loader: - kv  59:             quantize.imatrix.entries_count i32              = 720
llama_model_loader: - kv  60:              quantize.imatrix.chunks_count i32              = 60
llama_model_loader: - kv  61:                                   split.no u16              = 0
llama_model_loader: - kv  62:                        split.tensors.count i32              = 1086
llama_model_loader: - kv  63:                                split.count u16              = 0
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q8_0:  122 tensors
llama_model_loader: - type q3_K:  173 tensors
llama_model_loader: - type q4_K:  385 tensors
llama_model_loader: - type q5_K:   29 tensors
llama_model_loader: - type q6_K:   16 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q3_K - Medium
print_info: file size   = 275.91 GiB (3.53 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 1 ('<|end▁of▁sentence|>')
load: special tokens cache size = 818
load: token to piece cache size = 0.8223 MB
print_info: arch             = deepseek2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 163840
print_info: n_embd           = 7168
print_info: n_layer          = 61
print_info: n_head           = 128
print_info: n_head_kv        = 1
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 576
print_info: n_embd_head_v    = 512
print_info: n_gqa            = 128
print_info: n_embd_k_gqa     = 576
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 18432
print_info: n_expert         = 256
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = yarn
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: model type       = 671B
print_info: model params     = 671.03 B
print_info: general.name     = Deepseek-V3-0324
print_info: n_layer_dense_lead   = 3
print_info: n_lora_q             = 1536
print_info: n_lora_kv            = 512
print_info: n_embd_head_k_mla    = 192
print_info: n_embd_head_v_mla    = 128
print_info: n_ff_exp             = 2048
print_info: n_expert_shared      = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm  = 1
print_info: expert_gating_func   = sigmoid
print_info: rope_yarn_log_mul    = 0.1000
print_info: vocab type       = BPE
print_info: n_vocab          = 129280
print_info: n_merges         = 127741
print_info: BOS token        = 0 '<|begin▁of▁sentence|>'
print_info: EOS token        = 1 '<|end▁of▁sentence|>'
print_info: EOT token        = 1 '<|end▁of▁sentence|>'
print_info: PAD token        = 2 '<|▁pad▁|>'
print_info: LF token         = 201 'Ċ'
print_info: FIM PRE token    = 128801 '<|fim▁begin|>'
print_info: FIM SUF token    = 128800 '<|fim▁hole|>'
print_info: FIM MID token    = 128802 '<|fim▁end|>'
print_info: EOG token        = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors:        CUDA0 model buffer size = 25363.28 MiB
load_tensors:        CUDA1 model buffer size = 19841.07 MiB
load_tensors:        CUDA2 model buffer size = 19842.82 MiB
load_tensors:        CUDA3 model buffer size = 24357.64 MiB
load_tensors:        CUDA4 model buffer size = 34490.44 MiB
load_tensors:        CUDA5 model buffer size = 35639.92 MiB
load_tensors:          CPU model buffer size =   497.11 MiB
load_tensors:    CUDA_Host model buffer size = 122500.00 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch       = 2560
llama_context: n_ubatch      = 2560
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 0.025
llama_context: n_ctx_per_seq (32768) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.49 MiB
llama_kv_cache:      CUDA0 KV buffer size =   680.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =   476.00 MiB
llama_kv_cache:      CUDA2 KV buffer size =   476.00 MiB
llama_kv_cache:      CUDA3 KV buffer size =   680.00 MiB
llama_kv_cache:      CUDA4 KV buffer size =   952.00 MiB
llama_kv_cache:      CUDA5 KV buffer size =   884.00 MiB
llama_kv_cache: size = 4148.00 MiB ( 32768 cells,  61 layers,  1/1 seqs), K (f16): 2196.00 MiB, V (f16): 1952.00 MiB
llama_context:      CUDA0 compute buffer size =  3628.50 MiB
llama_context:      CUDA1 compute buffer size =  2052.63 MiB
llama_context:      CUDA2 compute buffer size =  1995.05 MiB
llama_context:      CUDA3 compute buffer size =  1995.05 MiB
llama_context:      CUDA4 compute buffer size =  2050.05 MiB
llama_context:      CUDA5 compute buffer size =  2050.06 MiB
llama_context:  CUDA_Host compute buffer size =   390.07 MiB
llama_context: graph nodes  = 4785
llama_context: graph splits = 206 (with bs=2560), 154 (with bs=1)
common_init_from_params: added <|end▁of▁sentence|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 32768
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 32768
srv          init: prompt cache is enabled, size limit: 8192 MiB
srv          init: use `--cache-ram 0` to disable the prompt cache
srv          init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv          init: thinking = 0
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='', is_first_sp=true, is_last_user=false) %}{%- for message in messages %}{%- if message['role'] == 'system' %}{%- if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{%- else %}{% set ns.system_prompt = ns.system_prompt + '

' + message['content'] %}{%- endif %}{%- endif %}{%- endfor %}{{ bos_token }}{{ ns.system_prompt }}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{%- set ns.is_first = false -%}{%- set ns.is_last_user = true -%}{{'<|User|>' + message['content'] + '<|Assistant|>'}}{%- endif %}{%- if message['role'] == 'assistant' and message['tool_calls'] is defined and message['tool_calls'] is not none %}{%- set ns.is_last_user = false -%}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{%- endif %}{%- set ns.is_first = false %}{%- set ns.is_tool = false -%}{%- set ns.is_output_first = true %}{%- for tool in message['tool_calls'] %}{%- if not ns.is_first %}{%- if message['content'] is none %}{{'<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '
' + '```json' + '
' + tool['function']['arguments'] + '
' + '```' + '<|tool▁call▁end|>'}}{%- else %}{{message['content'] + '<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '
' + '```json' + '
' + tool['function']['arguments'] + '
' + '```' + '<|tool▁call▁end|>'}}{%- endif %}{%- set ns.is_first = true -%}{%- else %}{{'
' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '
' + '```json' + '
' + tool['function']['arguments'] + '
' + '```' + '<|tool▁call▁end|>'}}{%- endif %}{%- endfor %}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- if message['role'] == 'assistant' and (message['tool_calls'] is not defined or message['tool_calls'] is none)%}{%- set ns.is_last_user = false -%}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{{content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_last_user = false -%}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'
<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_last_user and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}, example_format: 'You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
common_sampler_types_from_names: unable to match sampler by name 'tfs_z'
common_sampler_types_from_names: unable to match sampler by name 'typical_p'
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 4373
slot update_slots: id  0 | task 0 | n_past = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2560, n_tokens = 2560, progress = 0.585410
slot update_slots: id  0 | task 0 | n_past = 2560, memory_seq_rm [2560, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 4373, n_tokens = 1813, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 4373, n_tokens = 1813
slot print_timing: id  0 | task 0 |
prompt eval time =   17807.96 ms /  4373 tokens (    4.07 ms per token,   245.56 tokens per second)
       eval time =   43334.85 ms /   441 tokens (   98.26 ms per token,    10.18 tokens per second)
      total time =   61142.81 ms /  4814 tokens
slot      release: id  0 | task 0 | stop processing: n_past = 4813, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /completion 127.0.0.1 200
srv  log_server_r: request: POST /tokenize 127.0.0.1 200
^Csrv    operator(): operator(): cleaning up before exit...
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free      self    model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32109 = 1426 + ( 29671 =  25363 +     680 +    3628) +        1010 |
llama_memory_breakdown_print: |   - CUDA1 (RTX 4090)   | 24080 =  806 + ( 22369 =  19841 +     476 +    2052) +         905 |
llama_memory_breakdown_print: |   - CUDA2 (RTX 4090)   | 24077 =  851 + ( 22313 =  19842 +     476 +    1995) +         913 |
llama_memory_breakdown_print: |   - CUDA3 (RTX 5090)   | 32109 = 4034 + ( 27032 =  24357 +     680 +    1995) +        1042 |
llama_memory_breakdown_print: |   - CUDA4 (RTX A6000)  | 48539 = 5296 + ( 37492 =  34490 +     952 +    2050) +        5750 |
llama_memory_breakdown_print: |   - CUDA5 (A40)        | 48539 = 4216 + ( 38573 =  35639 +     884 +    2050) +        5748 |
llama_memory_breakdown_print: |   - Host               |                 123387 = 122997 +       0 +     390                |

Panchovix · Nov 01 '25 03:11

I think it might be #16715, but I'm not sure how the fusion would affect offloading. @slaren can you help?

am17an · Nov 01 '25 06:11

@Panchovix can you confirm if the problem goes away with GGML_CUDA_DISABLE_FUSION=1?

am17an · Nov 01 '25 06:11

Yeah, I've noticed something similar, as I've got a custom hack in the CUDA backend that patches the "batch size >= 32" threshold to a value I can read in from an environment variable.
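
Roughly, the hack boils down to something like the sketch below (illustrative only; GGML_CUDA_MIN_BATCH_OFFLOAD is a made-up env var name, not an actual llama.cpp/ggml option):

#include <cstdlib>

// Sketch only: replace the hard-coded ">= 32" offload threshold with a value
// that can be overridden at runtime through an environment variable.
static int cuda_min_batch_offload() {
    static const int threshold = [] {
        const char * env = std::getenv("GGML_CUDA_MIN_BATCH_OFFLOAD");
        return env ? std::atoi(env) : 32; // default matches upstream behaviour
    }();
    return threshold;
}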

I noticed last week that the break-even value for DeepSeek went from 1800-1900 to ~2700, and for Kimi-K2 from 2700-2800 to ~4000.

I'll try the GGML_CUDA_DISABLE_FUSION setting and report back whether it brings the break-even value back down.

jukofyork · Nov 01 '25 10:11

~One other weird thing I noticed was this:~

BATCH_SIZE=8192
PP_SIZE=2048

./llama-bench \
	--model "$MODEL_PATH" \
	--batch-size $BATCH_SIZE \
	--ubatch-size $BATCH_SIZE \
	--n-gpu-layers 99 \
	--flash-attn 1 \
	--numa distribute \
	--threads $(nproc) \
	--override-tensor exps=CPU \
	--n-prompt $PP_SIZE \
	--n-gen 0 \
	--no-op-offload 1,0

~If you run this and change PP_SIZE to be a multiple of 512 then you get way better PP speed - @Panchovix can you test something similar for your setup and see if this is the case for you too? This might help narrow down the problem.~

~This wasn't the case before, as I was using PP_SIZE = 1800, etc before.~

This isn't actually related to this issue, as I just realised it happens for the non-offloaded run too - please ignore!

jukofyork · Nov 01 '25 10:11

I would need a simpler way to reproduce this (e.g. single GPU, no -ot). If that's not possible, then you can try dumping the graph splits with GGML_SCHED_DEBUG=2 and try to figure out what has changed between the two versions.

slaren · Nov 01 '25 10:11

Hello, sorry for the delay, I was not home.

Pardon my ignorance, as I don't know whether the GGML variables have to be set at compile time or used as environment variables.

First I tried @am17an's suggestion and compiled with -DGGML_CUDA_DISABLE_FUSION=1, but the issue persists.

I also ran with GGML_CUDA_DISABLE_FUSION=1 when loading the model, and the issue persists.

I noticed that, for some reason, on the latest commits part of the prompt processing seems to be done on the A40/A6000 GPUs, as shown by their RX traffic in nvtop, limited by their PCIe 4.0 x4 links.

Image

It goes A6000 -> A40 -> 5090.

While on commit https://github.com/ggml-org/llama.cpp/commit/5d195f17bc60eacc15cfb929f9403cf29ccdf419, it seems to be done mostly on CUDA0, at the limit of its PCIe 5.0 x8 link.

Image

It goes directly to the 5090.

@slaren how exactly would I use GGML_SCHED_DEBUG=2? I tried both -DGGML_SCHED_DEBUG=2 when compiling and GGML_SCHED_DEBUG=2 as an env variable, but I don't see any difference in the output. Would I have to use -v?

Panchovix · Nov 01 '25 16:11

Okay, yes, I had to use -v.

I attach the outputs when using GGML_SCHED_DEBUG=2. The output is gigantic.

Latest commit as of yesterday with the issue (https://github.com/ggml-org/llama.cpp/commit/0de0a01576772032008a689afc4d7c80685074c4)

bad_commit.txt

And https://github.com/ggml-org/llama.cpp/commit/5d195f17bc60eacc15cfb929f9403cf29ccdf419 where it works correctly.

good_commit.txt

Panchovix · Nov 01 '25 16:11

I did a few more tests, in case it helps. I'm now using 2x RTX 3090 instead of the A40, but the issue persists.

I tried PR https://github.com/ggml-org/llama.cpp/pull/16935 (CUDA: avoid mul + bias fusion when buffers are split), but the issue persists.

Then, also on the latest commit, using -fa auto, it gives this message:

llama_context: layer 0 is assigned to device CUDA0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
llama_context: Flash Attention was auto, set to disabled

While on https://github.com/ggml-org/llama.cpp/commit/5d195f17bc60eacc15cfb929f9403cf29ccdf419, it gave:

llama_kv_cache: size = 3046.19 MiB ( 24064 cells,  61 layers,  1/1 seqs), K (f16): 1612.69 MiB, V (f16): 1433.50 MiB
llama_context: Flash Attention was auto, set to enabled

So maybe it is related to that?

Panchovix · Nov 02 '25 16:11

First of all, sorry for changing the title so many times, but I finally found the commit.

After doing more tests, I can confirm that "CUDA: General GEMV fusion" is where the issue starts (commit f77c13b91f4d25754b6a0b857f98a6bc922a0aa7).

Now, there is another commit that I thought was related, the one that puts the CPU model buffer first: commit 7a0e900.

First I tried reverting the buffer-ordering commit (7a0e900) and building with -DGGML_CUDA_DISABLE_FUSION=1, but sadly it didn't work.

Then I went to the buffer-ordering commit (7a0e900), reverted "CUDA: add unused vars to mmvf and mmvq" (463bbf2) and then "CUDA: General GEMV fusion" (f77c13b), and built normally (without -DGGML_CUDA_DISABLE_FUSION) to see whether the commits are related. It looks like this when loading:

load_tensors:          CPU model buffer size =   497.11 MiB
load_tensors:        CUDA0 model buffer size = 25363.28 MiB
load_tensors:        CUDA1 model buffer size = 19841.07 MiB
load_tensors:        CUDA2 model buffer size = 19842.82 MiB
load_tensors:        CUDA3 model buffer size = 24357.64 MiB
load_tensors:        CUDA4 model buffer size = 34490.44 MiB
load_tensors:        CUDA5 model buffer size = 35639.92 MiB
load_tensors:    CUDA_Host model buffer size = 122500.00 MiB

(CPU model buffer first), and here it works fine!

prompt eval time =   17781.40 ms /  4373 tokens (    4.07 ms per token,   245.93 tokens per second)
       eval time =   40457.04 ms /   427 tokens (   94.75 ms per token,    10.55 tokens per second)

Then, at the end, on the latest master commit (7e99416), I reverted in this order: first "CUDA: add expert reduce kernel" (4146d6a), then "CUDA: add unused vars to mmvf and mmvq" (463bbf2), and then "CUDA: General GEMV fusion" (f77c13b91f4d25754b6a0b857f98a6bc922a0aa7) (when resolving conflicts, I kept incoming instead of current). I built normally (without -DGGML_CUDA_DISABLE_FUSION) and here it also works correctly!

Also, this doesn't happen with a single GPU plus CPU offloading (tested on DeepSeek V2), so it seems to be a multi-GPU bug.

@am17an, @slaren and @JohannesGaessler, sorry for pinging you, but do you have an idea of what could be causing this? Maybe a way to disable General GEMV fusion would work as well.

I can do any test tomorrow if needed, as it is late here in Chile.

Panchovix · Nov 03 '25 02:11

You need to set GGML_CUDA_DISABLE_FUSION=1 as an environment variable at runtime; it's not a build-time option.

TinyServal · Nov 03 '25 02:11

Oops, I will try that again tomorrow morning. But when I tried it as an env variable some days ago in https://github.com/ggml-org/llama.cpp/issues/16912#issuecomment-3476531816, it sadly didn't work either.

Panchovix · Nov 03 '25 02:11

@Panchovix if it's indeed GEMV fusion it gets disabled with GGML_CUDA_DISABLE_FUSION=1. You need to make sure it's an env variable that's accessible to your binary. e.g. GGML_CUDA_DISABLE_FUSION=1 <your command> will work. Also please share steps to reproduce, I have a machine with multiple GPUs so I can test

am17an · Nov 03 '25 03:11

@Panchovix instead of manually reverting commits on top of master, please do a git bisect and identify the exact, unmodified master commit that introduced the issue so that devs can use it for reproduction.

JohannesGaessler · Nov 03 '25 07:11

@Panchovix if it's indeed GEMV fusion it gets disabled with GGML_CUDA_DISABLE_FUSION=1. You need to make sure it's an env variable that's accessible to your binary. e.g. GGML_CUDA_DISABLE_FUSION=1 <your command> will work. Also please share steps to reproduce, I have a machine with multiple GPUs so I can test

@am17an I tried to launch it now with:

GGML_CUDA_DISABLE_FUSION=1 ./llama-server -m '/Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf' -c 32768 -ngl 999 --no-mmap \
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \
-ot "blk.(8|9|10|11).ffn.=CUDA1" \
-ot "blk.(12|13|14|15).ffn.=CUDA2" \
-ot "blk.(16|17|18|19|20).ffn.=CUDA3" \
-ot "blk.(21|22|23).ffn.=CUDA4" \
-ot "blk.(24|25|26|27).ffn.=CUDA4" \
-ot "blk.(28|29|30|31|32|33|34).ffn.=CUDA5" \
-ot "exps=CPU" \
-mg 0 -ub 2560 -b 2560

But the issue persists.

To reproduce: basically use 2 or more GPUs, with -ot overrides per GPU and the rest of the expert layers set to -ot "exps=CPU". Then I go to the llama-server UI, paste a prompt of about 4096 tokens, and check the speed.

@JohannesGaessler I did the git bisect this way (commit 7e99416 is the one before the General GEMV fusion):

git bisect start
git bisect bad 7e99416
git bisect good 3cfa9c3
rm -r ylenuxtesting
cmake -B ylenuxtesting \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_BLAS=OFF \
  -DGGML_RPC=ON \
  -DCMAKE_CUDA_ARCHITECTURES="86;89;120" \
  -DGGML_MAX_CONTEXTS=2048 \
  -DGGML_SCHED_MAX_COPIES=1 \
cmake --build ylenuxtesting --config Release -j 11
GML_CUDA_DISABLE_FUSION=1 ./llama-server ...
git bisect bad
repeat building and running model with command above
git bisect bad
repeat building and running model with command above
git bisect bad
repeat building and running model with command above
git bisect bad
repeat building and running model with command above
git bisect bad
repeat building and running model with command above
git bisect bad
f77c13b91f4d25754b6a0b857f98a6bc922a0aa7 is the first bad commit
commit f77c13b91f4d25754b6a0b857f98a6bc922a0aa7 (HEAD, tag: b6841)
Author: Aman Gupta <[email protected]>
Date:   Sun Oct 26 19:28:04 2025 +0800

    CUDA: General GEMV fusion (#16715)

 ggml/src/ggml-cuda/common.cuh   |  13 +++++
 ggml/src/ggml-cuda/convert.cuh  |   1 +
 ggml/src/ggml-cuda/ggml-cuda.cu | 353 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 ggml/src/ggml-cuda/mmvf.cu      | 374 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------
 ggml/src/ggml-cuda/mmvf.cuh     |   3 +-
 ggml/src/ggml-cuda/mmvq.cu      | 314 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------------
 ggml/src/ggml-cuda/mmvq.cuh     |   2 +-
 ggml/src/ggml-cuda/unary.cu     |  14 +----
 ggml/src/ggml-cuda/unary.cuh    |  21 +++++++
 src/llama-graph.cpp             |   6 ++
 tests/test-backend-ops.cpp      | 161 +++++++++++++++++++++++++++++++++++++++++++++++++++
 11 files changed, 1096 insertions(+), 166 deletions(-)
repeat building and running model with command above
works fine
git bisect reset
Previous HEAD position was f77c13b91 CUDA: General GEMV fusion (#16715)
HEAD is now at 7e994168b SYCL: optimized repeat_back kernel (3× fewer asm instructions, 2× faster)Feature/sycl repeat back opt (#16869)

Not sure if I did that bisect correctly.

Panchovix avatar Nov 03 '25 14:11 Panchovix

@Panchovix in your command I see GML_CUDA_DISABLE_FUSION=1, is that a typo?

am17an avatar Nov 03 '25 14:11 am17an

@Panchovix in your command I see GML_CUDA_DISABLE_FUSION=1, is that a typo?

It is a typo from when I copy-pasted, but I executed it as it looks in this image (with the complete GGML prefix):

Image

Updated the command there in the comment, my bad.

Panchovix avatar Nov 03 '25 14:11 Panchovix

What I don't understand is that the GEMV fusion path is gated behind that env flag; the only other change is to llama-graph.cpp, which expands ffn_gate and up, and potentially re-orders the graph. Can you take a look at whether reverting this change in llama-graph.cpp fixes your issue?

https://github.com/ggml-org/llama.cpp/pull/16715/files#diff-9be9eea14f4aefce7375482c05968900192634e88e92ac263cedb955a64ad7fe

am17an avatar Nov 03 '25 14:11 am17an

What I don't understand is that the GEMV fusion path is gated behind that env flag; the only other change is to llama-graph.cpp, which expands ffn_gate and up, and potentially re-orders the graph. Can you take a look at whether reverting this change in llama-graph.cpp fixes your issue?

https://github.com/ggml-org/llama.cpp/pull/16715/files#diff-9be9eea14f4aefce7375482c05968900192634e88e92ac263cedb955a64ad7fe

@am17an That did it! I commented out the two "ggml_build_forward_expand(gf, cur);" calls in src/llama-graph.cpp and it now runs at full speed.
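For anyone who wants to try the same workaround: the line numbers move between commits, so it is easier to grep for the calls and cross-check against the PR diff linked above (only the two calls added by #16715, around the ffn gate/up expansion, are the ones to comment out):

grep -n "ggml_build_forward_expand(gf, cur);" src/llama-graph.cpp
# comment out the two occurrences added by the PR, then rebuild:
cmake --build ylenuxtesting --config Release -j 11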

rebuilt with those lines commented out, then
./ylenuxtesting/bin/llama-server -m '/run/media/pancho/Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf' -c 32768 -ngl 999 --no-mmap \
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \
-ot "blk.(8|9|10|11).ffn.=CUDA1" \
-ot "blk.(12|13|14|15).ffn.=CUDA2" \
-ot "blk.(16|17|18|19|20).ffn.=CUDA3" \
-ot "blk.(21|22|23).ffn.=CUDA4" \
-ot "blk.(24|25|26|27).ffn.=CUDA4" \
-ot "blk.(28|29|30|31|32|33|34).ffn.=CUDA5" \
-ot "exps=CPU" \
-mg 0 -ub 2560 -b 2560
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 3: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 4: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
  Device 5: NVIDIA A40, compute capability 8.6, VMM: yes
main: setting n_parallel = 4 and kv_unified = true
build: 6931 (7e994168b) with cc (GCC) 15.2.1 20251022 (Red Hat 15.2.1-3) for x86_64-redhat-linux
system info: n_threads = 12, n_threads_batch = 12, total_threads = 24

system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CUDA : ARCHS = 860,890,1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
...
slot launch_slot_: id  3 | task 0 | processing task
slot update_slots: id  3 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 4373
slot update_slots: id  3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 2560, batch.n_tokens = 2560, progress = 0.585410
slot update_slots: id  3 | task 0 | n_tokens = 2560, memory_seq_rm [2560, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 4373, batch.n_tokens = 1813, progress = 1.000000
slot update_slots: id  3 | task 0 | prompt done, n_tokens = 4373, batch.n_tokens = 1813
slot print_timing: id  3 | task 0 |
prompt eval time =   17690.04 ms /  4373 tokens (    4.05 ms per token,   247.20 tokens per second)
       eval time =   44615.35 ms /   477 tokens (   93.53 ms per token,    10.69 tokens per second)
      total time =   62305.39 ms /  4850 tokens
slot      release: id  3 | task 0 | stop processing: n_tokens = 4849, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /completion 127.0.0.1 200
srv  log_server_r: request: POST /tokenize 127.0.0.1 200
^Csrv    operator(): operator(): cleaning up before exit...
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free      self    model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32109 = 1426 + ( 29671 =  25363 +     680 +    3628) +        1010 |
llama_memory_breakdown_print: |   - CUDA1 (RTX 4090)   | 24080 =  806 + ( 22369 =  19841 +     476 +    2052) +         905 |
llama_memory_breakdown_print: |   - CUDA2 (RTX 4090)   | 24077 =  851 + ( 22313 =  19842 +     476 +    1995) +         913 |
llama_memory_breakdown_print: |   - CUDA3 (RTX 5090)   | 32109 = 4034 + ( 27032 =  24357 +     680 +    1995) +        1042 |
llama_memory_breakdown_print: |   - CUDA4 (RTX A6000)  | 48539 = 5296 + ( 37492 =  34490 +     952 +    2050) +        5750 |
llama_memory_breakdown_print: |   - CUDA5 (A40)        | 48539 = 4216 + ( 38573 =  35639 +     884 +    2050) +        5748 |
llama_memory_breakdown_print: |   - Host               |                 123387 = 122997 +       0 +     390                |

Panchovix avatar Nov 03 '25 15:11 Panchovix

Great! @slaren I don't know exactly what happened here, but the TL;DR is that graph_compute_expand causes some non-trivial re-ordering of nodes in the --ot case, which leads to this performance drop (nothing to do with fusion).

am17an avatar Nov 03 '25 15:11 am17an

The order of the nodes can affect the number of splits, and increase the amount of data that needs to be transferred between devices. You can use GGML_SCHED_DEBUG=2 to inspect the splits and maybe try to find an order that works better.
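For example, something along these lines (redirecting stderr so the dump can be inspected afterwards; the exact wording of the split markers may differ between versions):

GGML_SCHED_DEBUG=2 ./llama-server -m model.gguf -ngl 999 ... 2> sched.log
grep -c "SPLIT" sched.log        # rough count of graph splits
grep "inputs" sched.log | head   # splits that copy inputs between devices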

slaren avatar Nov 03 '25 15:11 slaren

So the solution is to do this manual fix, right? At least until something changes there.

Panchovix avatar Nov 04 '25 16:11 Panchovix

@Panchovix Did you use GGML_CUDA_DISABLE_FUSION=1 at the same time as commenting out those two lines?

jukofyork avatar Nov 04 '25 20:11 jukofyork

@jukofyork I did not, it worked "out of the box" when commenting out those lines.

Panchovix avatar Nov 04 '25 20:11 Panchovix

I don't think this should be closed that easily. I built 66d8eccd42b5b5b2179c60a6d41376d3917f3b40 (latest) and 3cfa9c3f125763305b4226bc032f1954f08990dc (the commit before GEMV fusion) and compared my different multi-GPU setups. The latest build is always slower in pp (latest vs pre-fusion, t/s):

898.23 ± 0.12 vs 1197.69 ± 0.5
378.41 ± 0.85 vs 418.54 ± 0.52
614.36 ± 1.71 vs 800.91 ± 1.51

I didn't check whether it's caused by the lines mentioned above, but since the problem persists in every setup I tried, I think something is definitely wrong here.

Also, I ran into another problem with the latest build. I make two unrelated requests using chat completion, and the second request is much slower in both pp and tg on the latest build. It's probably connected with https://github.com/ggml-org/llama.cpp/pull/16736, but I'm not sure. Here are the logs:

Latest build:

slot launch_slot_: id  3 | task 0 | processing task
slot update_slots: id  3 | task 0 | new prompt, n_ctx_slot = 32000, n_keep = 0, task.n_tokens = 9088
slot update_slots: id  3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.225352
slot update_slots: id  3 | task 0 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.450704
slot update_slots: id  3 | task 0 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.676056
slot update_slots: id  3 | task 0 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.901408
slot update_slots: id  3 | task 0 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 9088, batch.n_tokens = 896, progress = 1.000000
slot update_slots: id  3 | task 0 | prompt done, n_tokens = 9088, batch.n_tokens = 896
slot print_timing: id  3 | task 0 | 
prompt eval time =   55852.28 ms /  9088 tokens (    6.15 ms per token,   162.71 tokens per second)
       eval time =   46206.06 ms /   332 tokens (  139.17 ms per token,     7.19 tokens per second)
      total time =  102058.34 ms /  9420 tokens
slot      release: id  3 | task 0 | stop processing: n_tokens = 9419, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /completion 192.168.XXX.XXX 200
srv  log_server_r: request: GET /v1/models 192.168.XXX.XXX 200
srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  2 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  2 | task 337 | processing task
slot update_slots: id  2 | task 337 | new prompt, n_ctx_slot = 32000, n_keep = 0, task.n_tokens = 2053
slot update_slots: id  2 | task 337 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  2 | task 337 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.997565
slot update_slots: id  2 | task 337 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id  2 | task 337 | prompt processing progress, n_tokens = 2053, batch.n_tokens = 5, progress = 1.000000
slot update_slots: id  2 | task 337 | prompt done, n_tokens = 2053, batch.n_tokens = 5
slot print_timing: id  2 | task 337 | 
prompt eval time =   17540.73 ms /  2053 tokens (    8.54 ms per token,   117.04 tokens per second)
       eval time =   22225.87 ms /   156 tokens (  142.47 ms per token,     7.02 tokens per second)
      total time =   39766.61 ms /  2209 tokens
slot      release: id  2 | task 337 | stop processing: n_tokens = 2208, truncated = 0

Build before GEMV fusion:

slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 32000, n_keep = 0, n_prompt_tokens = 9088
slot update_slots: id  0 | task 0 | n_past = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.225352
slot update_slots: id  0 | task 0 | n_past = 2048, memory_seq_rm [2048, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 4096, n_tokens = 2048, progress = 0.450704
slot update_slots: id  0 | task 0 | n_past = 4096, memory_seq_rm [4096, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 6144, n_tokens = 2048, progress = 0.676056
slot update_slots: id  0 | task 0 | n_past = 6144, memory_seq_rm [6144, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 8192, n_tokens = 2048, progress = 0.901408
slot update_slots: id  0 | task 0 | n_past = 8192, memory_seq_rm [8192, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 9088, n_tokens = 896, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 9088, n_tokens = 896
slot print_timing: id  0 | task 0 | 
prompt eval time =   55895.17 ms /  9088 tokens (    6.15 ms per token,   162.59 tokens per second)
       eval time =   51003.89 ms /   360 tokens (  141.68 ms per token,     7.06 tokens per second)
      total time =  106899.06 ms /  9448 tokens
slot      release: id  0 | task 0 | stop processing: n_past = 9447, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /completion 192.168.XXX.XXX 200
srv  log_server_r: request: GET /v1/models 192.168.XXX.XXX 200
got exception: {"code":500,"message":"Assistant response prefill is incompatible with enable_thinking.","type":"server_error"}
srv  log_server_r: request: POST /v1/chat/completions 192.168.XXX.XXX 500
srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 24327265651
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 9447, total state size = 3395.126 MiB
srv          load:  - looking for better prompt, base f_keep = 0.000, sim = 0.001
srv        update:  - cache state: 1 prompts, 3395.126 MiB (limits: 8192.000 MiB, 32000 tokens, 32000 est)
srv        update:    - prompt 0x5eb758fcbe90:    9447 tokens, checkpoints:  0,  3395.126 MiB
srv  get_availabl: prompt cache update took 7050.51 ms
slot launch_slot_: id  0 | task 365 | processing task
slot update_slots: id  0 | task 365 | new prompt, n_ctx_slot = 32000, n_keep = 0, n_prompt_tokens = 3633
slot update_slots: id  0 | task 365 | old: ... [gMASK]<sop> | [System note: Write one reply
slot update_slots: id  0 | task 365 | new: ... [gMASK]<sop> | <|system|>
<TASK>
Start
slot update_slots: id  0 | task 365 |   151331  151333   84329    5185      25    9641     825    9846
slot update_slots: id  0 | task 365 |   151331  151333  151335     198    3125    7384     397    3479
slot update_slots: id  0 | task 365 | n_past = 2, memory_seq_rm [2, end)
slot update_slots: id  0 | task 365 | prompt processing progress, n_past = 2050, n_tokens = 2048, progress = 0.564272
slot update_slots: id  0 | task 365 | n_past = 2050, memory_seq_rm [2050, end)
slot update_slots: id  0 | task 365 | prompt processing progress, n_past = 3633, n_tokens = 1583, progress = 1.000000
slot update_slots: id  0 | task 365 | prompt done, n_past = 3633, n_tokens = 1583
slot print_timing: id  0 | task 365 | 
prompt eval time =   18311.64 ms /  3631 tokens (    5.04 ms per token,   198.29 tokens per second)
       eval time =   33573.69 ms /   345 tokens (   97.32 ms per token,    10.28 tokens per second)
      total time =   51885.33 ms /  3976 tokens
slot      release: id  0 | task 365 | stop processing: n_past = 3977, truncated = 0

wallentri88 avatar Nov 04 '25 23:11 wallentri88

@wallentri88 if you only comment out those lines on the latest commit, does it solve the issue? I wonder if a PR that lets you enable or disable those lines with an env variable would be worthwhile.

Panchovix avatar Nov 05 '25 18:11 Panchovix

Technically those statements are needed only for the TG phase, to facilitate fusion.

I wonder if we should move them to the graph_optimize function on the backend.

am17an avatar Nov 05 '25 18:11 am17an

If a test is needed for a possible PR with that change, I can try it.

Panchovix avatar Nov 05 '25 18:11 Panchovix

Same issue with dual 3090 + CPU offload, latest llama.cpp (built from source), with GLM 4.5 on Linux (PP for long context is halved). Things tested:

  • -kvu didn't work
  • GGML_CUDA_DISABLE_FUSION=1 as env variable didn't work
  • GGML_CUDA_DISABLE_GRAPHS=1 as env variable didn't work
  • commenting the two lines in llama-graph.cpp worked.

Fortunately, I found this issue while searching, and @am17an found how to "fix" it. It also uses less VRAM (on CUDA1), and the first GPU (CUDA0) is used more during PP than on current master without the "fix". Could it be that master distributes some computation to CUDA1, and because CUDA1 has worse PCIe bandwidth, PP is slower?

I am running NVIDIA driver 575.57.08 and CUDA 12.8 on Debian 13.
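To check that hypothesis, the negotiated PCIe link of each GPU can be queried like this (plain nvidia-smi query fields, nothing llama.cpp-specific; note the reported gen can drop while a GPU is idle because of power management):

# show the current PCIe generation and lane width per GPU
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv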

Command:

/home/pc/llama.cpp/build/bin/llama-server \
      --model /home/pc/fast/GLM-4.5-GGUF/GLM-4.5-UD-Q2_K_XL-00001-of-00003.gguf \
      -c 60500 \
      --jinja \
      --reasoning-format auto \
      -fa on -ngl 99 -ub 2048 -b 8192 \
      -t 14 \
      --tensor_split 64,30 \
      --n-cpu-moe 85 \
      --no-warmup \
      --cache-reuse 1024 \
      --slot-save-path "/home/pc/fast/cache/llamacpp"

Numbers:

Newest llama.cpp:

prompt eval time = 1234920.81 ms / 56698 tokens (   21.78 ms per token,    45.91 tokens per second)
       eval time = 1055488.89 ms /  3802 tokens (  277.61 ms per token,     3.60 tokens per second)
      total time = 2290409.71 ms / 60500 tokens

Commenting the two lines in llama-graph.cpp:

prompt eval time =  584759.42 ms / 56698 tokens (   10.31 ms per token,    96.96 tokens per second)
       eval time =  855014.94 ms /  3108 tokens (  275.10 ms per token,     3.64 tokens per second)
      total time = 1439774.36 ms / 59806 tokens

abc-nix avatar Nov 06 '25 14:11 abc-nix

I would maybe reopen the issue if possible, as this still happens on the latest version.

Panchovix avatar Nov 06 '25 15:11 Panchovix