Eval bug: when offloading to CPU with CUDA (multi-GPU) after commit f77c13b (CUDA: General GEMV fusion), PP performance seems to be reduced by ~75%
Name and Version
./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 3: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 4: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
Device 5: NVIDIA A40, compute capability 8.6, VMM: yes
version: 6906 (0de0a0157)
built with cc (GCC) 15.2.1 20251022 (Red Hat 15.2.1-3) for x86_64-redhat-linux
Operating systems
Linux
GGML backends
CUDA
Hardware
Fedora 42, AMD Ryzen 9 9900X, 192 GB RAM, RTX 5090 x2, RTX 4090 x2, RTX A6000, A40
Models
DeepSeek-V3-0324, DeepSeek-R1-0528, DeepSeek-V3.1, DeepSeek-V3.1-Terminus
Problem description & steps to reproduce
I build llama.cpp with:
cmake -B build \
-DGGML_CUDA=ON \
-DGGML_CUDA_FA_ALL_QUANTS=ON \
-DGGML_BLAS=OFF \
-DGGML_RPC=ON \
-DCMAKE_CUDA_ARCHITECTURES="86;89;120" \
-DGGML_MAX_CONTEXTS=2048 \
When offloading DeepSeek V3 0324 / R1 0528 / V3.1 models to CPU on commit https://github.com/ggml-org/llama.cpp/commit/5d195f17bc60eacc15cfb929f9403cf29ccdf419, with:
LLAMA_SET_ROWS=1 ./llama-server -m '/Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf' -c 32768 -ngl 999 --no-mmap \
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \
-ot "blk.(8|9|10|11).ffn.=CUDA1" \
-ot "blk.(12|13|14|15).ffn.=CUDA2" \
-ot "blk.(16|17|18|19|20).ffn.=CUDA3" \
-ot "blk.(21|22|23).ffn.=CUDA4" \
-ot "blk.(24|25|26|27).ffn.=CUDA4" \
-ot "blk.(28|29|30|31|32|33|34).ffn.=CUDA5" \
-ot "exps=CPU" \
-fa on -mg 0 -ub 2560 -b 256
When loading, it looks like this:
load_tensors: CUDA0 model buffer size = 25363.28 MiB
load_tensors: CUDA1 model buffer size = 19841.07 MiB
load_tensors: CUDA2 model buffer size = 19842.82 MiB
load_tensors: CUDA3 model buffer size = 24357.64 MiB
load_tensors: CUDA4 model buffer size = 34490.44 MiB
load_tensors: CUDA5 model buffer size = 35639.92 MiB
load_tensors: CPU model buffer size = 497.11 MiB
load_tensors: CUDA_Host model buffer size = 122500.00 MiB
As you can see, the CPU model buffer is listed just after the GPU (CUDA) buffers and just before CUDA_Host. This nets me these speeds:
prompt eval time = 17797.43 ms / 4373 tokens ( 4.07 ms per token, 245.71 tokens per second)
eval time = 42683.82 ms / 453 tokens ( 94.22 ms per token, 10.61 tokens per second)
total time = 60481.25 ms / 4826 tokens
A variant I saw while testing that also works fine is:
load_tensors: CUDA0 model buffer size = 25363.28 MiB
load_tensors: CUDA1 model buffer size = 19841.07 MiB
load_tensors: CUDA2 model buffer size = 19842.82 MiB
load_tensors: CUDA3 model buffer size = 24357.64 MiB
load_tensors: CUDA4 model buffer size = 34490.44 MiB
load_tensors: CUDA5 model buffer size = 35639.92 MiB
load_tensors: CUDA_Host model buffer size = 122500.00 MiB
load_tensors: CPU model buffer size = 497.11 MiB
While after commit https://github.com/ggml-org/llama.cpp/commit/5d195f17bc60eacc15cfb929f9403cf29ccdf419 (I'm still not sure which later commit causes the issue), it looks like this:
load_tensors: CPU model buffer size = 497.11 MiB
load_tensors: CUDA0 model buffer size = 25363.28 MiB
load_tensors: CUDA1 model buffer size = 19841.07 MiB
load_tensors: CUDA2 model buffer size = 19842.82 MiB
load_tensors: CUDA3 model buffer size = 24357.64 MiB
load_tensors: CUDA4 model buffer size = 34490.44 MiB
load_tensors: CUDA5 model buffer size = 35639.92 MiB
load_tensors: CUDA_Host model buffer size = 122500.00 MiB
Which nets me these speeds:
prompt eval time = 49380.49 ms / 4373 tokens ( 11.29 ms per token, 88.56 tokens per second)
eval time = 50832.32 ms / 542 tokens ( 93.79 ms per token, 10.66 tokens per second)
I deleted ccache and did not use it for any of these builds, to rule out extra issues.
For reference, ik_llama.cpp handles this via https://github.com/ikawrakow/ik_llama.cpp/pull/405, with this explanation:
When part of the tensors are stored in RAM but there are faster back-ends available (GPU), the scheduler needs to decide whether to offload the data for a given op to a faster back-end or to compute the op on the CPU. This is currently done via a simple heuristic where only matrix multiplications (GGML_MUL_MAT and GGML_MUL_MAT_ID) are offloaded if the batch size is larger than some threshold (currently 32). When fmoe is enabled, the fused (ffn_up*X)*unary(ffn_gate*X) op is never uploaded. In contrast, in mainline llama.cpp matrix multiplications are always offloaded when the batch size is >= 32. The result of this is that when the batch size becomes large enough, llama.cpp will outperform ik_llama.cpp in prompt processing speed. As "large enough" depends on many factors (size of tensors that need to be uploaded, speed of the PCI-E bus to the GPU, relative speed of the GPU vs the CPU), it is hard to devise a better offload policy that automatically takes the best decision.
So it seems that, for some reason, some matrix multiplications are now done on the CPU instead of the main CUDA device (CUDA0)?
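For illustration only, the policy quoted above boils down to roughly the following standalone sketch; this is not the actual ggml scheduler code, and only the op names and the threshold of 32 are taken from the explanation:
// Toy illustration of the offload policy described above: weights live in host RAM,
// and a matrix multiplication is only worth copying to a GPU back-end when the batch
// is large enough to amortize the transfer. Not the real ggml scheduler code.
#include <cstdio>
#include <string>

static bool should_offload_to_gpu(const std::string & op, int n_batch_tokens) {
    const int threshold = 32; // threshold mentioned in the quoted explanation
    const bool is_matmul = (op == "GGML_MUL_MAT" || op == "GGML_MUL_MAT_ID");
    return is_matmul && n_batch_tokens >= threshold;
}

int main() {
    const char * ops[] = {"GGML_MUL_MAT", "GGML_MUL_MAT_ID", "GGML_ADD"};
    for (const char * op : ops) {
        for (int n : {1, 32, 2560}) {
            printf("%-16s batch=%-5d -> %s\n", op, n,
                   should_offload_to_gpu(op, n) ? "offload to GPU" : "compute on CPU");
        }
    }
    return 0;
}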
First Bad Commit
I'm not sure where it started exactly, but https://github.com/ggml-org/llama.cpp/commit/5d195f17bc60eacc15cfb929f9403cf29ccdf419 works fine.
Relevant log output
./llama-server -m '/Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf' -c 32768 --no-mmap -ngl 999 \
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \
-ot "blk.(8|9|10|11).ffn.=CUDA1" \
-ot "blk.(12|13|14|15).ffn.=CUDA2" \
-ot "blk.(16|17|18|19|20).ffn.=CUDA3" \
-ot "blk.(21|22|23).ffn.=CUDA4" \
-ot "blk.(24|25|26|27).ffn.=CUDA4" \
-ot "blk.(28|29|30|31|32|33|34).ffn.=CUDA5" \
-ot "exps=CPU" \
-fa on -mg 0 -ub 2560 -b 2560 --cache-ram 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 3: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 4: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
Device 5: NVIDIA A40, compute capability 8.6, VMM: yes
build: 6906 (0de0a0157) with cc (GCC) 15.2.1 20251022 (Red Hat 15.2.1-3) for x86_64-redhat-linux
system info: n_threads = 12, n_threads_batch = 12, total_threads = 24
system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CUDA : ARCHS = 860,890,1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 23
main: loading model
srv load_model: loading model '/Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 31600 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) (0000:02:00.0) - 23686 MiB free
llama_model_load_from_file_impl: using device CUDA2 (NVIDIA GeForce RTX 4090) (0000:17:00.0) - 23675 MiB free
llama_model_load_from_file_impl: using device CUDA3 (NVIDIA GeForce RTX 5090) (0000:03:00.0) - 31600 MiB free
llama_model_load_from_file_impl: using device CUDA4 (NVIDIA RTX A6000) (0000:0d:00.0) - 48268 MiB free
llama_model_load_from_file_impl: using device CUDA5 (NVIDIA A40) (0000:06:00.0) - 48268 MiB free
llama_model_loader: loaded meta data with 64 key-value pairs and 1086 tensors from /Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Deepseek-V3-0324
llama_model_loader: - kv 3: general.version str = V3-0324
llama_model_loader: - kv 4: general.basename str = Deepseek-V3-0324
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 256x20B
llama_model_loader: - kv 7: general.license str = mit
llama_model_loader: - kv 8: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = DeepSeek V3 0324
llama_model_loader: - kv 11: general.base_model.0.version str = V3-0324
llama_model_loader: - kv 12: general.base_model.0.organization str = Deepseek Ai
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv 14: general.tags arr[str,4] = ["deepseek_v3", "deepseek", "unsloth"...
llama_model_loader: - kv 15: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 16: deepseek2.block_count u32 = 61
llama_model_loader: - kv 17: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 18: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 19: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 20: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 21: deepseek2.attention.head_count_kv u32 = 1
llama_model_loader: - kv 22: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 23: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 24: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 25: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 26: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 27: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 28: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 29: deepseek2.attention.key_length u32 = 576
llama_model_loader: - kv 30: deepseek2.attention.value_length u32 = 512
llama_model_loader: - kv 31: deepseek2.attention.key_length_mla u32 = 192
llama_model_loader: - kv 32: deepseek2.attention.value_length_mla u32 = 128
llama_model_loader: - kv 33: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 34: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 35: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 36: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 37: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 38: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 39: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 40: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 41: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 42: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 43: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 44: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 45: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 46: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv 47: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 48: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv 49: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 50: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 51: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 52: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 53: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 54: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 55: general.quantization_version u32 = 2
llama_model_loader: - kv 56: general.file_type u32 = 12
llama_model_loader: - kv 57: quantize.imatrix.file str = DeepSeek-V3-0324-GGUF/imatrix_unsloth...
llama_model_loader: - kv 58: quantize.imatrix.dataset str = unsloth_calibration_DeepSeek-V3-0324.txt
llama_model_loader: - kv 59: quantize.imatrix.entries_count i32 = 720
llama_model_loader: - kv 60: quantize.imatrix.chunks_count i32 = 60
llama_model_loader: - kv 61: split.no u16 = 0
llama_model_loader: - kv 62: split.tensors.count i32 = 1086
llama_model_loader: - kv 63: split.count u16 = 0
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 122 tensors
llama_model_loader: - type q3_K: 173 tensors
llama_model_loader: - type q4_K: 385 tensors
llama_model_loader: - type q5_K: 29 tensors
llama_model_loader: - type q6_K: 16 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q3_K - Medium
print_info: file size = 275.91 GiB (3.53 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 1 ('<|end▁of▁sentence|>')
load: special tokens cache size = 818
load: token to piece cache size = 0.8223 MB
print_info: arch = deepseek2
print_info: vocab_only = 0
print_info: n_ctx_train = 163840
print_info: n_embd = 7168
print_info: n_layer = 61
print_info: n_head = 128
print_info: n_head_kv = 1
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 576
print_info: n_embd_head_v = 512
print_info: n_gqa = 128
print_info: n_embd_k_gqa = 576
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18432
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = yarn
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = 671B
print_info: model params = 671.03 B
print_info: general.name = Deepseek-V3-0324
print_info: n_layer_dense_lead = 3
print_info: n_lora_q = 1536
print_info: n_lora_kv = 512
print_info: n_embd_head_k_mla = 192
print_info: n_embd_head_v_mla = 128
print_info: n_ff_exp = 2048
print_info: n_expert_shared = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm = 1
print_info: expert_gating_func = sigmoid
print_info: rope_yarn_log_mul = 0.1000
print_info: vocab type = BPE
print_info: n_vocab = 129280
print_info: n_merges = 127741
print_info: BOS token = 0 '<|begin▁of▁sentence|>'
print_info: EOS token = 1 '<|end▁of▁sentence|>'
print_info: EOT token = 1 '<|end▁of▁sentence|>'
print_info: PAD token = 2 '<|▁pad▁|>'
print_info: LF token = 201 'Ċ'
print_info: FIM PRE token = 128801 '<|fim▁begin|>'
print_info: FIM SUF token = 128800 '<|fim▁hole|>'
print_info: FIM MID token = 128802 '<|fim▁end|>'
print_info: EOG token = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors: CPU model buffer size = 497.11 MiB
load_tensors: CUDA0 model buffer size = 25363.28 MiB
load_tensors: CUDA1 model buffer size = 19841.07 MiB
load_tensors: CUDA2 model buffer size = 19842.82 MiB
load_tensors: CUDA3 model buffer size = 24357.64 MiB
load_tensors: CUDA4 model buffer size = 34490.44 MiB
load_tensors: CUDA5 model buffer size = 35639.92 MiB
load_tensors: CUDA_Host model buffer size = 122500.00 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch = 2560
llama_context: n_ubatch = 2560
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 0.025
llama_context: n_ctx_per_seq (32768) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.49 MiB
llama_kv_cache: CUDA0 KV buffer size = 680.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 476.00 MiB
llama_kv_cache: CUDA2 KV buffer size = 476.00 MiB
llama_kv_cache: CUDA3 KV buffer size = 680.00 MiB
llama_kv_cache: CUDA4 KV buffer size = 952.00 MiB
llama_kv_cache: CUDA5 KV buffer size = 884.00 MiB
llama_kv_cache: size = 4148.00 MiB ( 32768 cells, 61 layers, 1/1 seqs), K (f16): 2196.00 MiB, V (f16): 1952.00 MiB
llama_context: CUDA0 compute buffer size = 3628.50 MiB
llama_context: CUDA1 compute buffer size = 2052.63 MiB
llama_context: CUDA2 compute buffer size = 1995.05 MiB
llama_context: CUDA3 compute buffer size = 1995.05 MiB
llama_context: CUDA4 compute buffer size = 4848.52 MiB
llama_context: CUDA5 compute buffer size = 4848.53 MiB
llama_context: CUDA_Host compute buffer size = 390.07 MiB
llama_context: graph nodes = 4843
llama_context: graph splits = 206 (with bs=2560), 154 (with bs=1)
common_init_from_params: added <|end▁of▁sentence|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 32768
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 32768
srv init: prompt cache is enabled, size limit: 8192 MiB
srv init: use `--cache-ram 0` to disable the prompt cache
srv init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv init: thinking = 0
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='', is_first_sp=true, is_last_user=false) %}{%- for message in messages %}{%- if message['role'] == 'system' %}{%- if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{%- else %}{% set ns.system_prompt = ns.system_prompt + '
' + message['content'] %}{%- endif %}{%- endif %}{%- endfor %}{{ bos_token }}{{ ns.system_prompt }}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{%- set ns.is_first = false -%}{%- set ns.is_last_user = true -%}{{'<|User|>' + message['content'] + '<|Assistant|>'}}{%- endif %}{%- if message['role'] == 'assistant' and message['tool_calls'] is defined and message['tool_calls'] is not none %}{%- set ns.is_last_user = false -%}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{%- endif %}{%- set ns.is_first = false %}{%- set ns.is_tool = false -%}{%- set ns.is_output_first = true %}{%- for tool in message['tool_calls'] %}{%- if not ns.is_first %}{%- if message['content'] is none %}{{'<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '
' + '' + '
' + tool['function']['arguments'] + '
' + '' + '<|tool▁call▁end|>'}}{%- else %}{{message['content'] + '<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '
' + '' + '
' + tool['function']['arguments'] + '
' + '' + '<|tool▁call▁end|>'}}{%- endif %}{%- set ns.is_first = true -%}{%- else %}{{'
' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '
' + '' + '
' + tool['function']['arguments'] + '
' + '' + '<|tool▁call▁end|>'}}{%- endif %}{%- endfor %}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- if message['role'] == 'assistant' and (message['tool_calls'] is not defined or message['tool_calls'] is none)%}{%- set ns.is_last_user = false -%}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{{content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_last_user = false -%}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'
<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_last_user and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}, example_format: 'You are a helpful assistant
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET /v1/models 127.0.0.1 200
check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
common_sampler_types_from_names: unable to match sampler by name 'tfs_z'
common_sampler_types_from_names: unable to match sampler by name 'typical_p'
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 4373
slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 2560, batch.n_tokens = 2560, progress = 0.585410
slot update_slots: id 0 | task 0 | n_tokens = 2560, memory_seq_rm [2560, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 4373, batch.n_tokens = 1813, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_tokens = 4373, batch.n_tokens = 1813
slot print_timing: id 0 | task 0 |
prompt eval time = 49380.49 ms / 4373 tokens ( 11.29 ms per token, 88.56 tokens per second)
eval time = 50832.32 ms / 542 tokens ( 93.79 ms per token, 10.66 tokens per second)
total time = 100212.80 ms / 4915 tokens
slot release: id 0 | task 0 | stop processing: n_tokens = 4914, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /completion 127.0.0.1 200
srv log_server_r: request: POST /tokenize 127.0.0.1 200
^Csrv operator(): operator(): cleaning up before exit...
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 5090) | 32109 = 1426 + ( 29671 = 25363 + 680 + 3628) + 1010 |
llama_memory_breakdown_print: | - CUDA1 (RTX 4090) | 24080 = 806 + ( 22369 = 19841 + 476 + 2052) + 905 |
llama_memory_breakdown_print: | - CUDA2 (RTX 4090) | 24077 = 851 + ( 22313 = 19842 + 476 + 1995) + 913 |
llama_memory_breakdown_print: | - CUDA3 (RTX 5090) | 32109 = 4034 + ( 27032 = 24357 + 680 + 1995) + 1042 |
llama_memory_breakdown_print: | - CUDA4 (RTX A6000) | 48539 = 2498 + ( 40290 = 34490 + 952 + 4848) + 5749 |
llama_memory_breakdown_print: | - CUDA5 (A40) | 48539 = 1418 + ( 41372 = 35639 + 884 + 4848) + 5748 |
llama_memory_breakdown_print: | - Host | 123387 = 122997 + 0 + 390 |
For reference, when using commit https://github.com/ggml-org/llama.cpp/commit/5d195f17bc60eacc15cfb929f9403cf29ccdf419, the relevant log output looks like this (and with correct speeds):
./llama-server -m '/Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf' -c 32768 -ngl 999 --no-mmap \
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \
-ot "blk.(8|9|10|11).ffn.=CUDA1" \
-ot "blk.(12|13|14|15).ffn.=CUDA2" \
-ot "blk.(16|17|18|19|20).ffn.=CUDA3" \
-ot "blk.(21|22|23).ffn.=CUDA4" \
-ot "blk.(24|25|26|27).ffn.=CUDA4" \
-ot "blk.(28|29|30|31|32|33|34).ffn.=CUDA5" \
-ot "exps=CPU" \
-fa on -mg 0 -ub 2560 -b 2560
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 3: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 4: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
Device 5: NVIDIA A40, compute capability 8.6, VMM: yes
build: 6839 (5d195f17b) with cc (GCC) 15.2.1 20251022 (Red Hat 15.2.1-3) for x86_64-redhat-linux
system info: n_threads = 12, n_threads_batch = 12, total_threads = 24
system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CUDA : ARCHS = 860,890,1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 23
main: loading model
srv load_model: loading model '/Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 31600 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) (0000:02:00.0) - 23686 MiB free
llama_model_load_from_file_impl: using device CUDA2 (NVIDIA GeForce RTX 4090) (0000:17:00.0) - 23675 MiB free
llama_model_load_from_file_impl: using device CUDA3 (NVIDIA GeForce RTX 5090) (0000:03:00.0) - 31600 MiB free
llama_model_load_from_file_impl: using device CUDA4 (NVIDIA RTX A6000) (0000:0d:00.0) - 48268 MiB free
llama_model_load_from_file_impl: using device CUDA5 (NVIDIA A40) (0000:06:00.0) - 48268 MiB free
llama_model_loader: loaded meta data with 64 key-value pairs and 1086 tensors from /Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Deepseek-V3-0324
llama_model_loader: - kv 3: general.version str = V3-0324
llama_model_loader: - kv 4: general.basename str = Deepseek-V3-0324
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 256x20B
llama_model_loader: - kv 7: general.license str = mit
llama_model_loader: - kv 8: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = DeepSeek V3 0324
llama_model_loader: - kv 11: general.base_model.0.version str = V3-0324
llama_model_loader: - kv 12: general.base_model.0.organization str = Deepseek Ai
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv 14: general.tags arr[str,4] = ["deepseek_v3", "deepseek", "unsloth"...
llama_model_loader: - kv 15: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 16: deepseek2.block_count u32 = 61
llama_model_loader: - kv 17: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 18: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 19: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 20: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 21: deepseek2.attention.head_count_kv u32 = 1
llama_model_loader: - kv 22: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 23: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 24: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 25: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 26: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 27: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 28: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 29: deepseek2.attention.key_length u32 = 576
llama_model_loader: - kv 30: deepseek2.attention.value_length u32 = 512
llama_model_loader: - kv 31: deepseek2.attention.key_length_mla u32 = 192
llama_model_loader: - kv 32: deepseek2.attention.value_length_mla u32 = 128
llama_model_loader: - kv 33: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 34: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 35: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 36: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 37: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 38: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 39: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 40: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 41: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 42: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 43: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 44: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 45: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 46: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv 47: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 48: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv 49: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 50: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 51: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 52: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 53: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 54: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 55: general.quantization_version u32 = 2
llama_model_loader: - kv 56: general.file_type u32 = 12
llama_model_loader: - kv 57: quantize.imatrix.file str = DeepSeek-V3-0324-GGUF/imatrix_unsloth...
llama_model_loader: - kv 58: quantize.imatrix.dataset str = unsloth_calibration_DeepSeek-V3-0324.txt
llama_model_loader: - kv 59: quantize.imatrix.entries_count i32 = 720
llama_model_loader: - kv 60: quantize.imatrix.chunks_count i32 = 60
llama_model_loader: - kv 61: split.no u16 = 0
llama_model_loader: - kv 62: split.tensors.count i32 = 1086
llama_model_loader: - kv 63: split.count u16 = 0
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 122 tensors
llama_model_loader: - type q3_K: 173 tensors
llama_model_loader: - type q4_K: 385 tensors
llama_model_loader: - type q5_K: 29 tensors
llama_model_loader: - type q6_K: 16 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q3_K - Medium
print_info: file size = 275.91 GiB (3.53 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 1 ('<|end▁of▁sentence|>')
load: special tokens cache size = 818
load: token to piece cache size = 0.8223 MB
print_info: arch = deepseek2
print_info: vocab_only = 0
print_info: n_ctx_train = 163840
print_info: n_embd = 7168
print_info: n_layer = 61
print_info: n_head = 128
print_info: n_head_kv = 1
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 576
print_info: n_embd_head_v = 512
print_info: n_gqa = 128
print_info: n_embd_k_gqa = 576
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18432
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = yarn
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = 671B
print_info: model params = 671.03 B
print_info: general.name = Deepseek-V3-0324
print_info: n_layer_dense_lead = 3
print_info: n_lora_q = 1536
print_info: n_lora_kv = 512
print_info: n_embd_head_k_mla = 192
print_info: n_embd_head_v_mla = 128
print_info: n_ff_exp = 2048
print_info: n_expert_shared = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm = 1
print_info: expert_gating_func = sigmoid
print_info: rope_yarn_log_mul = 0.1000
print_info: vocab type = BPE
print_info: n_vocab = 129280
print_info: n_merges = 127741
print_info: BOS token = 0 '<|begin▁of▁sentence|>'
print_info: EOS token = 1 '<|end▁of▁sentence|>'
print_info: EOT token = 1 '<|end▁of▁sentence|>'
print_info: PAD token = 2 '<|▁pad▁|>'
print_info: LF token = 201 'Ċ'
print_info: FIM PRE token = 128801 '<|fim▁begin|>'
print_info: FIM SUF token = 128800 '<|fim▁hole|>'
print_info: FIM MID token = 128802 '<|fim▁end|>'
print_info: EOG token = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors: CUDA0 model buffer size = 25363.28 MiB
load_tensors: CUDA1 model buffer size = 19841.07 MiB
load_tensors: CUDA2 model buffer size = 19842.82 MiB
load_tensors: CUDA3 model buffer size = 24357.64 MiB
load_tensors: CUDA4 model buffer size = 34490.44 MiB
load_tensors: CUDA5 model buffer size = 35639.92 MiB
load_tensors: CPU model buffer size = 497.11 MiB
load_tensors: CUDA_Host model buffer size = 122500.00 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch = 2560
llama_context: n_ubatch = 2560
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 0.025
llama_context: n_ctx_per_seq (32768) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.49 MiB
llama_kv_cache: CUDA0 KV buffer size = 680.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 476.00 MiB
llama_kv_cache: CUDA2 KV buffer size = 476.00 MiB
llama_kv_cache: CUDA3 KV buffer size = 680.00 MiB
llama_kv_cache: CUDA4 KV buffer size = 952.00 MiB
llama_kv_cache: CUDA5 KV buffer size = 884.00 MiB
llama_kv_cache: size = 4148.00 MiB ( 32768 cells, 61 layers, 1/1 seqs), K (f16): 2196.00 MiB, V (f16): 1952.00 MiB
llama_context: CUDA0 compute buffer size = 3628.50 MiB
llama_context: CUDA1 compute buffer size = 2052.63 MiB
llama_context: CUDA2 compute buffer size = 1995.05 MiB
llama_context: CUDA3 compute buffer size = 1995.05 MiB
llama_context: CUDA4 compute buffer size = 2050.05 MiB
llama_context: CUDA5 compute buffer size = 2050.06 MiB
llama_context: CUDA_Host compute buffer size = 390.07 MiB
llama_context: graph nodes = 4785
llama_context: graph splits = 206 (with bs=2560), 154 (with bs=1)
common_init_from_params: added <|end▁of▁sentence|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 32768
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 32768
srv init: prompt cache is enabled, size limit: 8192 MiB
srv init: use `--cache-ram 0` to disable the prompt cache
srv init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv init: thinking = 0
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='', is_first_sp=true, is_last_user=false) %}{%- for message in messages %}{%- if message['role'] == 'system' %}{%- if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{%- else %}{% set ns.system_prompt = ns.system_prompt + '
' + message['content'] %}{%- endif %}{%- endif %}{%- endfor %}{{ bos_token }}{{ ns.system_prompt }}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{%- set ns.is_first = false -%}{%- set ns.is_last_user = true -%}{{'<|User|>' + message['content'] + '<|Assistant|>'}}{%- endif %}{%- if message['role'] == 'assistant' and message['tool_calls'] is defined and message['tool_calls'] is not none %}{%- set ns.is_last_user = false -%}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{%- endif %}{%- set ns.is_first = false %}{%- set ns.is_tool = false -%}{%- set ns.is_output_first = true %}{%- for tool in message['tool_calls'] %}{%- if not ns.is_first %}{%- if message['content'] is none %}{{'<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '
' + '```json' + '
' + tool['function']['arguments'] + '
' + '```' + '<|tool▁call▁end|>'}}{%- else %}{{message['content'] + '<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '
' + '```json' + '
' + tool['function']['arguments'] + '
' + '```' + '<|tool▁call▁end|>'}}{%- endif %}{%- set ns.is_first = true -%}{%- else %}{{'
' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '
' + '```json' + '
' + tool['function']['arguments'] + '
' + '```' + '<|tool▁call▁end|>'}}{%- endif %}{%- endfor %}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- if message['role'] == 'assistant' and (message['tool_calls'] is not defined or message['tool_calls'] is none)%}{%- set ns.is_last_user = false -%}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{{content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_last_user = false -%}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'
<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_last_user and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}, example_format: 'You are a helpful assistant
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
common_sampler_types_from_names: unable to match sampler by name 'tfs_z'
common_sampler_types_from_names: unable to match sampler by name 'typical_p'
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 4373
slot update_slots: id 0 | task 0 | n_past = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2560, n_tokens = 2560, progress = 0.585410
slot update_slots: id 0 | task 0 | n_past = 2560, memory_seq_rm [2560, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 4373, n_tokens = 1813, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 4373, n_tokens = 1813
slot print_timing: id 0 | task 0 |
prompt eval time = 17807.96 ms / 4373 tokens ( 4.07 ms per token, 245.56 tokens per second)
eval time = 43334.85 ms / 441 tokens ( 98.26 ms per token, 10.18 tokens per second)
total time = 61142.81 ms / 4814 tokens
slot release: id 0 | task 0 | stop processing: n_past = 4813, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /completion 127.0.0.1 200
srv log_server_r: request: POST /tokenize 127.0.0.1 200
^Csrv operator(): operator(): cleaning up before exit...
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 5090) | 32109 = 1426 + ( 29671 = 25363 + 680 + 3628) + 1010 |
llama_memory_breakdown_print: | - CUDA1 (RTX 4090) | 24080 = 806 + ( 22369 = 19841 + 476 + 2052) + 905 |
llama_memory_breakdown_print: | - CUDA2 (RTX 4090) | 24077 = 851 + ( 22313 = 19842 + 476 + 1995) + 913 |
llama_memory_breakdown_print: | - CUDA3 (RTX 5090) | 32109 = 4034 + ( 27032 = 24357 + 680 + 1995) + 1042 |
llama_memory_breakdown_print: | - CUDA4 (RTX A6000) | 48539 = 5296 + ( 37492 = 34490 + 952 + 2050) + 5750 |
llama_memory_breakdown_print: | - CUDA5 (A40) | 48539 = 4216 + ( 38573 = 35639 + 884 + 2050) + 5748 |
llama_memory_breakdown_print: | - Host | 123387 = 122997 + 0 + 390 |
I think it might be #16715, but I'm not sure how the fusion would affect offload. @slaren can you help?
@Panchovix can you confirm if the problem goes away with GGML_CUDA_DISABLE_FUSION=1?
Yeah, I've noticed something similar, as I've got a custom hack in the CUDA backend that patches the "batch size >= 32" threshold to a value I can read in via an environment variable.
I noticed last week that the break-even value for DeepSeek went from 1800-1900 to ~2700, and for Kimi-K2 from 2700-2800 to ~4000.
I'll try the GGML_CUDA_DISABLE_FUSION setting and report back whether it brings the break-even value back down.
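(For context, the hack I mean is roughly the sketch below; it is purely illustrative, the LLAMA_OFFLOAD_THRESHOLD name is made up and this is not the actual patch.)
// Illustrative sketch of an env-var-configurable offload threshold, replacing the
// hard-coded ">= 32" check. The variable name LLAMA_OFFLOAD_THRESHOLD is hypothetical.
#include <cstdio>
#include <cstdlib>

static int offload_threshold() {
    static const int value = [] {
        const char * env = std::getenv("LLAMA_OFFLOAD_THRESHOLD"); // hypothetical name
        return env ? std::atoi(env) : 32;                          // 32 = upstream default
    }();
    return value;
}

int main() {
    std::printf("offloading matmuls when batch size >= %d\n", offload_threshold());
    return 0;
}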
~One other weird thing I noticed was this:~
BATCH_SIZE=8192
PP_SIZE=2048
./llama-bench \
--model "$MODEL_PATH" \
--batch-size $BATCH_SIZE \
--ubatch-size $BATCH_SIZE \
--n-gpu-layers 99 \
--flash-attn 1 \
--numa distribute \
--threads $(nproc) \
--override-tensor exps=CPU \
--n-prompt $PP_SIZE \
--n-gen 0 \
--no-op-offload 1,0
~If you run this and change PP_SIZE to be a multiple of 512 then you get way better PP speed - @Panchovix can you test something similar for your setup and see if this is the case for you too? This might help narrow down the problem.~
~This wasn't the case before, as I was using PP_SIZE = 1800, etc before.~
This isn't actually related, as I just realised it happens for the non-offloaded run too - please ignore!
I would need a simpler way to reproduce this (e.g. single GPU, no -ot). If that's not possible, then you can try dumping the graph splits with GGML_SCHED_DEBUG=2 and try to figure out what has changed between the two versions.
Hello, sorry for the delay, I was not home.
Pardon my ignorance, I don't know whether the GGML variables have to be set at compile time or used as environment variables.
First I tried @am17an's suggestion and compiled with -DGGML_CUDA_DISABLE_FUSION=1, but the issue persists.
I also ran with GGML_CUDA_DISABLE_FUSION=1 when loading the model, and the issue persists.
I noticed that, for some reason, on the latest commits part of prompt processing seems to run on the A40/A6000 GPUs (visible as RX traffic on nvtop), limited by PCIe 4.0 x4.
It goes A6000 -> A40 -> 5090.
While on commit https://github.com/ggml-org/llama.cpp/commit/5d195f17bc60eacc15cfb929f9403cf29ccdf419, it seems to be done mostly on CUDA0, at the limit of PCIe 5.0 x8.
It goes directly to the 5090.
@slaren how exactly would I use GGML_SCHED_DEBUG=2? I tried both -DGGML_SCHED_DEBUG=2 when compiling and GGML_SCHED_DEBUG=2 as an env variable, but I don't see any difference in the output. Would I have to use -v?
Okay, yes, I had to use -v.
I attach the outputs when using GGML_SCHED_DEBUG=2. The output is gigantic.
Latest commit as of yesterday with the issue (https://github.com/ggml-org/llama.cpp/commit/0de0a01576772032008a689afc4d7c80685074c4)
And https://github.com/ggml-org/llama.cpp/commit/5d195f17bc60eacc15cfb929f9403cf29ccdf419 where it works correctly.
I did a few more tests in case it helps; I'm now using 2x RTX 3090 instead of the A40, but the issue persists.
I tried PR https://github.com/ggml-org/llama.cpp/pull/16935 (CUDA: avoid mul + bias fusion when buffers are split), but the issue persists.
Then, also on the latest commit, using -fa auto, it gives this message:
llama_context: layer 0 is assigned to device CUDA0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
llama_context: Flash Attention was auto, set to disabled
While on https://github.com/ggml-org/llama.cpp/commit/5d195f17bc60eacc15cfb929f9403cf29ccdf419, it gave:
llama_kv_cache: size = 3046.19 MiB ( 24064 cells, 61 layers, 1/1 seqs), K (f16): 1612.69 MiB, V (f16): 1433.50 MiB
llama_context: Flash Attention was auto, set to enabled
So maybe it is related to that?
First of all, sorry for changing the title so many times, but I finally found the commit.
After doing more tests, I can confirm that "CUDA: General GEMV fusion" is where the issue starts (commit f77c13b91f4d25754b6a0b857f98a6bc922a0aa7).
Now, there is another commit that I thought was related, the one that puts the CPU model buffer first: commit 7a0e900.
First I tried reverting the commit that orders the buffers (7a0e900) and building with -DGGML_CUDA_DISABLE_FUSION=1, but sadly that didn't work.
Then, I checked out the buffer-ordering commit 7a0e900, reverted "CUDA: add unused vars to mmvf and mmvq" (463bbf2) and then "CUDA: General GEMV fusion" (f77c13b), and built normally (without -DGGML_CUDA_DISABLE_FUSION) to check whether the commits are related. It looks like this when loading:
load_tensors: CPU model buffer size = 497.11 MiB
load_tensors: CUDA0 model buffer size = 25363.28 MiB
load_tensors: CUDA1 model buffer size = 19841.07 MiB
load_tensors: CUDA2 model buffer size = 19842.82 MiB
load_tensors: CUDA3 model buffer size = 24357.64 MiB
load_tensors: CUDA4 model buffer size = 34490.44 MiB
load_tensors: CUDA5 model buffer size = 35639.92 MiB
load_tensors: CUDA_Host model buffer size = 122500.00 MiB
(CPU model buffer first), and here it works fine!
prompt eval time = 17781.40 ms / 4373 tokens ( 4.07 ms per token, 245.93 tokens per second)
eval time = 40457.04 ms / 427 tokens ( 94.75 ms per token, 10.55 tokens per second)
Then, at the end, on the latest master commit (7e99416), I reverted in this order: first "CUDA: add expert reduce kernel" (4146d6a), then "CUDA: add unused vars to mmvf and mmvq" (463bbf2), and then "CUDA: General GEMV fusion" (f77c13b91f4d25754b6a0b857f98a6bc922a0aa7) (when resolving conflicts, I kept incoming instead of current), built normally (without -DGGML_CUDA_DISABLE_FUSION), and here it also works correctly!
Also, this does not happen with a single GPU + offloading (tested on DeepSeek V2), so it seems to be a multi-GPU bug.
@am17an, @slaren and @JohannesGaessler, sorry for pinging you guys, but do you have any idea what could be causing this? A way to disable General GEMV fusion would also work.
I can do any test tomorrow if needed, as it is late here in Chile.
You need to set GGML_CUDA_DISABLE_FUSION=1 as an environment variable at runtime; it's not a build-time option.
Oops, I will try that again tomorrow morning. But when I tried it a few days ago as an env variable in https://github.com/ggml-org/llama.cpp/issues/16912#issuecomment-3476531816, it sadly didn't work either.
@Panchovix if it's indeed GEMV fusion it gets disabled with GGML_CUDA_DISABLE_FUSION=1. You need to make sure it's an env variable that's accessible to your binary. e.g. GGML_CUDA_DISABLE_FUSION=1 <your command> will work. Also please share steps to reproduce, I have a machine with multiple GPUs so I can test
@Panchovix instead of manually reverting commits on top of master, please do a git bisect and identify the exact, unmodified master commit that introduced the issue so that devs can use it for reproduction.
@Panchovix if it's indeed GEMV fusion it gets disabled with
GGML_CUDA_DISABLE_FUSION=1. You need to make sure it's an env variable that's accessible to your binary. e.g. GGML_CUDA_DISABLE_FUSION=1 <your command> will work. Also please share steps to reproduce, I have a machine with multiple GPUs so I can test
@am17an I tried launching now with:
GGML_CUDA_DISABLE_FUSION=1 ./llama-server -m '/Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf' -c 32768 -ngl 999 --no-mmap \
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \
-ot "blk.(8|9|10|11).ffn.=CUDA1" \
-ot "blk.(12|13|14|15).ffn.=CUDA2" \
-ot "blk.(16|17|18|19|20).ffn.=CUDA3" \
-ot "blk.(21|22|23).ffn.=CUDA4" \
-ot "blk.(24|25|26|27).ffn.=CUDA4" \
-ot "blk.(28|29|30|31|32|33|34).ffn.=CUDA5" \
-ot "exps=CPU" \
-mg 0 -ub 2560 -b 2560
But the issue persists.
To reproduce: basically use 2 or more GPUs with -ot overrides, and put the remaining expert layers on CPU with -ot "exps=CPU". Then I go to the llama-server UI, paste a ~4096-token prompt, and check the speed.
@JohannesGaessler I went about the git bisect this way (7e99416 is the latest master commit, 3cfa9c3 is the one before the General GEMV fusion):
git bisect start
git bisect bad 7e99416
git bisect good 3cfa9c3
rm -r ylenuxtesting
cmake -B ylenuxtesting \
-DGGML_CUDA=ON \
-DGGML_CUDA_FA_ALL_QUANTS=ON \
-DGGML_BLAS=OFF \
-DGGML_RPC=ON \
-DCMAKE_CUDA_ARCHITECTURES="86;89;120" \
-DGGML_MAX_CONTEXTS=2048 \
-DGGML_SCHED_MAX_COPIES=1 \
cmake --build ylenuxtesting --config Release -j 11
GML_CUDA_DISABLE_FUSION=1 ./llama-server ...
git bisect bad
repeat building and running model with command above
git bisect bad
repeat building and running model with command above
git bisect bad
repeat building and running model with command above
git bisect bad
repeat building and running model with command above
git bisect bad
repeat building and running model with command above
git bisect bad
f77c13b91f4d25754b6a0b857f98a6bc922a0aa7 is the first bad commit
commit f77c13b91f4d25754b6a0b857f98a6bc922a0aa7 (HEAD, tag: b6841)
Author: Aman Gupta <[email protected]>
Date: Sun Oct 26 19:28:04 2025 +0800
CUDA: General GEMV fusion (#16715)
ggml/src/ggml-cuda/common.cuh | 13 +++++
ggml/src/ggml-cuda/convert.cuh | 1 +
ggml/src/ggml-cuda/ggml-cuda.cu | 353 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
ggml/src/ggml-cuda/mmvf.cu | 374 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------
ggml/src/ggml-cuda/mmvf.cuh | 3 +-
ggml/src/ggml-cuda/mmvq.cu | 314 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------------
ggml/src/ggml-cuda/mmvq.cuh | 2 +-
ggml/src/ggml-cuda/unary.cu | 14 +----
ggml/src/ggml-cuda/unary.cuh | 21 +++++++
src/llama-graph.cpp | 6 ++
tests/test-backend-ops.cpp | 161 +++++++++++++++++++++++++++++++++++++++++++++++++++
11 files changed, 1096 insertions(+), 166 deletions(-)
repeat building and running model with command above
works fine
git bisect reset
Previous HEAD position was f77c13b91 CUDA: General GEMV fusion (#16715)
HEAD is now at 7e994168b SYCL: optimized repeat_back kernel (3× fewer asm instructions, 2× faster)Feature/sycl repeat back opt (#16869)
Not sure if I did that bisect correctly.
@Panchovix in your command I see GML_CUDA_DISABLE_FUSION=1, is that a typo?
@Panchovix in your command I see
GML_CUDA_DISABLE_FUSION=1, is that a typo?
It was a typo when I copy-pasted, but I executed it as shown in this image (with the complete GGML prefix).
I've updated the command in the comment, my bad.
What I don't understand is that the GEMV fusion path is gated behind that env flag; the only other change is to llama-graph.cpp, which expands ffn_gate and ffn_up and potentially re-orders the graph. Can you take a look at whether reverting this change in llama-graph.cpp fixes your issue?
https://github.com/ggml-org/llama.cpp/pull/16715/files#diff-9be9eea14f4aefce7375482c05968900192634e88e92ac263cedb955a64ad7fe
What I don't understand is that the GEMV fusion path is gated behind that env flag; the only other change is to llama-graph.cpp, which expands ffn_gate and ffn_up and potentially re-orders the graph. Can you take a look at whether reverting this change in llama-graph.cpp fixes your issue?
https://github.com/ggml-org/llama.cpp/pull/16715/files#diff-9be9eea14f4aefce7375482c05968900192634e88e92ac263cedb955a64ad7fe
@am17an That did it! I commented out the two added ggml_build_forward_expand(gf, cur); calls in src/llama-graph.cpp and it now works at full speed.
I rebuilt with those lines commented out, then ran:
./ylenuxtesting/bin/llama-server -m '/run/media/pancho/Drive1_8TB/models_llm_2tb/DeepSeek-V3-0324-UD-Q3_K_XL.gguf' -c 32768 -ngl 999 --no-mmap \
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \
-ot "blk.(8|9|10|11).ffn.=CUDA1" \
-ot "blk.(12|13|14|15).ffn.=CUDA2" \
-ot "blk.(16|17|18|19|20).ffn.=CUDA3" \
-ot "blk.(21|22|23).ffn.=CUDA4" \
-ot "blk.(24|25|26|27).ffn.=CUDA4" \
-ot "blk.(28|29|30|31|32|33|34).ffn.=CUDA5" \
-ot "exps=CPU" \
-mg 0 -ub 2560 -b 2560
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 3: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 4: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
Device 5: NVIDIA A40, compute capability 8.6, VMM: yes
main: setting n_parallel = 4 and kv_unified = true
build: 6931 (7e994168b) with cc (GCC) 15.2.1 20251022 (Red Hat 15.2.1-3) for x86_64-redhat-linux
system info: n_threads = 12, n_threads_batch = 12, total_threads = 24
system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CUDA : ARCHS = 860,890,1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
...
slot launch_slot_: id 3 | task 0 | processing task
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 4373
slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 2560, batch.n_tokens = 2560, progress = 0.585410
slot update_slots: id 3 | task 0 | n_tokens = 2560, memory_seq_rm [2560, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 4373, batch.n_tokens = 1813, progress = 1.000000
slot update_slots: id 3 | task 0 | prompt done, n_tokens = 4373, batch.n_tokens = 1813
slot print_timing: id 3 | task 0 |
prompt eval time = 17690.04 ms / 4373 tokens ( 4.05 ms per token, 247.20 tokens per second)
eval time = 44615.35 ms / 477 tokens ( 93.53 ms per token, 10.69 tokens per second)
total time = 62305.39 ms / 4850 tokens
slot release: id 3 | task 0 | stop processing: n_tokens = 4849, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /completion 127.0.0.1 200
srv log_server_r: request: POST /tokenize 127.0.0.1 200
^Csrv operator(): operator(): cleaning up before exit...
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 5090) | 32109 = 1426 + ( 29671 = 25363 + 680 + 3628) + 1010 |
llama_memory_breakdown_print: | - CUDA1 (RTX 4090) | 24080 = 806 + ( 22369 = 19841 + 476 + 2052) + 905 |
llama_memory_breakdown_print: | - CUDA2 (RTX 4090) | 24077 = 851 + ( 22313 = 19842 + 476 + 1995) + 913 |
llama_memory_breakdown_print: | - CUDA3 (RTX 5090) | 32109 = 4034 + ( 27032 = 24357 + 680 + 1995) + 1042 |
llama_memory_breakdown_print: | - CUDA4 (RTX A6000) | 48539 = 5296 + ( 37492 = 34490 + 952 + 2050) + 5750 |
llama_memory_breakdown_print: | - CUDA5 (A40) | 48539 = 4216 + ( 38573 = 35639 + 884 + 2050) + 5748 |
llama_memory_breakdown_print: | - Host | 123387 = 122997 + 0 + 390 |
Great! @slaren I don't exactly know what happened here, but the TL;DR is that graph_compute_expand causes some non-trivial re-ordering of nodes in the --ot case, which leads to this performance drop (nothing to do with fusion).
The order of the nodes can affect the number of splits, and increase the amount of data that needs to be transferred between devices. You can use GGML_SCHED_DEBUG=2 to inspect the splits and maybe try to find an order that works better.
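For reference, a minimal sketch of how one might capture that output, reusing the kind of command from this thread (the model path and the reduced set of -ot flags are placeholders; --verbose is added on the assumption that the split dump is emitted at debug log level, and redirecting stderr to a file just makes it easier to diff between builds):
GGML_SCHED_DEBUG=2 ./llama-server --verbose -m '/path/to/model.gguf' \
-c 32768 -ngl 999 --no-mmap \
-ot "exps=CPU" \
-fa on -mg 0 -ub 2560 -b 2560 2> sched_splits.log
Comparing the dump from a build before the fusion commit against one from current master should show whether the re-ordering introduces extra splits or cross-device copies in the -ot case.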
So the solution for now is to apply this manual fix until something changes there, right?
@Panchovix Did you use GGML_CUDA_DISABLE_FUSION=1 at the same time as commenting out those two lines?
@jukofyork I did not; it worked "out of the box" when commenting out those lines.
I don't think this should be closed so easily. I built 66d8eccd42b5b5b2179c60a6d41376d3917f3b40 (latest) and 3cfa9c3f125763305b4226bc032f1954f08990dc (the commit before GEMV fusion) and compared my different multi-GPU setups. The latest build is always slower in pp (latest vs pre-fusion, t/s):
- 898.23 ± 0.12 vs 1197.69 ± 0.5
- 378.41 ± 0.85 vs 418.54 ± 0.52
- 614.36 ± 1.71 vs 800.91 ± 1.51
I didn't check whether it's caused by the lines mentioned above, but since the problem persists in every setup I tried, I think something is definitely wrong here.
Also, I ran into another problem on the latest build. I make two unrelated requests using chat completion, and the second request is much slower in both pp and tg on the latest build. It's probably connected with https://github.com/ggml-org/llama.cpp/pull/16736, but I'm not sure. Here are the logs:
Latest build:
slot launch_slot_: id 3 | task 0 | processing task
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 32000, n_keep = 0, task.n_tokens = 9088
slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.225352
slot update_slots: id 3 | task 0 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.450704
slot update_slots: id 3 | task 0 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.676056
slot update_slots: id 3 | task 0 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.901408
slot update_slots: id 3 | task 0 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 9088, batch.n_tokens = 896, progress = 1.000000
slot update_slots: id 3 | task 0 | prompt done, n_tokens = 9088, batch.n_tokens = 896
slot print_timing: id 3 | task 0 |
prompt eval time = 55852.28 ms / 9088 tokens ( 6.15 ms per token, 162.71 tokens per second)
eval time = 46206.06 ms / 332 tokens ( 139.17 ms per token, 7.19 tokens per second)
total time = 102058.34 ms / 9420 tokens
slot release: id 3 | task 0 | stop processing: n_tokens = 9419, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /completion 192.168.XXX.XXX 200
srv log_server_r: request: GET /v1/models 192.168.XXX.XXX 200
srv params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id 2 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 2 | task 337 | processing task
slot update_slots: id 2 | task 337 | new prompt, n_ctx_slot = 32000, n_keep = 0, task.n_tokens = 2053
slot update_slots: id 2 | task 337 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 2 | task 337 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.997565
slot update_slots: id 2 | task 337 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id 2 | task 337 | prompt processing progress, n_tokens = 2053, batch.n_tokens = 5, progress = 1.000000
slot update_slots: id 2 | task 337 | prompt done, n_tokens = 2053, batch.n_tokens = 5
slot print_timing: id 2 | task 337 |
prompt eval time = 17540.73 ms / 2053 tokens ( 8.54 ms per token, 117.04 tokens per second)
eval time = 22225.87 ms / 156 tokens ( 142.47 ms per token, 7.02 tokens per second)
total time = 39766.61 ms / 2209 tokens
slot release: id 2 | task 337 | stop processing: n_tokens = 2208, truncated = 0
Build before GEMV fusion:
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 32000, n_keep = 0, n_prompt_tokens = 9088
slot update_slots: id 0 | task 0 | n_past = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.225352
slot update_slots: id 0 | task 0 | n_past = 2048, memory_seq_rm [2048, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 4096, n_tokens = 2048, progress = 0.450704
slot update_slots: id 0 | task 0 | n_past = 4096, memory_seq_rm [4096, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 6144, n_tokens = 2048, progress = 0.676056
slot update_slots: id 0 | task 0 | n_past = 6144, memory_seq_rm [6144, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 8192, n_tokens = 2048, progress = 0.901408
slot update_slots: id 0 | task 0 | n_past = 8192, memory_seq_rm [8192, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 9088, n_tokens = 896, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 9088, n_tokens = 896
slot print_timing: id 0 | task 0 |
prompt eval time = 55895.17 ms / 9088 tokens ( 6.15 ms per token, 162.59 tokens per second)
eval time = 51003.89 ms / 360 tokens ( 141.68 ms per token, 7.06 tokens per second)
total time = 106899.06 ms / 9448 tokens
slot release: id 0 | task 0 | stop processing: n_past = 9447, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /completion 192.168.XXX.XXX 200
srv log_server_r: request: GET /v1/models 192.168.XXX.XXX 200
got exception: {"code":500,"message":"Assistant response prefill is incompatible with enable_thinking.","type":"server_error"}
srv log_server_r: request: POST /v1/chat/completions 192.168.XXX.XXX 500
srv params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 24327265651
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 9447, total state size = 3395.126 MiB
srv load: - looking for better prompt, base f_keep = 0.000, sim = 0.001
srv update: - cache state: 1 prompts, 3395.126 MiB (limits: 8192.000 MiB, 32000 tokens, 32000 est)
srv update: - prompt 0x5eb758fcbe90: 9447 tokens, checkpoints: 0, 3395.126 MiB
srv get_availabl: prompt cache update took 7050.51 ms
slot launch_slot_: id 0 | task 365 | processing task
slot update_slots: id 0 | task 365 | new prompt, n_ctx_slot = 32000, n_keep = 0, n_prompt_tokens = 3633
slot update_slots: id 0 | task 365 | old: ... [gMASK]<sop> | [System note: Write one reply
slot update_slots: id 0 | task 365 | new: ... [gMASK]<sop> | <|system|>
<TASK>
Start
slot update_slots: id 0 | task 365 | 151331 151333 84329 5185 25 9641 825 9846
slot update_slots: id 0 | task 365 | 151331 151333 151335 198 3125 7384 397 3479
slot update_slots: id 0 | task 365 | n_past = 2, memory_seq_rm [2, end)
slot update_slots: id 0 | task 365 | prompt processing progress, n_past = 2050, n_tokens = 2048, progress = 0.564272
slot update_slots: id 0 | task 365 | n_past = 2050, memory_seq_rm [2050, end)
slot update_slots: id 0 | task 365 | prompt processing progress, n_past = 3633, n_tokens = 1583, progress = 1.000000
slot update_slots: id 0 | task 365 | prompt done, n_past = 3633, n_tokens = 1583
slot print_timing: id 0 | task 365 |
prompt eval time = 18311.64 ms / 3631 tokens ( 5.04 ms per token, 198.29 tokens per second)
eval time = 33573.69 ms / 345 tokens ( 97.32 ms per token, 10.28 tokens per second)
total time = 51885.33 ms / 3976 tokens
slot release: id 0 | task 365 | stop processing: n_past = 3977, truncated = 0
@wallentri88 If you only comment out those lines on the latest commit, does that solve the issue? I wonder if a PR that lets you enable or disable those lines with an env variable would be worth it.
Technically those statements are needed only for the TG phase, to facilitate fusion.
I wonder if we should move them to the graph_optimize function in the backend.
If a test is needed for a possible PR with that change, I can try it.
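For what it's worth, here is a rough sketch of what such an env-variable toggle could look like. This is purely illustrative, not a proposed patch: LLAMA_GRAPH_FORCE_EXPAND and graph_force_expand() are made-up names, and whether the check belongs in llama-graph.cpp or in the backend's graph_optimize pass is exactly the open question above.
// Hypothetical sketch only (not an actual patch): gate the two statements
// behind an env variable so they can be toggled without rebuilding.
// As far as I understand, the two lines in llama-graph.cpp are
// ggml_build_forward_expand() calls added for the GEMV fusion path.
#include <cstdlib>

static bool graph_force_expand() {
    // LLAMA_GRAPH_FORCE_EXPAND is a made-up variable name for illustration;
    // unset or "1" keeps the current behaviour, "0" skips the extra expansion.
    const char * v = std::getenv("LLAMA_GRAPH_FORCE_EXPAND");
    return v == nullptr || std::atoi(v) != 0;
}

// ...and at the two call sites:
//     if (graph_force_expand()) {
//         ggml_build_forward_expand(gf, cur);
//     }
That would keep today's behaviour by default while letting people with --ot / CPU-offload setups opt out without patching the source.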
Same issue with dual 3090 + CPU offload, latest llama.cpp (built from source), with GLM 4.5 on Linux (PP for long context is halved). Things tested:
- -kvu didn't work
- GGML_CUDA_DISABLE_FUSION=1 as env variable didn't work
- GGML_CUDA_DISABLE_GRAPHS=1 as env variable didn't work
- commenting the two lines in llama-graph.cpp worked.
Fortunately, I found this issue while searching, and @am17an found how to "fix" it. It also uses less VRAM (on CUDA1), and the first GPU (CUDA0) is used more during PP than on current master without the "fix". Could it be that master distributes some compute to CUDA1, and because CUDA1 has worse PCIe bandwidth, PP ends up slower?
I am running nvidia driver 575.57.08, cuda 12.8 on Debian 13.
Command:
/home/pc/llama.cpp/build/bin/llama-server \
--model /home/pc/fast/GLM-4.5-GGUF/GLM-4.5-UD-Q2_K_XL-00001-of-00003.gguf \
-c 60500 \
--jinja \
--reasoning-format auto \
-fa on -ngl 99 -ub 2048 -b 8192 \
-t 14 \
--tensor_split 64,30 \
--n-cpu-moe 85 \
--no-warmup \
--cache-reuse 1024 \
--slot-save-path "/home/pc/fast/cache/llamacpp"
Numbers
Newest llama.cpp:
prompt eval time = 1234920.81 ms / 56698 tokens ( 21.78 ms per token, 45.91 tokens per second)
eval time = 1055488.89 ms / 3802 tokens ( 277.61 ms per token, 3.60 tokens per second)
total time = 2290409.71 ms / 60500 tokens
Commenting the two lines in llama-graph.cpp:
prompt eval time = 584759.42 ms / 56698 tokens ( 10.31 ms per token, 96.96 tokens per second)
eval time = 855014.94 ms / 3108 tokens ( 275.10 ms per token, 3.64 tokens per second)
total time = 1439774.36 ms / 59806 tokens
I would re-open the issue if possible, as this still happens on the latest version.