Eval bug: very slow inference on DeepSeek-R1-Distill-Qwen-32B
Name and Version
Built from master (22 Jan)
Operating systems
Linux
GGML backends
CUDA
Hardware
RTX 3060 on the main PC + (RTX 3060 + RTX 3060 + GTX 1660 Ti/SUPER + GTX 1660 Ti/SUPER) on the other PC
Models
DeepSeek-R1-Distill-Qwen-32B_q8
Problem description & steps to reproduce
Hello, I am running inference with https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-32B (converted locally to GGUF Q8) split across 2 PCs over the network (100 Mb WAN link via VPN; iperf results are good), and I get very slow inference with low CPU/GPU usage on both PCs.
./llama-server --host 192.168.2.109 --port 8080 -m unsloth_DeepSeek-R1-Distill-Qwen-32B/DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf --rpc 10.2.0.5:1000,10.2.0.5:1001,10.2.0.5:1002,10.2.0.5:1003 -ngl 99999
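The rpc-server instances on the remote PC (one per GPU, listening on ports 1000-1003 of 10.2.0.5) were presumably started along these lines; this is only a sketch, since the exact commands were not included above and flag names may differ between builds:
CUDA_VISIBLE_DEVICES=0 ./rpc-server --host 0.0.0.0 --port 1000
CUDA_VISIBLE_DEVICES=1 ./rpc-server --host 0.0.0.0 --port 1001
CUDA_VISIBLE_DEVICES=2 ./rpc-server --host 0.0.0.0 --port 1002
CUDA_VISIBLE_DEVICES=3 ./rpc-server --host 0.0.0.0 --port 1003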
main host:
remote host:
First Bad Commit
No response
Relevant log output
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1
llama_kv_cache_init: RPC[10.2.0.5:1000] KV buffer size = 256.00 MiB
llama_kv_cache_init: RPC[10.2.0.5:1001] KV buffer size = 272.00 MiB
llama_kv_cache_init: RPC[10.2.0.5:1002] KV buffer size = 128.00 MiB
llama_kv_cache_init: RPC[10.2.0.5:1003] KV buffer size = 96.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 272.00 MiB
llama_init_from_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_init_from_model: CPU output buffer size = 0.58 MiB
llama_init_from_model: CUDA0 compute buffer size = 368.00 MiB
llama_init_from_model: RPC[10.2.0.5:1000] compute buffer size = 368.00 MiB
llama_init_from_model: RPC[10.2.0.5:1001] compute buffer size = 368.00 MiB
llama_init_from_model: RPC[10.2.0.5:1002] compute buffer size = 368.00 MiB
llama_init_from_model: RPC[10.2.0.5:1003] compute buffer size = 368.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 18.01 MiB
llama_init_from_model: graph nodes = 2246
llama_init_from_model: graph splits = 6
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 4096
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 10
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 10, n_tokens = 10, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 10, n_tokens = 10
slot release: id 0 | task 0 | stop processing: n_past = 130, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 2635.52 ms / 10 tokens ( 263.55 ms per token, 3.79 tokens per second)
eval time = 197690.16 ms / 121 tokens ( 1633.80 ms per token, 0.61 tokens per second)
total time = 200325.68 ms / 131 tokens