
Eval bug: very slow inference on DeepSeek-R1-Distill-Qwen-32B

Open · lexasub opened this issue 1 month ago • 17 comments

Name and Version

Built from master (Jan 22)

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 3060 on the main PC + (RTX 3060 + RTX 3060 + GTX 1660 Ti/SUPER + GTX 1660 Ti/SUPER) on the other PC

Models

DeepSeek-R1-Distill-Qwen-32B_q8

Problem description & steps to reproduce

Hello, I am running inference with https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-32B (converted locally to a Q8_0 GGUF) split across two PCs over the network (100 Mbit WAN link through a VPN; iperf results look fine). Inference is very slow, with low CPU/GPU usage on both PCs.

./llama-server --host 192.168.2.109 --port 8080 -m unsloth_DeepSeek-R1-Distill-Qwen-32B/DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf --rpc 10.2.0.5:1000,10.2.0.5:1001,10.2.0.5:1002,10.2.0.5:1003 -ngl 99999

main host: [screenshot]
remote host: [screenshot]
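For completeness, the remote GPUs have to be exposed by running one rpc-server instance per device on the second PC; the report does not show how they were started. A minimal sketch of what that presumably looks like, assuming the stock llama.cpp rpc-server binary; the device indices, the bind address, and the use of ports 1000-1003 to match the --rpc list above are assumptions, not taken from the report:

```shell
# On the remote PC (10.2.0.5 over the VPN): one rpc-server per GPU,
# each pinned to a single device and listening on its own port.
# Ports must match the --rpc list passed to llama-server on the main host.
CUDA_VISIBLE_DEVICES=0 ./rpc-server -H 0.0.0.0 -p 1000 &
CUDA_VISIBLE_DEVICES=1 ./rpc-server -H 0.0.0.0 -p 1001 &
CUDA_VISIBLE_DEVICES=2 ./rpc-server -H 0.0.0.0 -p 1002 &
CUDA_VISIBLE_DEVICES=3 ./rpc-server -H 0.0.0.0 -p 1003 &
```

Note that with layers split across RPC backends, intermediate activations cross the 100 Mbit VPN link on every generated token, in addition to the one-time weight upload at load time, so the RPC transport sits directly on the token-generation path.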

First Bad Commit

No response

Relevant log output

llama_init_from_model: n_seq_max     = 1                                                                                    
llama_init_from_model: n_ctx         = 4096                                                                                 
llama_init_from_model: n_ctx_per_seq = 4096                                                                                 
llama_init_from_model: n_batch       = 2048                                                                                 
llama_init_from_model: n_ubatch      = 512                                                                                  
llama_init_from_model: flash_attn    = 0                                                                                    
llama_init_from_model: freq_base     = 1000000.0                                                                            
llama_init_from_model: freq_scale    = 1                                                                                    
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized   
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1               
llama_kv_cache_init: RPC[10.2.0.5:1000] KV buffer size =   256.00 MiB                                                       
llama_kv_cache_init: RPC[10.2.0.5:1001] KV buffer size =   272.00 MiB                                                       
llama_kv_cache_init: RPC[10.2.0.5:1002] KV buffer size =   128.00 MiB                                                       
llama_kv_cache_init: RPC[10.2.0.5:1003] KV buffer size =    96.00 MiB                                                       
llama_kv_cache_init:      CUDA0 KV buffer size =   272.00 MiB 
llama_init_from_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB  
llama_init_from_model:        CPU  output buffer size =     0.58 MiB                                                        
llama_init_from_model:      CUDA0 compute buffer size =   368.00 MiB                                                        
llama_init_from_model: RPC[10.2.0.5:1000] compute buffer size =   368.00 MiB                                                
llama_init_from_model: RPC[10.2.0.5:1001] compute buffer size =   368.00 MiB                                                
llama_init_from_model: RPC[10.2.0.5:1002] compute buffer size =   368.00 MiB                                                
llama_init_from_model: RPC[10.2.0.5:1003] compute buffer size =   368.00 MiB                                                
llama_init_from_model:  CUDA_Host compute buffer size =    18.01 MiB                                                        
llama_init_from_model: graph nodes  = 2246                                                                                  
llama_init_from_model: graph splits = 6                                                                                     
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096                                                      
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)                  
srv          init: initializing slots, n_slots = 1                                                                          
slot         init: id  0 | task -1 | new slot n_ctx_slot = 4096        



slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 10
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 10, n_tokens = 10, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 10, n_tokens = 10
slot      release: id  0 | task 0 | stop processing: n_past = 130, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =    2635.52 ms /    10 tokens (  263.55 ms per token,     3.79 tokens per second)
       eval time =  197690.16 ms /   121 tokens ( 1633.80 ms per token,     0.61 tokens per second)
      total time =  200325.68 ms /   131 tokens
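As a sanity check, the per-phase rates follow directly from the token counts and times in the log above:

```
prompt eval:  10 tokens /   2.636 s ≈ 3.79 tokens/s (≈  263.6 ms/token)
generation:  121 tokens / 197.690 s ≈ 0.61 tokens/s (≈ 1633.8 ms/token)
total:       131 tokens / 200.326 s
```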

lexasub · Jan 22 '25, 21:01