
CUDA performance bug when two cards are visible and only one is used

Open · cmp-nct opened this issue 9 months ago · 0 comments

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 5109 (66d17c5a)
built with MSVC 19.40.33811.0 for x64

Operating systems

Windows

GGML backends

CUDA

Hardware

4090+3090

Models

Qwen2.5 14B Instruct (1M) Q5_K_M

Problem description & steps to reproduce

Testing: -m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0

Token generation speed on my 4090 is 55 tokens/sec with this command. Prefixing it with $env:CUDA_VISIBLE_DEVICES = "0"; (to force computation onto GPU 0) boosts token generation speed to 65 tokens/sec.

The log shows that in BOTH cases only one card is used.

Something is slowing the CUDA backend down significantly when a second GPU is visible, even if nothing is offloaded to it and it is just sitting idle.
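For anyone embedding llama.cpp through its C API rather than the CLI, here is a minimal sketch of the same masking workaround (assuming a recent llama.h; the variable has to be set before the CUDA backend enumerates devices, and _putenv_s is the Windows CRT call since this report is on Windows):

```cpp
// Sketch only: hide the idle second GPU by masking it before any CUDA
// initialization, equivalent to the PowerShell prefix
// $env:CUDA_VISIBLE_DEVICES = "0"; used above.
#include <cstdlib>
#include "llama.h"

int main() {
    // Must happen before llama_backend_init() / the first CUDA call,
    // otherwise the driver has already enumerated both devices.
    _putenv_s("CUDA_VISIBLE_DEVICES", "0");   // use setenv() on POSIX

    llama_backend_init();

    // ... load the model and generate as usual ...

    llama_backend_free();
    return 0;
}
```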

First Bad Commit

No response

Relevant log output

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:        CUDA0 model buffer size =  9505.88 MiB
load_tensors:   CPU_Mapped model buffer size =   510.47 MiB
..........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 4000
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (1010000) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
init:      CUDA0 KV buffer size =   768.00 MiB
llama_context: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context:      CUDA0 compute buffer size =   379.02 MiB
llama_context:  CUDA_Host compute buffer size =    42.02 MiB
llama_context: graph nodes  = 1591
llama_context: graph splits = 2

Here is the difference when only one card is visible:

llama_context: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_context:      CUDA0 compute buffer size =   307.00 MiB
llama_context:  CUDA_Host compute buffer size =    18.01 MiB
llama_context: graph nodes  = 1591
llama_context: graph splits = 2

So "pipeline parallelism" is enabled and causes internal delays despite having nothing offloaded on the 2nd gpu

Update: Manually adding -sm none solves the problem, but that is not something most people using llama.cpp would figure out. Any GPU that is not being offloaded to should not be used by default.
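For completeness, a sketch of the programmatic equivalent of -sm none -mg 0 for people embedding llama.cpp; the field and enum names follow recent versions of llama.h and should be checked against the header you actually build with:

```cpp
#include "llama.h"

// Sketch: force single-GPU execution from the C API, mirroring the
// `-sm none -mg 0` command-line workaround.
int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 100;                    // as with -ngl 100
    mparams.split_mode   = LLAMA_SPLIT_MODE_NONE;  // do not spread tensors across GPUs
    mparams.main_gpu     = 0;                      // keep everything on device 0 (the 4090)

    llama_model * model = llama_model_load_from_file(
        "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf", mparams);

    // ... create a context and generate as usual ...

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```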

cmp-nct · Apr 09 '25 00:04