
Low performance - need help

Open RndUsr123 opened this issue 9 months ago • 3 comments

Describe the bug

I think I've set up everything correctly, yet I'm getting lower performance than expected and cannot figure out why.

With the exact same llama 3 GGUF model and similar settings I get around 90 tokens/s in both LM Studio and Jan, but barely 40 in the webui. The practical reason is that my gpu is not being used to its fullest, capping out at 50% utilization and a fraction of its TGP. The technical reason is unknown, and that's what I'm trying to figure out.

notable info:

  • the model is small enough to comfortably fit in VRAM, n-gpu-layers is set to 256 and 33/33 layers are reportedly offloaded.
  • no change in settings seems to meaningfully affect gpu usage at all; I've tested pretty much everything at this point.
  • cpu usage spikes to ~60% across all threads during generation, regardless of what the thread options are set to.
  • I'm using these arguments: --cpu-memory 0 --gpu-memory 24 --bf16.
  • I noticed 'ggml_cuda_init: CUDA_USE_TENSOR_CORES: no', which is potentially concerning (?)
  • I've re-done the setup process to ensure I didn't mess anything up the first time.

I'm at a loss and any hint is greatly appreciated

Is there an existing issue for this?

  • [x] I have searched the existing issues

Reproduction

Generate anything regardless of prompt, parameters, mode and model import settings. Observe the reported performance.

Screenshot

No response

Logs

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   344.44 MiB
llm_load_tensors:      CUDA0 buffer size =  5115.49 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |

System Info

os: win 10
gpu: rtx 4090

RndUsr123 avatar Apr 28 '24 00:04 RndUsr123

--cpu-memory 0 --gpu-memory 24 --bf16 are not used by llama.cpp; they are for the transformers loader.

A bit of nerdiness on my part:

--cpu-memory 0 is not needed because you have already offloaded all the gpu layers (in your case, 33 layers is the maximum for this model). --gpu-memory 24 is not needed unless you want to limit the VRAM usage or list the VRAM capacities of multiple gpus; in llama.cpp that is done via tensor_split. --bf16 is a precision option for transformers; for llama.cpp the equivalent is an f16 GGUF, which you have to download or generate yourself.
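
For reference, a launch line for the llama.cpp loader would look something like this (the model filename is just an example, and double-check the exact flag names with python server.py --help):

  python server.py --model llama-3-8B-Instruct.Q5_K_M.gguf --loader llama.cpp --n-gpu-layers 33 --n_ctx 8192

Multi-gpu splitting would then be done with something like --tensor_split 18,17 instead of --gpu-memory.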

CUDA_USE_TENSOR_CORES: no may be the cause, but it's not certain. There used to be an option in the settings to enable tensor cores, but it's not there now. (On my rtx3060 and 4070ti the gain is negligible or zero; I don't know how much it would give you on your rtx4090.) It can be added manually to the model's config (text-generation-webui/models/config-user.yaml): tensorcores: true.

About your 50% limit and TGP limit, I don't understand what you mean. If you would like the model to take up more of your VRAM, you can download the f16 version; it should take up about 16GB of VRAM (not counting context). If you just want your video card to eat electricity non-stop, I don't even know what to suggest...

I can assume from these lines that you are using quantization like Q5 or Q6:

llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   344.44 MiB
llm_load_tensors:      CUDA0 buffer size =  5115.49 MiB

Is that so?

Do you use exactly the same gguf quantization in Jan and LM Studio?

PS: LM Studio doesn't prepend a prompt, while the webui chat processes the user prompt and the character context before responding; that prompt processing counts toward the reported speed and makes the first generation take longer than subsequent ones.

Alkohole avatar Apr 30 '24 07:04 Alkohole

Thanks for tuning in!

The cpu and gpu memory arguments were set as a last resort, really. I understand the model is fully offloaded to VRAM, but for some reason I get this relatively high cpu usage along with half the gpu performance, so I just tried :)

By 50% utilization I meant the gpu only reaches 50% reported usage in Afterburner, a reliable monitoring tool, meaning it is not being fully or efficiently used. This is further underscored by a low power draw of just over 100W with my test model. My gpu is undervolted and manually tuned, so it's never going to reach the 450W stock power limit anyway, but I see ~200W of usage with performance hovering around 90 tk/s in Jan, so...

I triple checked I was using the same models (Q5_K_M in my example, like you said) but went further and tested a few more of different sizes: each seems to have the exact same issue, regardless of parameter count or quantization. I noticed, however, that bigger models tend to have higher relative performance, with losses around 40% rather than 60%. Furthermore, this seems to be exclusive to GGUFs (an issue with llama.cpp maybe?), as their EXL2 counterparts are roughly twice as fast on my setup. I didn't conduct any proper scientific testing of this aspect, mind you, but they're indeed significantly faster and more in line with GGUFs in Jan.

As for the tensor cores thing, where am I supposed to add tensorcores: true exactly? I had no config-user.yaml, so I created it and appended the line to it; I also appended it to config.yaml, both at the end of the file and under the llama section, yet none of this worked.

BTW, what kind of performance are you getting out of your 4070ti + 3060, if you don't mind me asking?

RndUsr123 avatar Apr 30 '24 14:04 RndUsr123

You need to fill in the model's parameters in the UI and then save them, after which the config-user.yaml file will appear in the models folder. Approximate view:

llama-3-8B$:
  loader: llamacpp_HF
  trust_remote_code: false
  no_use_fast: true
  cfg_cache: false
  threads: 0
  threads_batch: 0
  n_batch: 512
  no_mmap: false
  mlock: false
  no_mul_mat_q: false
  n_gpu_layers: 33
  tensor_split: ''
  n_ctx: 8192
  compress_pos_emb: 1
  alpha_value: 1
  rope_freq_base: 500000
  numa: false
  logits_all: false
  no_offload_kqv: false
  row_split: false
  tensorcores: true
  streaming_llm: true
  attention_sink_size: 5

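If editing the yaml doesn't take effect, the same option can also be passed at launch time; as far as I remember the webui has a --tensorcores flag for this (check python server.py --help to be sure, and the model filename below is again just an example):

  python server.py --model llama-3-8B-Instruct.Q5_K_M.gguf --loader llama.cpp --n-gpu-layers 33 --tensorcores
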
I didn't compare the 4070ti and 3060 working as a duet; they are installed in two different PCs and my power supply can't handle them both. But each of them gives me answers at about the same rate, 25-32 tokens per second with Q8 7-8B models. The rtx3060 draws 120W during a request and 19W at idle, out of its 170W limit, as GreenWithEnvy tells me.

I haven't figured out how to make Jan work with local models on linux, so I don't know what speeds Jan gives, but LM Studio gives me about the same speed.

Also, I'm not familiar with how EXL2 works; its forebear GPTQ ran like there was no tomorrow and could spin my electricity meter non-stop like a vicious bully. (That was about a year ago.)

Edit: I updated LM Studio from 0.2.14 to 0.2.21, and yes, there is a gain of about 2-5 tokens/s on the rtx3060, but nothing like the huge difference you are seeing.

Alkohole avatar Apr 30 '24 18:04 Alkohole