llama.cpp Eval bug: llama cpp becomes slower as the number of threads -t increases

Open wathuta opened this issue 1 month ago • 2 comments

Name and Version

./llama-server --version
load_backend: loaded CPU backend from ./libggml-cpu-haswell.so
version: 4457 (ee7136c6)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CPU

Hardware

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 6
On-line CPU(s) list: 0-5
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 6
Stepping: 7
BogoMIPS: 4399.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 192 KiB (6 instances)
L1i cache: 192 KiB (6 instances)
L2 cache: 24 MiB (6 instances)
L3 cache: 96 MiB (6 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-5

Models

Qwen 2.5 0.5b instruct q4_k_s

Problem description & steps to reproduce

I have a 6 vCPU VPS, and llama.cpp becomes slower as the number of threads (-t) increases. The best performance is when -t is set to 1. How can I solve this problem?

I am running llama.cpp from a Dockerfile with the configuration below:

FROM ghcr.io/ggerganov/llama.cpp:server

COPY . /opt/.

CMD [ "-m","/opt/models/qwen-2.5_0.5b-chat_2024-01-13_21-03.Q4_K_S.gguf", "-c","256", "--host", "0.0.0.0", "--port", "8080","-fa", "-t","1","--mlock","-b","256","--no-escape"]

First Bad Commit

No response

Relevant log output

| slot launch_slot_: id  0 | task 295 | processing task
| slot update_slots: id  0 | task 295 | new prompt, n_ctx_slot = 256, n_keep = 0, n_prompt_tokens = 71
| slot update_slots: id  0 | task 295 | kv cache rm [38, end)
| slot update_slots: id  0 | task 295 | prompt processing progress, n_past = 71, n_tokens = 33, progress = 0.464789
| slot update_slots: id  0 | task 295 | prompt done, n_past = 71, n_tokens = 33
| slot      release: id  0 | task 295 | stop processing: n_past = 154, truncated = 0
| slot print_timing: id  0 | task 295 | 
| prompt eval time =    2559.11 ms /    33 tokens (   77.55 ms per token,    12.90 tokens per second)
|        eval time =   15404.57 ms /    84 tokens (  183.39 ms per token,     5.45 tokens per second)
|       total time =   17963.69 ms /   117 tokens
| srv  update_slots: all slots are idle
| request: POST /v1/chat/completions 172.18.0.3 200
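
To compare runs at different -t values, the throughput figures in the timing lines above can be pulled out of the log with a short script. This is only a minimal sketch, assuming the log format shown above; `TIMING_RE`, `SAMPLE_LOG`, and `parse_timings` are hypothetical names made up for illustration and are not part of llama.cpp.

```python
import re

# Matches llama-server timing lines of the form shown in the log above, e.g.
#   prompt eval time = 2559.11 ms / 33 tokens (...)
# The "total time" line is deliberately not matched.
TIMING_RE = re.compile(
    r"(?P<phase>prompt eval|eval)\s+time\s*=\s*"
    r"(?P<ms>[\d.]+)\s*ms\s*/\s*(?P<tokens>\d+)\s*tokens"
)

# Timing lines copied from the log output in this report (run with -t 1).
SAMPLE_LOG = """\
| prompt eval time =    2559.11 ms /    33 tokens (   77.55 ms per token,    12.90 tokens per second)
|        eval time =   15404.57 ms /    84 tokens (  183.39 ms per token,     5.45 tokens per second)
|       total time =   17963.69 ms /   117 tokens
"""


def parse_timings(log_text):
    """Return {phase: tokens_per_second} for each timing line found."""
    rates = {}
    for match in TIMING_RE.finditer(log_text):
        seconds = float(match.group("ms")) / 1000.0
        tokens = int(match.group("tokens"))
        rates[match.group("phase")] = tokens / seconds
    return rates


if __name__ == "__main__":
    for phase, rate in parse_timings(SAMPLE_LOG).items():
        print(f"{phase}: {rate:.2f} tokens/s")
```

Running the same request once per -t value and feeding each log through `parse_timings` would give one pair of numbers per thread count, which makes the slowdown easier to report side by side.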

Jan 15 '25 04:01 wathuta
