llama.cpp
Eval bug: llama cpp becomes slower as the number of threads -t increases
Name and Version
./llama-server --version
load_backend: loaded CPU backend from ./libggml-cpu-haswell.so
version: 4457 (ee7136c6)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CPU
Hardware
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 6
On-line CPU(s) list: 0-5
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 6
Stepping: 7
BogoMIPS: 4399.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 192 KiB (6 instances)
L1i cache: 192 KiB (6 instances)
L2 cache: 24 MiB (6 instances)
L3 cache: 96 MiB (6 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-5
Models
Qwen 2.5 0.5B Instruct (Q4_K_S)
Problem description & steps to reproduce
I have a VPS with 6 vCPUs.
llama.cpp becomes slower as the number of threads (-t) increases. The best performance is with -t set to 1. How can I solve this problem?
I am running llama.cpp from a Dockerfile with the configuration below:
FROM ghcr.io/ggerganov/llama.cpp:server
COPY . /opt/.
CMD [ "-m","/opt/models/qwen-2.5_0.5b-chat_2024-01-13_21-03.Q4_K_S.gguf", "-c","256", "--host", "0.0.0.0", "--port", "8080","-fa", "-t","1","--mlock","-b","256","--no-escape"]
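Since the host is a KVM guest and the server runs in a container, it may be worth checking how many CPUs the process can actually use before tuning -t. The following is a minimal sketch, not a definitive diagnosis. It assumes Python 3 is available in the environment being checked (the server image may not ship it) and that the host uses cgroup v2, so that a CPU quota, if any, appears in /sys/fs/cgroup/cpu.max. The function name effective_cpus is hypothetical.

```python
import os
from pathlib import Path


def effective_cpus(cgroup_cpu_max="/sys/fs/cgroup/cpu.max"):
    """Return (affinity_count, quota_cpus) for the current process.

    quota_cpus is None when no cgroup v2 CPU quota is set or the
    file does not exist (assumption: cgroup v2 layout).
    """
    # Number of CPUs the scheduler allows this process to run on (Linux only).
    affinity = len(os.sched_getaffinity(0))

    quota = None
    path = Path(cgroup_cpu_max)
    if path.is_file():
        # cgroup v2 format: "<quota> <period>" or "max <period>".
        fields = path.read_text().split()
        if len(fields) == 2 and fields[0] != "max":
            quota = int(fields[0]) / int(fields[1])
    return affinity, quota


if __name__ == "__main__":
    affinity, quota = effective_cpus()
    print(f"CPUs in affinity mask: {affinity}")
    if quota is None:
        print("cgroup v2 CPU quota: none found")
    else:
        print(f"cgroup v2 CPU quota: {quota:.2f} CPUs")
```

If either number is below the value passed to -t, the extra threads would be competing for the same CPU time, which could contribute to the slowdown described above. This check says nothing about how the hypervisor schedules the vCPUs onto physical cores.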
First Bad Commit
No response
Relevant log output
| slot launch_slot_: id 0 | task 295 | processing task
| slot update_slots: id 0 | task 295 | new prompt, n_ctx_slot = 256, n_keep = 0, n_prompt_tokens = 71
| slot update_slots: id 0 | task 295 | kv cache rm [38, end)
| slot update_slots: id 0 | task 295 | prompt processing progress, n_past = 71, n_tokens = 33, progress = 0.464789
| slot update_slots: id 0 | task 295 | prompt done, n_past = 71, n_tokens = 33
| slot release: id 0 | task 295 | stop processing: n_past = 154, truncated = 0
| slot print_timing: id 0 | task 295 |
| prompt eval time = 2559.11 ms / 33 tokens ( 77.55 ms per token, 12.90 tokens per second)
| eval time = 15404.57 ms / 84 tokens ( 183.39 ms per token, 5.45 tokens per second)
| total time = 17963.69 ms / 117 tokens
| srv update_slots: all slots are idle
| request: POST /v1/chat/completions 172.18.0.3 200
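To compare runs with different -t values, the timing lines can be reduced to tokens per second. The sketch below parses the two timing lines from the log above and recomputes the throughput from the raw milliseconds and token counts. The regular expression is an assumption based on the log format shown here, and parse_timings is a hypothetical helper, not part of llama.cpp.

```python
import re

# Timing lines copied from the log output above.
LOG = """\
| prompt eval time = 2559.11 ms / 33 tokens ( 77.55 ms per token, 12.90 tokens per second)
| eval time = 15404.57 ms / 84 tokens ( 183.39 ms per token, 5.45 tokens per second)
| total time = 17963.69 ms / 117 tokens
"""

# Matches "<label> time = <ms> ms / <n> tokens" for the two eval phases.
TIMING = re.compile(r"(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens")


def parse_timings(text):
    """Return {label: (milliseconds, tokens, tokens_per_second)}."""
    results = {}
    for label, ms, n in TIMING.findall(text):
        ms, n = float(ms), int(n)
        results[label] = (ms, n, n / (ms / 1000.0))
    return results


if __name__ == "__main__":
    for label, (ms, n, tps) in parse_timings(LOG).items():
        print(f"{label}: {n} tokens in {ms:.2f} ms, {tps:.2f} tokens/s")
```

Running the same parser over the logs from each -t setting gives directly comparable prompt-processing and generation rates.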