llama.cpp
perf: Investigate performance discrepancy with llama-rs - 1.5x-2x slower
Preliminary results show that llama.cpp is 1.5x-2x slower than llama-rs. Both were verified to build with the same arch flags and the same GNU toolchain.
Summary (on Vicuna 13B, 2048 ctx size, 256 predict tokens):
llama.cpp: 430.44 ms per run
llama-rs: per_token_duration: 272.793 ms
An interesting observation is that CPU utilization is lower with llama-rs.
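For reference, a rough reproduction of the llama.cpp side of the comparison would look like the following; the model path and prompt are placeholders, not the exact ones used:
# hypothetical invocation matching the summary above: 2048 ctx, 256 predicted tokens
./main -m ./models/vicuna-13b-q4_0.bin -p "Write a short story about a robot." -c 2048 -n 256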
System Info:
llama.cpp
> make
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:
I CC: cc (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
I CXX: g++ (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
./main
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama-rs
warning: Using gnu
warning: Using MAVX
warning: Using AVX2
warning: Using FMA
warning: Using F16C
warning: Using SSE3
No BLAS.
Note: the llama-rs benchmark was run on my branch.
I think I made a regression in https://github.com/ggerganov/llama.cpp/commit/c3ac702e5ee3533457e0489df4906ee112fe88e7
Can you check that reverting it solves the issue?
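One way to test that (a sketch, assuming a clean working tree) is to rebuild from the commit just before the suspect one and re-run the same benchmark:
# check out the parent of the suspected regression, rebuild, and re-run
git checkout c3ac702e5ee3533457e0489df4906ee112fe88e7~1
make clean && make
./main -m ./models/vicuna-13b-q4_0.bin -c 2048 -n 256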
I checked out https://github.com/ggerganov/llama.cpp/commit/9d634ef452d0fc24fcd49592952d13d0ab0f41b7. There was no improvement; in fact, performance regressed.
You can see visually how much slower llama.cpp is: llama-rs, llama.cpp
Compiling with OpenBLAS makes things even worse: 560.66 ms per run.
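One possible explanation, an assumption rather than something verified here, is that OpenBLAS spins up its own thread pool on top of ggml's and oversubscribes the CPU; a quick sanity check is to pin OpenBLAS to a single thread:
# keep OpenBLAS single-threaded so it does not compete with ggml's threads
OPENBLAS_NUM_THREADS=1 ./main -m ./models/vicuna-13b-q4_0.bin -c 2048 -n 256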
Oops, I found the issue: hyperthreading (https://github.com/ggerganov/llama.cpp/issues/34#issuecomment-1465313724).
Using 8 threads instead of 16: -t 8 -n 128
llama_print_timings: load time = 1421.72 ms
llama_print_timings: sample time = 74.37 ms / 128 runs ( 0.58 ms per run)
llama_print_timings: prompt eval time = 720.68 ms / 4 tokens ( 180.17 ms per token)
llama_print_timings: eval time = 34420.26 ms / 127 runs ( 271.03 ms per run)
llama_print_timings: total time = 35918.93 ms
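A quick way to confirm the sweet spot (sketch; the model path is a placeholder) is to sweep the thread count and compare the reported eval times:
# sweep -t; on a machine with 8 physical cores / 16 hardware threads, 8 should win
for t in 4 8 12 16; do
  echo "threads: $t"
  ./main -m ./models/vicuna-13b-q4_0.bin -t $t -n 128 2>&1 | grep "eval time"
done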
I think we should make the default the number of physical CPU cores. @ggerganov
Pretty sure the default version of the code uses like 4. Or at least the initial examples.
On linux, the default is number of logical threads: https://github.com/ggerganov/llama.cpp/blob/e7f6997f897a18b6372a6460e25c5f89e1469f1d/examples/common.cpp#L35
Oh damn. That's why people are complaining when they use all their threads 😂 I guess that's the one bonus Windows has in this case.
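Until the default changes, a possible workaround (sketch, assuming lscpu is available and the same placeholder model path as above) is to count physical cores and pass that explicitly via -t:
# count unique physical cores on Linux and use that instead of the logical CPU count
phys_cores=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)
./main -m ./models/vicuna-13b-q4_0.bin -t "$phys_cores" -n 128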