llama.cpp
perf: Investigate performance discrepancy with llama-rs - 1.5x-2x slower
Preliminary results show that llama.cpp is 1.5x-2x slower than llama-rs. Both were verified to build with the same arch flags and the same GNU toolchain.
Summary (on Vicuna 13B, 2048 ctx size, 256 predict tokens):
llama.cpp: 430.44 ms per run
llama-rs: per_token_duration: 272.793 ms
An interesting observation is that CPU utilization is lower with llama-rs.
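For reference, a rough reproduction of the llama.cpp side of the comparison would look like the following; the model path and prompt are placeholders, not the exact ones used:
# hypothetical invocation matching the summary above: 2048 ctx, 256 predicted tokens
./main -m ./models/vicuna-13b-q4_0.bin -p "Write a short story about a robot." -c 2048 -n 256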
System Info:
llama.cpp
> make
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:
I CC: cc (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
I CXX: g++ (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
./main
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama-rs
warning: Using gnu
warning: Using MAVX
warning: Using AVX2
warning: Using FMA
warning: Using F16C
warning: Using SSE3
No BLAS.
Note: the llama-rs benchmark was run on my branch.
I think I made a regression in https://github.com/ggerganov/llama.cpp/commit/c3ac702e5ee3533457e0489df4906ee112fe88e7
Can you check that reverting it solves the issue?
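One way to test that (a sketch, assuming a clean working tree) is to rebuild from the commit just before the suspect one and re-run the same benchmark:
# check out the parent of the suspected regression, rebuild, and re-run
git checkout c3ac702e5ee3533457e0489df4906ee112fe88e7~1
make clean && make
./main -m ./models/vicuna-13b-q4_0.bin -c 2048 -n 256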
I checked out https://github.com/ggerganov/llama.cpp/commit/9d634ef452d0fc24fcd49592952d13d0ab0f41b7. There was no improvement; in fact, performance regressed.
You can see visually how much slower llama.cpp is: llama-rs, llama.cpp
Compiling with OpenBLAS makes things even worse: 560.66 ms per run.
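One possible explanation, an assumption rather than something verified here, is that OpenBLAS spins up its own thread pool on top of ggml's and oversubscribes the CPU; a quick sanity check is to pin OpenBLAS to a single thread:
# keep OpenBLAS single-threaded so it does not compete with ggml's threads
OPENBLAS_NUM_THREADS=1 ./main -m ./models/vicuna-13b-q4_0.bin -c 2048 -n 256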
Oops, I found the issue: hyperthreading (https://github.com/ggerganov/llama.cpp/issues/34#issuecomment-1465313724).
Using 8 threads instead of 16: -t 8 -n 128
llama_print_timings: load time = 1421.72 ms
llama_print_timings: sample time = 74.37 ms / 128 runs ( 0.58 ms per run)
llama_print_timings: prompt eval time = 720.68 ms / 4 tokens ( 180.17 ms per token)
llama_print_timings: eval time = 34420.26 ms / 127 runs ( 271.03 ms per run)
llama_print_timings: total time = 35918.93 ms
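A quick way to confirm the sweet spot (sketch; the model path is a placeholder) is to sweep the thread count and compare the reported eval times:
# sweep -t; on a machine with 8 physical cores / 16 hardware threads, 8 should win
for t in 4 8 12 16; do
  echo "threads: $t"
  ./main -m ./models/vicuna-13b-q4_0.bin -t $t -n 128 2>&1 | grep "eval time"
done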
I think we should make the default the number of physical CPU cores. @ggerganov
Pretty sure the default version of the code uses like 4. Or at least the initial examples.
On linux, the default is number of logical threads: https://github.com/ggerganov/llama.cpp/blob/e7f6997f897a18b6372a6460e25c5f89e1469f1d/examples/common.cpp#L35
Oh damn. That's why people are complaining when they use all their threads 😂 I guess that's the one bonus Windows has in this case.
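Until the default changes, a possible workaround (sketch, assuming lscpu is available and the same placeholder model path as above) is to count physical cores and pass that explicitly via -t:
# count unique physical cores on Linux and use that instead of the logical CPU count
phys_cores=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)
./main -m ./models/vicuna-13b-q4_0.bin -t "$phys_cores" -n 128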