
Have there been any performance comparisons between gemma.cpp and llama.cpp?

Open shiwenloong opened this issue 10 months ago • 3 comments

llama.cpp is widely recognized for deploying LLMs, including Gemma. Have there been any performance comparisons between gemma.cpp and llama.cpp?

shiwenloong avatar Apr 12 '24 09:04 shiwenloong

We view gemma.cpp as a platform for experimentation rather than a deployment-focused tool, but it's still intended to be reasonably fast. We haven't run such benchmarks yet, but it would be very interesting to see them.

FYI, our prefill/PP is relatively slow until we properly handle batch>1 with matmul, which is on the near-term roadmap. I'd expect our multithreaded decode/TG to be competitive :)

jan-wassenberg avatar Apr 12 '24 09:04 jan-wassenberg

I built llama.cpp `main` from Git HEAD via `cmake -DLLAMA_STATIC=ON -DCMAKE_BUILD_TYPE=Release .. && make -j main`. The resulting decode tok/sec for llama.cpp on a Zen4 server:

- 2.51B f16 (`gguf_gemma_fp16-unsloth.F16.gguf`): 41.6 with 16 threads
- 8.54B f32 (`7b_it_v1p1.gguf`): 5.1, 6.4, 7.04 (avg 6.2) with 16 threads (4.2 with 32 threads)

If we scale the f16 result of 41.6 by 2.51/8.54 (parameter count) and by 1/2 (f16 vs. f32 bytes per weight), that predicts 6.1 tok/sec, a bit short of the measured f32 6.2, so decode is not always memory bound. For gemma.cpp we get 25.9, 26.1, 26.3 (avg 26.1) tok/sec with our fp8 'SFP' weights, 80 threads, and a short factual query.

That's about 1.05x the f32 6.2 (optimistically) scaled 4x for the f32-to-fp8 size reduction (26.1 / 24.8). Thus gemma.cpp currently appears to have a slight advantage on Zen4 for short TG. We anticipate further optimizations; stay tuned.
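The scaling arithmetic above can be checked in a few lines. This is just a sketch of the reasoning in the comment (when decode is memory bound, throughput scales roughly inversely with total weight bytes), not anything from the gemma.cpp codebase:

```python
# Sketch: estimate memory-bound decode throughput by scaling a measured
# number for parameter count and bytes per weight. All inputs are the
# figures quoted in this thread.

def scale_toks(measured_tps, params_b_from, params_b_to, bytes_from, bytes_to):
    """Scale tok/sec assuming decode streams all weight bytes per token."""
    return measured_tps * (params_b_from / params_b_to) * (bytes_from / bytes_to)

# 2.51B f16 measured at 41.6 tok/s; predict the 8.54B f32 model.
predicted_f32 = scale_toks(41.6, 2.51, 8.54, 2, 4)
print(f"predicted 8.54B f32: {predicted_f32:.1f} tok/s")  # ~6.1, vs. 6.2 measured

# gemma.cpp fp8 (SFP) avg 26.1 tok/s vs. llama.cpp f32 at 6.2 tok/s,
# with the f32 number optimistically scaled 4x (4 bytes -> 1 byte).
ratio = 26.1 / (6.2 * 4)
# ~1.05x with the average run; the low run (25.9) gives ~1.04x.
print(f"gemma.cpp vs scaled llama.cpp: {ratio:.2f}x")
```

The exact ratio depends on which of the three gemma.cpp runs you pick, which is why "slight advantage" is the right takeaway rather than a precise speedup.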

jan-wassenberg avatar May 07 '24 15:05 jan-wassenberg

Thanks for the detailed response! It's very helpful to see the performance numbers for both llama.cpp and gemma.cpp on Zen4, and I'm looking forward to the future updates.

shiwenloong avatar May 16 '24 07:05 shiwenloong

Current benchmark results for gemma2-9b on 2xSKX with 36 threads and a 330-token prompt. Weights are 8-bit SFP for gemma.cpp and int8 for the others.

| engine    | prefill (tok/s) | decode (tok/s) |
|-----------|----------------:|---------------:|
| llamafile | 41.47           | 4.86           |
| gemma.cpp | 49.42           | 8.63           |
| llama.cpp | 35.07           | 5.61           |
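For a quick read of these results, the relative speedups work out as follows. This is a throwaway calculation on the numbers above, not part of the original post:

```python
# Relative throughput of gemma.cpp vs. the other engines, using the
# (prefill, decode) tok/s pairs from the benchmark above.
results = {
    "llamafile": (41.47, 4.86),
    "gemma.cpp": (49.42, 8.63),
    "llama.cpp": (35.07, 5.61),
}
g_prefill, g_decode = results["gemma.cpp"]
for name, (prefill, decode) in results.items():
    if name == "gemma.cpp":
        continue
    print(f"vs {name}: prefill {g_prefill / prefill:.2f}x, "
          f"decode {g_decode / decode:.2f}x")
# vs llamafile: prefill 1.19x, decode 1.78x
# vs llama.cpp: prefill 1.41x, decode 1.54x
```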

jan-wassenberg avatar Aug 16 '24 12:08 jan-wassenberg