gemma.cpp
Have there been any performance comparisons between gemma.cpp and llama.cpp?
llama.cpp is widely recognized for deploying LLMs, including Gemma. Have there been any performance comparisons between gemma.cpp and llama.cpp?
We view gemma.cpp as a platform for experimentation rather than a deployment-focused runtime, but it's still intended to perform reasonably well. We haven't run such benchmarks yet, but it would be very interesting to see.
FYI, our prefill (prompt processing, PP) is relatively slow until we properly handle batch > 1 with matmul, which is on the near-term roadmap. I'd expect our multithreaded decode (token generation, TG) to be competitive :)
I built llama.cpp's main from Git HEAD via cmake -DLLAMA_STATIC=ON -DCMAKE_BUILD_TYPE=Release .. && make -j main.
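For context, the decode numbers below came from running the resulting main binary; the exact invocation isn't given here, so the following is only an illustrative sketch. -m, -t, -p, and -n are standard llama.cpp main flags (model, threads, prompt, tokens to generate); the model path, prompt, and values are placeholders.

# Illustrative only; not the exact command used for the numbers below.
./main -m gguf_gemma_fp16-unsloth.F16.gguf -t 16 -p "short factual query" -n 128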
The resulting decode tok/sec for llama.cpp on a Zen4 server:
2.51B f16 gguf_gemma_fp16-unsloth.F16.gguf: 41.6 for 16 threads
8.54B f32 7b_it_v1p1.gguf: 5.1, 6.4, 7.04 (avg 6.2) for 16 threads (4.2 for 32 threads)
If we scale the f16 41.6 by 2.51/8.54/2 (the model-size ratio, then halved because f32 weights take twice the bytes of f16), that's 6.1 tok/sec, a bit short of the actual f32 result, and thus decode is not always memory bound. For gemma.cpp we get 25.9, 26.1, 26.3 (avg 26.1) tok/sec with our fp8 'SFP' weights, 80 threads, and a short factual query.
That's 1.04x vs the f32 6.2 optimistically scaled 4x (fp8 weights are a quarter the size of f32). Thus it appears that gemma.cpp currently has a slight advantage on Zen4 for short TG. We anticipate further optimizations, stay tuned.
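For reference, the gemma.cpp runs use its interactive binary; a minimal sketch of the command, assuming the flag names from the gemma.cpp README (--tokenizer, --weights, --model; older versions spell the weights flag --compressed_weights) and placeholder file names:

# Sketch only; flag spellings and file names vary across gemma.cpp versions.
./gemma --tokenizer tokenizer.spm --weights 7b-it-sfp.sbs --model 7b-it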
Thanks for the detailed response! It's very helpful to see the performance numbers for both llama.cpp and gemma.cpp on Zen4, and I'm looking forward to the future updates.
Current benchmark results for gemma2-9b on 2x SKX with 36 threads and a 330-token prompt. Weights are 8-bit SFP for gemma.cpp and int8 for the others.
llamafile prefill 41.47 tps, decode 4.86 tps
gemma.cpp prefill 49.42 tps, decode 8.63 tps
llama.cpp prefill 35.07 tps, decode 5.61 tps
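To reproduce a prefill/decode split like the one above with llama.cpp, llama-bench reports prompt-processing and text-generation throughput separately; a hedged example (the model file is a placeholder, and these are not necessarily the exact settings used above):

# -p = prompt tokens, -n = generated tokens, -t = threads; model path is a placeholder.
./llama-bench -m gemma-2-9b-it-Q8_0.gguf -p 330 -n 128 -t 36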