gemma.cpp
num_threads doesn't seem to have any effect
./gemma
--tokenizer tokenizer.spm
--compressed_weights 2b-it-sfp.sbs
--model 2b-it
--verbosity 2
--num_threads 2
Increasing num_threads like this doesn't improve speed. Is this expected?
Just to confirm: does your command line have backslashes at the line ends, so that all of the arguments are actually passed in?
Does the binary print "num_threads : 2"?
Even two threads should help, but it depends on the platform. Maybe one core is already enough to saturate memory bandwidth?
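One rough way to test the bandwidth hypothesis is a roofline-style estimate. If decoding is memory-bound, each generated token has to stream (roughly) all of the weights from DRAM once, so tokens/s cannot exceed bandwidth divided by the bytes read per token. This is a minimal sketch, assuming every weight is read exactly once per token and ignoring KV-cache and activation traffic. The function names are hypothetical, and the parameter count and bandwidth you plug in should come from your own model and measurements.

```cpp
#include <cassert>
#include <cstdio>

// Upper bound on decode speed when generation is memory-bound: every weight
// is assumed to be read from DRAM once per generated token. KV-cache and
// activation traffic are ignored, so the true ceiling is somewhat lower.
double MaxTokensPerSecond(double num_params, double bytes_per_weight,
                          double bandwidth_gb_per_s) {
  const double bytes_per_token = num_params * bytes_per_weight;
  assert(bytes_per_token > 0.0);
  return bandwidth_gb_per_s * 1e9 / bytes_per_token;
}

// Prints the ceiling for 8-bit and 16-bit weights. The inputs are
// placeholders: substitute the model's parameter count and the bandwidth
// measured on the device.
void PrintCeilings(double num_params, double bandwidth_gb_per_s) {
  std::printf("8-bit weights:  %.1f tokens/s max\n",
              MaxTokensPerSecond(num_params, 1.0, bandwidth_gb_per_s));
  std::printf("16-bit weights: %.1f tokens/s max\n",
              MaxTokensPerSecond(num_params, 2.0, bandwidth_gb_per_s));
}
```

With purely illustrative inputs, `PrintCeilings(2.5e9, 40.0)` reports ceilings of 16.0 and 8.0 tokens/s. If a single thread already gets close to the ceiling for your setup, extra threads have little room to help.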
Yes, I'm pretty sure gemma.cpp accepted the parameter. See the screenshot (I was using 6 threads this time).
You are probably right; it may be hitting the memory-bandwidth bottleneck even with 1 thread. I'm not sure how to check, though.
Btw, this runs on an Android phone.
Speed won't always increase monotonically with the number of threads; it can be quite system-dependent, so it takes a bit of experimentation. You might want to try 2b-it-sfp, which should be faster in general and may be less memory-bandwidth bound.
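Since the best thread count is system-dependent, one way to experiment is to time the same workload at each thread count and keep the fastest. This is a minimal sketch; `FastestThreadCount` and the `workload` callback are hypothetical names, not part of gemma.cpp. `workload` stands in for whatever you want to measure, for example a fixed generation run with a given thread count.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <limits>

// Runs workload(num_threads) for every thread count from 1 to max_threads,
// keeps the best (minimum) of `repeats` timings for each count, and returns
// the thread count with the lowest best time. Taking the minimum reduces
// noise from other processes and from CPU frequency changes.
int FastestThreadCount(const std::function<void(int)>& workload,
                       int max_threads, int repeats) {
  assert(max_threads >= 1 && repeats >= 1);
  int best_threads = 1;
  double best_seconds = std::numeric_limits<double>::infinity();
  for (int threads = 1; threads <= max_threads; ++threads) {
    double fastest = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
      const auto start = std::chrono::steady_clock::now();
      workload(threads);
      const std::chrono::duration<double> elapsed =
          std::chrono::steady_clock::now() - start;
      if (elapsed.count() < fastest) fastest = elapsed.count();
    }
    if (fastest < best_seconds) {
      best_seconds = fastest;
      best_threads = threads;
    }
  }
  return best_threads;
}
```

Printing the per-count timings as well (not just the winner) makes it easier to see where speed plateaus or drops, which is what you would expect once memory bandwidth is saturated.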
Neat to hear it's running on an Android phone! What model?
Running on a Xiaomi 14.
Good, so it's getting the argument value correctly. You can run STREAM to benchmark bandwidth, it also supports threading.
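If building STREAM for the phone is inconvenient, its "triad" kernel is easy to imitate. The sketch below is not STREAM itself: it is a hypothetical stand-in that uses std::thread instead of OpenMP and follows STREAM's byte-counting convention (two reads and one write of 8-byte doubles per element). Thread startup is included in the timing, so use arrays much larger than the caches.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// STREAM-style triad a[i] = b[i] + scalar * c[i], split into contiguous
// chunks across num_threads std::threads. Returns the best observed
// bandwidth in GB/s over `repeats` runs, counting 3 * 8 bytes per element.
double TriadBandwidthGBs(std::size_t n, int num_threads, int repeats) {
  assert(n > 0 && num_threads >= 1 && repeats >= 1);
  std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);
  const double scalar = 3.0;
  double best_seconds = 1e30;
  const std::size_t chunk = (n + num_threads - 1) / num_threads;
  for (int r = 0; r < repeats; ++r) {
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
      const std::size_t begin = std::min(n, t * chunk);
      const std::size_t end = std::min(n, begin + chunk);
      workers.emplace_back([&, begin, end] {
        for (std::size_t i = begin; i < end; ++i) a[i] = b[i] + scalar * c[i];
      });
    }
    for (auto& w : workers) w.join();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    best_seconds = std::min(best_seconds, elapsed.count());
  }
  // Read one result back so the stores cannot be treated as dead code.
  if (a[n - 1] != b[n - 1] + scalar * c[n - 1]) return 0.0;
  const double bytes = 3.0 * sizeof(double) * static_cast<double>(n);
  return bytes / best_seconds / 1e9;
}
```

For example, calling `TriadBandwidthGBs(20000000, threads, 5)` for threads = 1, 2, 4, ... uses about 480 MB of arrays. If the reported bandwidth barely rises beyond one thread, a single core is already close to saturating memory, which would explain why `--num_threads` makes little difference.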
+1 to the SFP suggestion.
Closing for now; if anything above isn't addressed, feel free to chime in. I also added a small note to the README FAQ "What are some easy ways to make the model run faster?" here: https://github.com/google/gemma.cpp?tab=readme-ov-file#troubleshooting-and-faqs