
Being able to set the number of CPU cores/threads used for inference in Alpaca would increase speed enormously for many CPU-only users (no GPU, RAM only)

Stroemie opened this issue 9 months ago · 0 comments

Is it possible to set the number of CPU cores/threads for inference in Alpaca?

Situation: running Alpaca and Ollama on CPU only (12 cores, multi-threading disabled) and RAM (2 channels of DDR5-5600, 48 GB each), with no GPU, works fine.

However, if only 4 cores are used for model inference, the inference rate in tokens per second is much higher than with all 12 cores.

Here is an example chat question, with inference taking over 4 minutes with all 12 cores enabled in Alpaca:

[screenshot: Alpaca chat, inference took over 4 minutes]

and the same inference with only 4 cores enabled in follamac, done in 40 seconds: [screenshot]

The bottleneck here is the memory bandwidth and not the number of cores, I guess. When too many cores are active they cause traffic jams on the memory channels.
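Back-of-envelope (my numbers, not measured): two DDR5-5600 channels give roughly 2 × 5600 MT/s × 8 bytes ≈ 90 GB/s of peak bandwidth, and every generated token has to stream the active model weights from RAM, so a ~9 GB model tops out near 10 tokens per second no matter how many cores are running; cores beyond that point only add contention.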

Describe the solution you'd like
Add a PARAMETER num_thread line to the Alpaca Edit Instance window, in a similar way to the existing line for setting the model temperature.

Describe alternatives you've considered
The alternative is to use follamac, or Ollama/llama.cpp directly, and/or to edit model files, but I like to stay a dumb LLM user and enjoy the ease of use of Alpaca!
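For anyone who wants the model-file workaround in the meantime, the edit looks roughly like this (the parameter is documented in Ollama's Modelfile reference; the model name is just an example):

```
FROM llama3.1:8b
# use only 4 threads so the two memory channels are not oversubscribed
PARAMETER num_thread 4
```

followed by `ollama create llama3.1-4t -f Modelfile` (any name works) to register the variant.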

Alpaca has a nicer chat, more file upload possibilities and... ongoing improvements with every version!

Please keep up the good work!

Additional wishes

  1. It would be great to have a little "inference data" tag in the Alpaca chat window with each chat prompt/answer (similar to the "thoughts" tag) — see the sketch after this list — with:
  • the number of prompt tokens
  • number of evaluation tokens
  • inference speed in tokens per second
  • model context size used
  • etc.
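For what it's worth, Ollama's API already returns most of these numbers in the final chunk of each response, so Alpaca would mainly need to surface them. A minimal sketch, assuming a local Ollama on the default port and an example model (field names are from the Ollama API docs):

```python
import json
import urllib.request

# Build a non-streaming request; "num_thread" in options is the
# per-request equivalent of the PARAMETER num_thread line above.
payload = {
    "model": "llama3.1:8b",           # example model name
    "prompt": "Why is the sky blue?",
    "stream": False,
    "options": {"num_thread": 4},
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    stats = json.loads(resp.read())

# The final response carries the inference stats wished for above.
print("prompt tokens:    ", stats["prompt_eval_count"])
print("evaluation tokens:", stats["eval_count"])
# durations are reported in nanoseconds
print("speed (tok/s):    ", stats["eval_count"] / (stats["eval_duration"] / 1e9))
```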

Thank you for the nice app!

P.S. I also saw a similar request in #360.

Stroemie · Mar 12 '25, 16:03