                        Continuous batching support
Does Ollama support continuous batching for concurrent requests? I couldn't find anything in the documentation.
It doesn't.
llama.cpp (the engine underneath Ollama) does indeed support it. I'd also like Ollama to expose a configuration parameter for enabling continuous batching.
Ref: https://github.com/ggerganov/llama.cpp/discussions/3471
@trenta3, how do we turn it on in the llama.cpp case?
pass in the -cb flag when running the server
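For reference, a minimal sketch of what that can look like, launching the llama.cpp server with continuous batching enabled. The binary name, model path, and port are placeholders for your local build; the flags (`-cb`, `-np`, `-c`, `--port`) are taken from the llama.cpp server options:

```python
import subprocess

# Launch the llama.cpp HTTP server with continuous batching enabled.
# "./server" and the model path are placeholders for your local build/model.
subprocess.run([
    "./server",
    "-m", "models/llama-2-7b.Q4_K_M.gguf",  # placeholder model file
    "-c", "4096",        # context size
    "-np", "4",          # number of parallel slots (concurrent requests)
    "-cb",               # enable continuous batching
    "--port", "8080",
])
```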
Yes indeed. Does anyone know if there is a way in Ollama to pass options directly to the underlying llama.cpp?
The issue is less about passing the parameters down and more about ensuring that different connections on the Ollama side use different slots in llama.cpp.
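To illustrate the slot side of this, here is a rough sketch (not Ollama code) of several clients hitting the llama.cpp server's native /completion endpoint at once; with `-np 4 -cb` the server assigns each request to a free slot and batches their decoding together. The URL and prompts are placeholders:

```python
import concurrent.futures

import requests

LLAMA_SERVER = "http://localhost:8080"  # placeholder: llama.cpp server started with -cb -np 4

def complete(prompt: str) -> str:
    # /completion is the llama.cpp server's native completion endpoint.
    resp = requests.post(
        f"{LLAMA_SERVER}/completion",
        json={"prompt": prompt, "n_predict": 64},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["content"]

prompts = [f"Write a haiku about topic {i}." for i in range(8)]

# Each request lands in its own server slot; continuous batching lets the
# server interleave their token generation instead of serving them one by one.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for text in pool.map(complete, prompts):
        print(text.strip())
```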
Hey, just to start the conversation: how about adding a new endpoint to Ollama that can handle batching? After we see it's working well, we could make it part of the main generate endpoint.
For example, EricLLM uses a queue and an inference loop for batching (rough sketch of the pattern below). I think it's a good and simple way to do it. People could start using it, and if something comes up, we could still switch to a more sophisticated solution. I believe this would be a major feature for Ollama! EricLLM: https://github.com/epolewski/EricLLM/blob/main/ericLLM.py
What do you think about the approach?
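To make the idea concrete, here is a minimal sketch of the queue-plus-inference-loop pattern (my own simplification, not EricLLM's actual code): requests go onto a queue, a single loop drains whatever is waiting, runs it as one batch, and hands each result back to its caller. `run_batch` is a hypothetical stand-in for whatever backend performs the batched generation:

```python
import asyncio

MAX_BATCH_SIZE = 8

async def run_batch(prompts: list[str]) -> list[str]:
    # Hypothetical stand-in for a single batched call into the model backend.
    await asyncio.sleep(0.1)
    return [f"completion for: {p}" for p in prompts]

async def inference_loop(queue: asyncio.Queue) -> None:
    while True:
        # Wait for one request, then drain whatever else is already queued.
        batch = [await queue.get()]
        while len(batch) < MAX_BATCH_SIZE and not queue.empty():
            batch.append(queue.get_nowait())
        results = await run_batch([prompt for prompt, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)

async def generate(queue: asyncio.Queue, prompt: str) -> str:
    # What a batching endpoint handler would do: enqueue, then await the result.
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(inference_loop(queue))
    answers = await asyncio.gather(*(generate(queue, f"prompt {i}") for i in range(20)))
    print(f"{len(answers)} completions")
    worker.cancel()

asyncio.run(main())
```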
For me it would be great to switch on continuous batching via the command line or an env var.
Then I could use the existing OpenAI endpoints.
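For what that workflow looks like, a small sketch assuming Ollama's OpenAI-compatible API under /v1 is available on your version; the model name is a placeholder, and whether the requests actually run concurrently on the server is exactly what continuous batching would decide:

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

# Ollama's OpenAI-compatible API lives under /v1; the api_key is required
# by the client library but not checked by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="llama3",  # placeholder: any model you have pulled locally
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

questions = [f"In one sentence, what is {topic}?" for topic in ("batching", "a KV cache", "a GPU")]

with ThreadPoolExecutor(max_workers=len(questions)) as pool:
    for answer in pool.map(ask, questions):
        print(answer)
```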
Can anyone explain how this works with llama.cpp?
pass in the -cb flag when running the server
@9876691 follow this.
I would also be interested in this functionality.
This would be a great feature to have and would increase the utility of Ollama by an order of magnitude.
Any news on this one? Has anyone tried saturating the Ollama server?
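For anyone who wants to try, a rough sketch of saturating a local Ollama instance with concurrent requests and timing them, using Ollama's documented /api/generate endpoint; the model name and request count are placeholders. Without continuous batching you'd expect total time to grow roughly linearly with the number of requests:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"  # placeholder: any locally pulled model

def generate(prompt: str) -> str:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

prompts = [f"Count from 1 to 5. (request {i})" for i in range(16)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(generate, prompts))
elapsed = time.perf_counter() - start

print(f"{len(results)} responses in {elapsed:.1f}s "
      f"({elapsed / len(results):.1f}s per request on average)")
```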
Hi, are there any updates? Thanks!