
Continuous batching support

Open Huvinesh-Rajendran-12 opened this issue 1 year ago • 11 comments

Does Ollama support continuous batching for concurrent requests? I couldn't find anything in the documentation.

Huvinesh-Rajendran-12 · Dec 06 '23

It doesn't.

easp · Dec 06 '23

llama.cpp (the engine underlying Ollama) does indeed support it. I'd also like a configuration parameter in Ollama for enabling continuous batching.

Ref: https://github.com/ggerganov/llama.cpp/discussions/3471

trenta3 · Dec 07 '23

@trenta3, how do we turn it on in the llama.cpp case?

sodre · Dec 08 '23

pass in the -cb flag when running the server
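
For context, with the example server that ships with llama.cpp (the `./server` binary in builds from that era) this means launching it with something like `./server -m model.gguf -cb -np 4`, where -cb enables continuous batching and -np sets the number of parallel slots; flag names may have changed in newer builds, so check --help on yours.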

Huvinesh-Rajendran-12 · Dec 10 '23

Yes indeed. Does anyone know if there is a way in Ollama to pass options directly to the underlying llama.cpp?

trenta3 · Dec 14 '23

The issue is less about passing the parameters down and more about ensuring that different connections on the Ollama side use different llama.cpp slots.

sodre · Dec 19 '23

Hey, just to start the conversation: how about adding a new endpoint to Ollama that can handle batching? After we see it's working well, we could make it part of the main generate endpoint.

For example, EricLLM uses a queue and an inference loop for batching. I think it's a good and simple way to do it. People could start using it, and if something comes up, we could still switch to a more sophisticated solution later. I believe this would be a major feature for Ollama! EricLLM: https://github.com/epolewski/EricLLM/blob/main/ericLLM.py
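
To make that concrete, here is a minimal sketch of the queue-plus-inference-loop pattern. This is illustrative only: it is neither EricLLM's actual code nor Ollama's internals, and generate_batch() is a hypothetical placeholder for the real batched model call.

```python
# Requests are queued as they arrive; a single inference loop drains the queue
# and runs whatever is currently waiting as one batch.
import queue
import threading
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    done: threading.Event = field(default_factory=threading.Event)
    result: str = ""

pending: "queue.Queue[Request]" = queue.Queue()
MAX_BATCH = 8

def generate_batch(prompts):
    # Placeholder for the real batched model call (e.g. llama.cpp slots).
    return [f"<completion for: {p}>" for p in prompts]

def inference_loop():
    while True:
        batch = [pending.get()]                     # block until one request arrives
        while len(batch) < MAX_BATCH:
            try:
                batch.append(pending.get_nowait())  # grab whatever else is waiting
            except queue.Empty:
                break
        outputs = generate_batch([r.prompt for r in batch])
        for req, out in zip(batch, outputs):
            req.result = out
            req.done.set()                          # wake the caller that submitted it

def submit(prompt: str) -> str:
    # Called from each request handler; blocks until the loop has processed it.
    req = Request(prompt)
    pending.put(req)
    req.done.wait()
    return req.result

threading.Thread(target=inference_loop, daemon=True).start()
print(submit("hello"))
```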

What do you think about the approach?

jkuehn · Feb 27 '24

For me it would be great to turn on continuous batching via the command line or an env var.

Then I could use the existing OpenAI-compatible endpoints.

Can anyone explain how this works with llama.cpp?

9876691 · Feb 27 '24

pass in the -cb flag when running the server

@9876691 follow this.

Huvinesh-Rajendran-12 · Feb 28 '24

I would also be interested in this functionality.

dantheman0207 · Mar 20 '24

This would be a great feature to have and would increase the utility of Ollama by an order of magnitude.

MarcellM01 · Mar 24 '24

Any news on this one? Has anyone tried saturating the Ollama server with concurrent requests?
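
One quick way to check, assuming a local Ollama on the default port and an already-pulled model (the model tag below is just a placeholder): fire several identical requests concurrently and compare the wall-clock time against a single request.

```python
# Probe whether concurrent requests are served in parallel or queued serially.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:11434/api/generate"
MODEL = "llama2"          # placeholder: use any model tag you have pulled locally
N_CONCURRENT = 8

def one_request(i: int) -> float:
    payload = json.dumps({
        "model": MODEL,
        "prompt": f"Write one sentence about the number {i}.",
        "stream": False,
    }).encode()
    req = urllib.request.Request(URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    start = time.time()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.time() - start

start = time.time()
with ThreadPoolExecutor(max_workers=N_CONCURRENT) as pool:
    latencies = list(pool.map(one_request, range(N_CONCURRENT)))
total = time.time() - start

print(f"per-request latencies: {[round(t, 1) for t in latencies]} s")
print(f"wall clock for {N_CONCURRENT} concurrent requests: {total:.1f} s")
# If the wall clock is roughly N times a single-request latency, requests are
# being serialized; if it is close to a single-request latency, they are being
# batched or otherwise served in parallel.
```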

pawelgnatowski · Jul 02 '24

Hi, are there any updates? Thanks!

fzyzcjy · Aug 23 '24