                        Continuous batching support
Does Ollama support continuous batching for concurrent requests? I couldn't find anything in the documentation.
It doesn't.
llama.cpp (the engine underneath Ollama) does indeed support it. I'd also like Ollama to expose a configuration parameter for enabling continuous batching.
Ref: https://github.com/ggerganov/llama.cpp/discussions/3471
@trenta3, how do we turn it on in the llama.cpp case?
pass in the -cb flag when running the server
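For reference, a minimal sketch of what that can look like, launching the llama.cpp server with continuous batching enabled. The binary name, model path, and port are placeholders for your local build; the flags (`-cb`, `-np`, `-c`, `--port`) are taken from the llama.cpp server options:

```python
import subprocess

# Launch the llama.cpp HTTP server with continuous batching enabled.
# "./server" and the model path are placeholders for your local build/model.
subprocess.run([
    "./server",
    "-m", "models/llama-2-7b.Q4_K_M.gguf",  # placeholder model file
    "-c", "4096",        # context size
    "-np", "4",          # number of parallel slots (concurrent requests)
    "-cb",               # enable continuous batching
    "--port", "8080",
])
```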
Yes indeed. Does anyone know if there is a way in Ollama to pass options directly to the underlying llama.cpp?
The issue is less about passing the parameters down and more about ensuring that different connections on the Ollama side use different slots in llama.cpp.
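To illustrate the slot side of this, here is a rough sketch (not Ollama code) of several clients hitting the llama.cpp server's native /completion endpoint at once; with `-np 4 -cb` the server assigns each request to a free slot and batches their decoding together. The URL and prompts are placeholders:

```python
import concurrent.futures

import requests

LLAMA_SERVER = "http://localhost:8080"  # placeholder: llama.cpp server started with -cb -np 4

def complete(prompt: str) -> str:
    # /completion is the llama.cpp server's native completion endpoint.
    resp = requests.post(
        f"{LLAMA_SERVER}/completion",
        json={"prompt": prompt, "n_predict": 64},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["content"]

prompts = [f"Write a haiku about topic {i}." for i in range(8)]

# Each request lands in its own server slot; continuous batching lets the
# server interleave their token generation instead of serving them one by one.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for text in pool.map(complete, prompts):
        print(text.strip())
```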
Hey, just to start the conversation: how about adding a new endpoint to Ollama that can handle batching? After we see it's working well, we could make it part of the main generate endpoint.
For example, EricLLM uses a queue and an inference loop for batching (rough sketch of the pattern below). I think it's a good and simple way to do it. People could start using it, and if something comes up, we could still switch to a more sophisticated solution. I believe this would be a major feature for Ollama! EricLLM: https://github.com/epolewski/EricLLM/blob/main/ericLLM.py
What do you think about the approach?
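To make the idea concrete, here is a minimal sketch of the queue-plus-inference-loop pattern (my own simplification, not EricLLM's actual code): requests go onto a queue, a single loop drains whatever is waiting, runs it as one batch, and hands each result back to its caller. `run_batch` is a hypothetical stand-in for whatever backend performs the batched generation:

```python
import asyncio

MAX_BATCH_SIZE = 8

async def run_batch(prompts: list[str]) -> list[str]:
    # Hypothetical stand-in for a single batched call into the model backend.
    await asyncio.sleep(0.1)
    return [f"completion for: {p}" for p in prompts]

async def inference_loop(queue: asyncio.Queue) -> None:
    while True:
        # Wait for one request, then drain whatever else is already queued.
        batch = [await queue.get()]
        while len(batch) < MAX_BATCH_SIZE and not queue.empty():
            batch.append(queue.get_nowait())
        results = await run_batch([prompt for prompt, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)

async def generate(queue: asyncio.Queue, prompt: str) -> str:
    # What a batching endpoint handler would do: enqueue, then await the result.
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(inference_loop(queue))
    answers = await asyncio.gather(*(generate(queue, f"prompt {i}") for i in range(20)))
    print(f"{len(answers)} completions")
    worker.cancel()

asyncio.run(main())
```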
For me it would be great to switch on continuous batching via the command line or an env var.
Then I could use the existing OpenAI endpoints.
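For what that workflow looks like, a small sketch assuming Ollama's OpenAI-compatible API under /v1 is available on your version; the model name is a placeholder, and whether the requests actually run concurrently on the server is exactly what continuous batching would decide:

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

# Ollama's OpenAI-compatible API lives under /v1; the api_key is required
# by the client library but not checked by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="llama3",  # placeholder: any model you have pulled locally
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

questions = [f"In one sentence, what is {topic}?" for topic in ("batching", "a KV cache", "a GPU")]

with ThreadPoolExecutor(max_workers=len(questions)) as pool:
    for answer in pool.map(ask, questions):
        print(answer)
```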
Can anyone explain how this works with llama.cpp?
pass in the -cb flag when running the server
@9876691 follow this.
I would also be interested in this functionality.
This would be a great feature to have and would increase the utility of Ollama by an order of magnitude.
Any news on this one? Has anyone tried saturating the Ollama server?
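For anyone who wants to try, a rough sketch of saturating a local Ollama instance with concurrent requests and timing them, using Ollama's documented /api/generate endpoint; the model name and request count are placeholders. Without continuous batching you'd expect total time to grow roughly linearly with the number of requests:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"  # placeholder: any locally pulled model

def generate(prompt: str) -> str:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

prompts = [f"Count from 1 to 5. (request {i})" for i in range(16)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(generate, prompts))
elapsed = time.perf_counter() - start

print(f"{len(results)} responses in {elapsed:.1f}s "
      f"({elapsed / len(results):.1f}s per request on average)")
```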
Hi, are there any updates? Thanks!