How can we make model calls faster?
What is the issue?
I used Docker to run multiple Ollama containers and load-balanced them behind nginx, but requests through this setup were much slower than calling a single deployed model directly.
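A minimal sketch for quantifying the slowdown is to time the same request against one container directly and against the nginx front end; the ports here (11434 direct, 8080 for nginx) and the model name are placeholders for this setup:

```python
import json
import time
import urllib.request

# Placeholder request body; "llama3" stands in for whatever model is deployed.
PROMPT = json.dumps({"model": "llama3", "prompt": "Say hello.", "stream": False}).encode()

def time_request(url: str) -> float:
    """Send one non-streaming generate request and return wall-clock seconds."""
    req = urllib.request.Request(
        url, data=PROMPT, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start

direct = time_request("http://localhost:11434/api/generate")   # one container
proxied = time_request("http://localhost:8080/api/generate")   # via nginx
print(f"direct: {direct:.2f}s  proxied: {proxied:.2f}s")
```

If the two numbers are close, the containers themselves are slow (e.g. several models competing for one GPU); if the proxied number is much larger, the problem is in the nginx path.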
OS
Linux
GPU
Nvidia
CPU
No response
Ollama version
0.1.34
After I added the "keep_alive": "24h" parameter, I ran nvidia-smi a while later and Ollama no longer appeared on the GPU; the model was only loaded again once I called the API.
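For reference, keep_alive is passed per request in the /api/generate body; a minimal sketch, with the model name as a placeholder:

```python
import json
import urllib.request

body = json.dumps({
    "model": "llama3",    # placeholder model name
    "prompt": "Warm-up request to load the model.",
    "stream": False,
    "keep_alive": "24h",  # ask the server to keep this model loaded for 24 hours
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["done"])
```

Note that keep_alive only applies to the server instance that handled the request. With several containers behind a load balancer, the warm-up call lands on whichever backend nginx picks, so the other containers stay cold, and a container restart unloads the model regardless. That alone could explain a model disappearing from nvidia-smi.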
Looks like this issue slipped through the cracks.
I don't quite understand what problem you're having. It sounds like you're running multiple ollama containers, and load balancing them with an nginx in front. When you say "much slower" are you talking about tokens per second, latency, throughput, something else? I think you're indicating that Ollama itself is working properly, but you're having trouble setting up a load balancer in front of it without introducing lag?
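To help narrow that down: the non-streaming /api/generate response includes eval_count and eval_duration (in nanoseconds), so generation throughput can be separated from any latency added in front of Ollama. A minimal sketch, again with placeholder host and model name:

```python
import json
import urllib.request

body = json.dumps({
    "model": "llama3",    # placeholder model name
    "prompt": "Count to ten.",
    "stream": False,
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    data = json.loads(resp.read())

# Tokens generated per second of model evaluation time.
tokens_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"{tokens_per_sec:.1f} tokens/s")
```

If tokens/s is the same direct and proxied but wall-clock time differs, the lag is in the load-balancing layer rather than in Ollama itself.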
Let's close the issue. We can reopen if it's still a problem.