
How can we make model calls faster

Open userandpass opened this issue 9 months ago • 1 comment

What is the issue?

I used Docker to run multiple Ollama containers and load-balance them with nginx, but this was much slower than calling the deployed model directly.
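
For what it's worth, a minimal timing sketch along the lines below could make the comparison concrete by sending the same request directly to one container and through the nginx front end. The hostnames, ports, and model name are placeholders, not details taken from this setup.

```python
import json
import time
import urllib.request

# Placeholder endpoints: one Ollama container reached directly, and the same
# pool reached through the nginx load balancer. Adjust to the actual setup.
DIRECT_URL = "http://ollama-1:11434/api/generate"
PROXIED_URL = "http://nginx-lb:80/api/generate"


def timed_generate(url: str, model: str, prompt: str) -> float:
    """Send one non-streaming generate request and return the wall-clock time."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.monotonic() - start


if __name__ == "__main__":
    prompt = "Why is the sky blue?"
    print("direct :", round(timed_generate(DIRECT_URL, "llama3", prompt), 2), "s")
    print("proxied:", round(timed_generate(PROXIED_URL, "llama3", prompt), 2), "s")
```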

OS

Linux

GPU

Nvidia

CPU

No response

Ollama version

0.1.34

userandpass — May 17 '24 08:05

After I added the "keep_alive": "24h" parameter, I ran nvidia-smi a while later and there was no ollama process on the GPU anymore, so I had to call the API again to get the model to show up.
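
For reference, keep_alive can be passed per request in the /api/generate body; a minimal sketch, assuming a local endpoint and a placeholder model name, is below. Note that with several containers behind a load balancer, a request's keep_alive only affects the instance that actually served it, so each container would need to be kept warm separately.

```python
import json
import urllib.request

# Placeholder endpoint and model name. keep_alive asks this Ollama instance to
# keep the model loaded in VRAM for 24 hours after the request completes.
url = "http://localhost:11434/api/generate"
body = json.dumps({
    "model": "llama3",
    "prompt": "Hello",
    "stream": False,
    "keep_alive": "24h",
}).encode()

req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```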

userandpass — May 17 '24 09:05

Looks like this issue slipped through the cracks.

I don't quite understand what problem you're having. It sounds like you're running multiple ollama containers and load-balancing them with nginx in front. When you say "much slower", are you talking about tokens per second, latency, throughput, or something else? I take it that Ollama itself is working properly, but you're having trouble setting up a load balancer in front of it without introducing lag?
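
One way to separate those metrics is a streaming request that records the time to the first token (latency) and derives tokens per second (throughput) from the eval_count and eval_duration fields Ollama includes in the final streamed object; a rough sketch with a placeholder host and model:

```python
import json
import time
import urllib.request

# Placeholder endpoint and model. Ollama streams newline-delimited JSON; the
# final object ("done": true) carries eval_count and eval_duration (nanoseconds).
url = "http://localhost:11434/api/generate"
body = json.dumps(
    {"model": "llama3", "prompt": "Why is the sky blue?", "stream": True}
).encode()
req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})

start = time.monotonic()
first_token_at = None
final = None
with urllib.request.urlopen(req) as resp:
    for line in resp:
        chunk = json.loads(line)
        if first_token_at is None and chunk.get("response"):
            first_token_at = time.monotonic() - start
        if chunk.get("done"):
            final = chunk

if first_token_at is not None:
    print(f"time to first token: {first_token_at:.2f} s")
if final and final.get("eval_duration"):
    tokens_per_second = final["eval_count"] / (final["eval_duration"] / 1e9)
    print(f"throughput: {tokens_per_second:.1f} tokens/s")
```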

dhiltgen — Oct 16 '24 16:10

Let's close the issue. We can reopen if it's still a problem.

pdevine — Jan 12 '25 00:01