
How can we make model calls faster

Open userandpass opened this issue 9 months ago • 1 comment

What is the issue?

I used Docker to run multiple Ollama containers and load-balance them with nginx, but this was much slower than calling the deployed model directly.
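
For what it's worth, a minimal timing sketch along the lines below could make the comparison concrete by sending the same request directly to one container and through the nginx front end. The hostnames, ports, and model name are placeholders, not details taken from this setup.

```python
import json
import time
import urllib.request

# Placeholder endpoints: one Ollama container reached directly, and the same
# pool reached through the nginx load balancer. Adjust to the actual setup.
DIRECT_URL = "http://ollama-1:11434/api/generate"
PROXIED_URL = "http://nginx-lb:80/api/generate"


def timed_generate(url: str, model: str, prompt: str) -> float:
    """Send one non-streaming generate request and return the wall-clock time."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.monotonic() - start


if __name__ == "__main__":
    prompt = "Why is the sky blue?"
    print("direct :", round(timed_generate(DIRECT_URL, "llama3", prompt), 2), "s")
    print("proxied:", round(timed_generate(PROXIED_URL, "llama3", prompt), 2), "s")
```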

OS

Linux

GPU

Nvidia

CPU

No response

Ollama version

0.1.34

userandpass — May 17 '24 08:05

After I added the "keep_alive": "24h" parameter, I ran nvidia-smi a while later and there was no ollama process on the GPU anymore, so I had to call the API again to get the model to show up.
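
For reference, keep_alive can be passed per request in the /api/generate body; a minimal sketch, assuming a local endpoint and a placeholder model name, is below. Note that with several containers behind a load balancer, a request's keep_alive only affects the instance that actually served it, so each container would need to be kept warm separately.

```python
import json
import urllib.request

# Placeholder endpoint and model name. keep_alive asks this Ollama instance to
# keep the model loaded in VRAM for 24 hours after the request completes.
url = "http://localhost:11434/api/generate"
body = json.dumps({
    "model": "llama3",
    "prompt": "Hello",
    "stream": False,
    "keep_alive": "24h",
}).encode()

req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```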

userandpass — May 17 '24 09:05

Looks like this issue slipped through the cracks.

I don't quite understand what problem you're having. It sounds like you're running multiple ollama containers and load-balancing them with nginx in front. When you say "much slower", are you talking about tokens per second, latency, throughput, or something else? I take it that Ollama itself is working properly, but you're having trouble setting up a load balancer in front of it without introducing lag?
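
One way to separate those metrics is a streaming request that records the time to the first token (latency) and derives tokens per second (throughput) from the eval_count and eval_duration fields Ollama includes in the final streamed object; a rough sketch with a placeholder host and model:

```python
import json
import time
import urllib.request

# Placeholder endpoint and model. Ollama streams newline-delimited JSON; the
# final object ("done": true) carries eval_count and eval_duration (nanoseconds).
url = "http://localhost:11434/api/generate"
body = json.dumps(
    {"model": "llama3", "prompt": "Why is the sky blue?", "stream": True}
).encode()
req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})

start = time.monotonic()
first_token_at = None
final = None
with urllib.request.urlopen(req) as resp:
    for line in resp:
        chunk = json.loads(line)
        if first_token_at is None and chunk.get("response"):
            first_token_at = time.monotonic() - start
        if chunk.get("done"):
            final = chunk

if first_token_at is not None:
    print(f"time to first token: {first_token_at:.2f} s")
if final and final.get("eval_duration"):
    tokens_per_second = final["eval_count"] / (final["eval_duration"] / 1e9)
    print(f"throughput: {tokens_per_second:.1f} tokens/s")
```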

dhiltgen — Oct 16 '24 16:10

Let's close the issue. We can reopen if it's still a problem.

pdevine — Jan 12 '25 00:01