Ollama QOL settings
Ollama settings: ability to keep the LLM in memory for longer + ability to run Ollama embeddings on a separate instance
We've got a butter-smooth production setup right now by doing the following things (see the sketch after this list):

1. Run the embedding on a separate Ollama instance (Docker container). This avoids the wait while Ollama swaps the LLM out of (V)RAM for the embedding model and back.
2. Explicitly state with each request that the used model should stay in (V)RAM for another 6 hours. By default Ollama unloads a model from (V)RAM after 5 minutes of inactivity, which caused long wait times to reload the LLM after > 5 minutes of idle (we're running a 20 GB quant at the moment).
3. `ingest_mode: pipeline`
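For illustration, here's a minimal sketch of the idea behind points 1 and 2, assuming a second Ollama container published on port 11435 (the ports and model names are placeholders, not part of this PR), using the plain Ollama REST API with its `keep_alive` field:

```python
import requests

# Hypothetical endpoints: the main Ollama instance serves the LLM, a second
# container (e.g. started with `docker run -p 11435:11434 ollama/ollama`)
# handles embeddings so they never evict the LLM from (V)RAM.
LLM_BASE = "http://localhost:11434"
EMBED_BASE = "http://localhost:11435"

def chat(prompt: str) -> str:
    # keep_alive="6h" asks Ollama to keep this model loaded for 6 hours
    # after the request instead of the default 5 minutes.
    resp = requests.post(
        f"{LLM_BASE}/api/generate",
        json={
            "model": "llama3",  # placeholder model name
            "prompt": prompt,
            "stream": False,
            "keep_alive": "6h",
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

def embed(text: str) -> list[float]:
    # Embeddings go to the second instance, so the LLM stays resident.
    resp = requests.post(
        f"{EMBED_BASE}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},  # placeholder model
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]
```

`keep_alive` also accepts a number of seconds, or a negative value to keep the model loaded indefinitely.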
I hope this PR can make others as happy as I am right now ;)
If anyone feels like fixing the failing mypy checks for `private_gpt/components/llm/custom/ollama.py`, feel free.
I got errors when trying to import things that are only needed for type annotations to satisfy mypy... I feel like it's not really necessary, since you can just look at the Ollama superclass to understand it all, right?
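In case it helps whoever picks this up: a minimal sketch of the usual way to import something only for annotations, using `typing.TYPE_CHECKING` (the import path and names here are placeholders, not the actual code in this PR):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by mypy / type checkers, never at runtime, so imports
    # needed purely for annotations don't cost anything when the app runs.
    from some_llm_library import Ollama  # hypothetical import path

def configure(llm: Ollama, keep_alive: str) -> None:
    # With `from __future__ import annotations` the annotation stays a string
    # at runtime, so the class only needs to exist for the type checker.
    ...
```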
Thanks for the suggestion @dbzoo
extra: If the default keep_alive is left unchanged, I don't wrap the request at all, so requests go out exactly as they used to :)
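For context, a minimal sketch of what I mean by that (names are illustrative, not the actual PR code), assuming Ollama's default of `"5m"`:

```python
DEFAULT_KEEP_ALIVE = "5m"  # Ollama's own default unload timeout

def build_request_payload(payload: dict, keep_alive: str) -> dict:
    # Only touch the payload when the user actually changed keep_alive;
    # otherwise the request is forwarded untouched, exactly as before.
    if keep_alive == DEFAULT_KEEP_ALIVE:
        return payload
    return {**payload, "keep_alive": keep_alive}
```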
Thanks for taking that suggestion to heart. The code looks better for it, too. Nice job.