Ollama QOL settings
Ollama settings: ability to keep the LLM in memory for a longer time + ability to run Ollama embeddings on a separate instance
We've got a butter-smooth production setup as of right now by doing the following (config sketches after the list):

1. Run the embeddings on a separate Ollama instance (Docker container). This avoids the wait incurred whenever Ollama swaps the LLM out of (V)RAM to load the embedding model and back again.
2. Explicitly state with each request that the model in use should stay in (V)RAM for another 6 hours. By default Ollama unloads a model after 5 minutes of inactivity, which caused long waits to reload the LLM after any pause longer than that (we're running a 20 GB quant at the moment).
3. `ingest_mode: pipeline`
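
For reference, here's a minimal sketch of how the two-instance setup could look in Docker Compose. Service names, host ports, and volumes are illustrative assumptions, not part of this PR:

```yaml
# docker-compose.yaml (sketch): two independent Ollama instances, so the LLM
# and the embedding model never compete for the same (V)RAM and never get
# swapped out for each other.
services:
  ollama-llm:
    image: ollama/ollama:latest        # serves the chat/completion model only
    ports:
      - "11434:11434"
    volumes:
      - ollama-llm:/root/.ollama
  ollama-embeddings:
    image: ollama/ollama:latest        # serves the embedding model only
    ports:
      - "11435:11434"                  # second instance exposed on a different host port
    volumes:
      - ollama-embeddings:/root/.ollama

volumes:
  ollama-llm:
  ollama-embeddings:
```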
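
And a sketch of the matching private-gpt settings, assuming key names along the lines of what this PR adds (`embedding_api_base`, `keep_alive`); check the diff for the exact keys and defaults:

```yaml
# settings-ollama.yaml (sketch): point embeddings at the second instance and
# ask Ollama to keep the loaded model resident for 6 hours instead of 5 minutes.
ollama:
  api_base: http://ollama-llm:11434                  # instance serving the LLM
  embedding_api_base: http://ollama-embeddings:11434 # separate instance serving embeddings
  keep_alive: 6h                                     # sent with each request so the model stays in (V)RAM

embedding:
  ingest_mode: pipeline                              # (3.) parallelize parsing and embedding during ingest
```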
I hope this PR can make others as happy as I am right now ;)