
Ollama QOL settings

Open Robinsane opened this issue 2 years ago • 3 comments

Ollama settings: the ability to keep the LLM in memory for a longer time, plus the ability to run Ollama embeddings on another instance.

We've got a butter-smooth production setup right now by doing the following:

  1. Run the embeddings on a separate Ollama instance (Docker container). This avoids the wait while Ollama swaps the LLM out of (V)RAM for the embedding model and back.

  2. Explicitly state with each request that we want the model to stay in (V)RAM for another 6 hours. By default, Ollama unloads a model from (V)RAM after 5 minutes of inactivity, which caused long wait times to reload the LLM whenever it sat idle for more than 5 minutes (we're running a 20 GB quant at the moment).

(3. ingest_mode: pipeline)
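For illustration, here is a minimal sketch (not the PR's actual code) of how the `keep_alive` idea looks against Ollama's HTTP API. The endpoint and model name are assumptions; the `keep_alive` field itself is the real Ollama request parameter that overrides the default 5-minute unload timer:

```python
import json
import urllib.request

# Assumed default Ollama endpoint; adjust for a separate embedding instance.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, keep_alive: str = "6h") -> dict:
    # keep_alive tells Ollama how long to keep the model loaded in (V)RAM
    # after serving the request; Ollama's default is 5 minutes ("5m").
    return {"model": model, "prompt": prompt, "keep_alive": keep_alive}


def generate(model: str, prompt: str, keep_alive: str = "6h") -> bytes:
    # Fire the request at the Ollama server and return the raw response body.
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt, keep_alive)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Pointing two such clients at two separate Ollama containers (one for the LLM, one for embeddings) is what avoids the model-swap stalls described above.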

I hope this PR makes others as happy as I am right now ;)

Robinsane avatar Mar 27 '24 16:03 Robinsane

If anyone feels like fixing the failing mypy tests for private_gpt/components/llm/custom/ollama.py, feel free.

I got some errors trying to import things only needed for mypy annotations... I feel like it's not really necessary, since you can just look at the Ollama superclass to understand it all, right?

Robinsane avatar Mar 27 '24 16:03 Robinsane

Thanks for the suggestion @dbzoo

Extra: if the default keep_alive is left unchanged, I don't wrap it, leaving the requests exactly as they used to be :)

Robinsane avatar Mar 28 '24 07:03 Robinsane

> Thanks for the suggestion @dbzoo
>
> Extra: if the default keep_alive is left unchanged, I don't wrap it, leaving the requests exactly as they used to be :)

Thanks for taking that suggestion to heart. The code looks better for it, too. Nice job.

dbzoo avatar Mar 28 '24 12:03 dbzoo