Ollama QOL settings
Ollama settings: ability to keep the LLM in memory for longer + ability to run Ollama embeddings on a separate instance
We've got a butter-smooth production setup right now by doing the following things (see the sketch after this list):

1. Run the embedding on a separate Ollama instance (Docker container). This avoids the wait while Ollama swaps the LLM out of (V)RAM for the embedding model and back.
2. Explicitly state with each request that the used model should stay in (V)RAM for another 6 hours. By default Ollama unloads a model from (V)RAM after 5 minutes of inactivity, which caused long wait times to reload the LLM after > 5 minutes of idle (we're running a 20 GB quant at the moment).
3. `ingest_mode: pipeline`
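For illustration, here's a minimal sketch of the idea behind points 1 and 2, assuming a second Ollama container published on port 11435 (the ports and model names are placeholders, not part of this PR), using the plain Ollama REST API with its `keep_alive` field:

```python
import requests

# Hypothetical endpoints: the main Ollama instance serves the LLM, a second
# container (e.g. started with `docker run -p 11435:11434 ollama/ollama`)
# handles embeddings so they never evict the LLM from (V)RAM.
LLM_BASE = "http://localhost:11434"
EMBED_BASE = "http://localhost:11435"

def chat(prompt: str) -> str:
    # keep_alive="6h" asks Ollama to keep this model loaded for 6 hours
    # after the request instead of the default 5 minutes.
    resp = requests.post(
        f"{LLM_BASE}/api/generate",
        json={
            "model": "llama3",  # placeholder model name
            "prompt": prompt,
            "stream": False,
            "keep_alive": "6h",
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

def embed(text: str) -> list[float]:
    # Embeddings go to the second instance, so the LLM stays resident.
    resp = requests.post(
        f"{EMBED_BASE}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},  # placeholder model
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]
```

`keep_alive` also accepts a number of seconds, or a negative value to keep the model loaded indefinitely.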
I hope this PR can make others as happy as I am right now ;)
If anyone feels like fixing the failing mypy checks for `private_gpt/components/llm/custom/ollama.py`, feel free.
I got errors when trying to import things that are only needed for type annotations to satisfy mypy... I feel like it's not really necessary, since you can just look at the Ollama superclass to understand it all, right?
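In case it helps whoever picks this up: a minimal sketch of the usual way to import something only for annotations, using `typing.TYPE_CHECKING` (the import path and names here are placeholders, not the actual code in this PR):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by mypy / type checkers, never at runtime, so imports
    # needed purely for annotations don't cost anything when the app runs.
    from some_llm_library import Ollama  # hypothetical import path

def configure(llm: Ollama, keep_alive: str) -> None:
    # With `from __future__ import annotations` the annotation stays a string
    # at runtime, so the class only needs to exist for the type checker.
    ...
```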
Thanks for the suggestion @dbzoo
extra: If the default keep_alive is left unchanged, I don't wrap the request at all, so requests go out exactly as they used to :)
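For context, a minimal sketch of what I mean by that (names are illustrative, not the actual PR code), assuming Ollama's default of `"5m"`:

```python
DEFAULT_KEEP_ALIVE = "5m"  # Ollama's own default unload timeout

def build_request_payload(payload: dict, keep_alive: str) -> dict:
    # Only touch the payload when the user actually changed keep_alive;
    # otherwise the request is forwarded untouched, exactly as before.
    if keep_alive == DEFAULT_KEEP_ALIVE:
        return payload
    return {**payload, "keep_alive": keep_alive}
```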
Thanks for taking that suggestion to heart. The code looks better for it, too. Nice job.