ollama
Prevent offloading
The model offloads after 5 minutes on the API; it would be nice to be able to prevent this.
Maybe add options to pass startup parameters to llama.cpp, for example to prevent the model from unloading from memory.
What is the use case for this? Is it causing a problem? Preventing offloading seems to me to be not an optimal solution, as it could easily cause resource starvation and enable denial of service attacks. So, I recommend that the model timeout value be configurable, but with a reasonable maximum, something on the order of 15-30 minutes.
Supporting arguments:
- Some users only ever want to use one model; disabling offloading would avoid prompting lag by not having to reload the LLM after 5 minutes.
- The initial lag when prompting, while a large LLM loads, causes web UIs to time out, even when only ever using the same LLM.
- If GPUs are connected over USB, as in mining racks, loading a huge LLM onto multiple GPUs takes time even though inference is fast. If only one model is used, the initial delay is enormous.
An option to set the offloading timeout would be fine; setting it to a million years would also prevent offloading.
Crude workaround: force a request every 4 minutes so the model never unloads (uses httpie):

```shell
model=mixtral
sleepDuration=4m
nvidia-smi
counter=0
while true; do
  ((++counter)) &&
    echo "keep alive # $counter - $(date) ..." &&
    http --timeout=120 :11434/api/generate model=$model stream:=false \
      prompt='Only say "1". Do not explain. Do not comment. Short response only.' &&
    echo "sleeping $sleepDuration for $model (# $counter, ctrl+Z to stop) - $(date)" &&
    sleep $sleepDuration
done
```
... this keeps the context short (29) and the request quick; with the above prompt, the response for me is always " 1".
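If httpie is not available, the same keep-alive idea can be sketched with curl. This is an untested sketch; the model name and interval are placeholders, and the helper `build_payload` is something I made up for illustration (it does no JSON escaping of its inputs):

```shell
#!/bin/sh
# build_payload MODEL PROMPT -> JSON body for /api/generate (sketch only).
build_payload() {
  printf '{"model":"%s","stream":false,"prompt":"%s"}' "$1" "$2"
}

# Show the body that would be sent.
build_payload mixtral 'Only say 1.'
echo

# The actual loop, commented out so the snippet is safe to run as-is.
# Requires a running server on the default port; "sleep 4m" assumes GNU sleep.
# while true; do
#   curl -s --max-time 120 -d "$(build_payload mixtral 'Only say 1.')" \
#     http://localhost:11434/api/generate > /dev/null
#   sleep 4m
# done
```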
Why does it (have to?) offload automatically?
> What is the use case for this? Is it causing a problem? Preventing offloading seems to me to be not an optimal solution, as it could easily cause resource starvation and enable denial of service attacks. So, I recommend that the model timeout value be configurable, but with a reasonable maximum, something on the order of 15-30 minutes.
To prevent resource starvation or DoS, we could set an upper limit on resources, force an offload when it is reached, and log it carefully so we can tune it for a good fit.
While on the topic, a health endpoint that shows memory usage, time loaded, and status (loading, idle, processing) would be nice too :)
Going to close this since #2146 has merged. You will be able to set it in 0.1.23.
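For reference, a minimal sketch of using the merged option, assuming the generate endpoint accepts a `keep_alive` field (a duration string such as "30m", or a negative value to keep the model loaded indefinitely — check the API docs for the exact semantics):

```shell
#!/bin/sh
# Request body that sets how long the model stays loaded after the request.
# keep_alive value here ("30m") is an assumed example, not a recommendation.
payload='{"model":"mixtral","keep_alive":"30m"}'
echo "request body: $payload"

# Needs a running server on the default port, so left commented out:
# curl http://localhost:11434/api/generate -d "$payload"
```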