ollama
Prevent offloading
The model offloads after 5 minutes on the API; it would be nice to be able to prevent this.
Maybe add options to pass startup parameters to llama.cpp, for example to prevent the model from unloading from memory.
What is the use case for this? Is it causing a problem? Preventing offloading seems to me to be not an optimal solution, as it could easily cause resource starvation and enable denial of service attacks. So, I recommend that the model timeout value be configurable, but with a reasonable maximum, something on the order of 15-30 minutes.
Supporting arguments:
- Some users only ever want to use one model; disabling offloading would avoid prompting lag by not having to reload the LLM after 5 minutes.
- The initial lag when prompting, while a large LLM loads, causes web UIs to time out, even when only ever using the same LLM.
- If GPUs are connected over USB, as in mining racks, loading a huge LLM onto multiple GPUs takes time even though inference is fast. If only one model is used, the initial delay is enormous.
An option to set the offloading timeout would be fine; setting it to a million years would also prevent offloading.
Crude workaround: force a request every 4 minutes so the model never unloads (uses httpie):

```shell
model=mixtral
sleepDuration=4m
nvidia-smi
counter=0
while true; do
  ((++counter)) &&
    echo "keep alive # $counter - $(date) ..." &&
    http --timeout=120 :11434/api/generate model=$model stream:=false \
      prompt='Only say "1". Do not explain. Do not comment. Short response only.' &&
    echo "sleeping $sleepDuration for $model (# $counter, ctrl+Z to stop) - $(date)" &&
    sleep $sleepDuration
done
```
... this keeps the context short (29) and the request quick; with the above prompt, the response for me is always " 1".
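If httpie is not available, the same keep-alive idea can be sketched with curl. This is an untested sketch; the model name and interval are placeholders, and the helper `build_payload` is something I made up for illustration (it does no JSON escaping of its inputs):

```shell
#!/bin/sh
# build_payload MODEL PROMPT -> JSON body for /api/generate (sketch only).
build_payload() {
  printf '{"model":"%s","stream":false,"prompt":"%s"}' "$1" "$2"
}

# Show the body that would be sent.
build_payload mixtral 'Only say 1.'
echo

# The actual loop, commented out so the snippet is safe to run as-is.
# Requires a running server on the default port; "sleep 4m" assumes GNU sleep.
# while true; do
#   curl -s --max-time 120 -d "$(build_payload mixtral 'Only say 1.')" \
#     http://localhost:11434/api/generate > /dev/null
#   sleep 4m
# done
```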
Why does it (have to?) offload automatically?
> What is the use case for this? Is it causing a problem? Preventing offloading seems to me to be not an optimal solution, as it could easily cause resource starvation and enable denial of service attacks. So, I recommend that the model timeout value be configurable, but with a reasonable maximum, something on the order of 15-30 minutes.
To prevent resource starvation or DoS, we could set an upper limit on resources, force an offload when it is reached, and log it carefully so we can tune it for a good fit.
While on the topic, a health endpoint that shows memory usage, time loaded, and status (loading, idle, processing) would be nice too :)
Going to close this since #2146 has merged. You will be able to set it in 0.1.23.
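For reference, a minimal sketch of using the merged option, assuming the generate endpoint accepts a `keep_alive` field (a duration string such as "30m", or a negative value to keep the model loaded indefinitely — check the API docs for the exact semantics):

```shell
#!/bin/sh
# Request body that sets how long the model stays loaded after the request.
# keep_alive value here ("30m") is an assumed example, not a recommendation.
payload='{"model":"mixtral","keep_alive":"30m"}'
echo "request body: $payload"

# Needs a running server on the default port, so left commented out:
# curl http://localhost:11434/api/generate -d "$payload"
```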