
Prevent offloading

Hansson0728 opened this issue 1 year ago

The model offloads after 5 minutes on the API; it would be nice to be able to prevent this.

Hansson0728 avatar Jan 11 '24 16:01 Hansson0728

Maybe add options for startup parameters for llama.cpp, for example to prevent the model from unloading from memory, and similar settings.

Hansson0728 avatar Jan 11 '24 17:01 Hansson0728

What is the use case for this? Is it causing a problem? Preventing offloading does not seem like an optimal solution to me, as it could easily cause resource starvation and enable denial-of-service attacks. So I recommend making the model timeout value configurable, but with a reasonable maximum, something on the order of 15-30 minutes.

jimscard avatar Jan 12 '24 05:01 jimscard
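
For context, a hypothetical sketch of what such a server-side configurable timeout could look like; newer Ollama releases later added an OLLAMA_KEEP_ALIVE environment variable along these lines:

# Assumes a newer Ollama release that honors OLLAMA_KEEP_ALIVE (it did not
# exist at the time of this comment); 30m matches the suggested maximum above.
OLLAMA_KEEP_ALIVE=30m ollama serve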

Supporting arguments:

  1. If you only ever want to use one model, disabling offloading prevents prompting lag, since the LLM does not have to be loaded again after 5 minutes.
  2. The initial lag when prompting, while a large LLM has to load, causes web UIs to time out, even when only ever using the same LLM.
  3. If GPUs are connected via USB, as in mining racks, loading huge LLMs onto multiple GPUs takes time, even though inference itself is fast. If only one model is used, the initial delay is enormous.

An option to set the offloading timeout would be fine; setting it to a million years would also prevent offloading.


Crude workaround: send a request every 4 minutes to force it not to unload:

# Send a tiny prompt every 4 minutes so the model stays loaded (uses HTTPie).
model=mixtral
sleepDuration=4m
nvidia-smi
counter=0
while true; do
  ((counter++))
  echo "keep alive # $counter - $(date) ..."
  http --timeout=120 :11434/api/generate \
    model=$model stream:=false \
    prompt='Only say "1". Do not explain. Do not comment. Short response only.'
  echo "sleeping $sleepDuration for $model (# $counter, ctrl+Z to stop) - $(date)"
  sleep $sleepDuration
done

... this keeps the context short (29) and the request quick; the response is always " 1" for me with the above prompt.
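
For reference, a curl equivalent of the same loop, in case HTTPie is not installed (a sketch against the standard /api/generate endpoint):

model=mixtral
while true; do
  # Minimal prompt keeps the request cheap; the response body is discarded.
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\": \"$model\", \"stream\": false, \"prompt\": \"Only say \\\"1\\\".\"}" \
    > /dev/null
  sleep 4m   # stay under the 5-minute unload timer
done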

BananaAcid avatar Jan 17 '24 12:01 BananaAcid

Why does it (have to?) offload automatically?

BananaAcid avatar Jan 18 '24 06:01 BananaAcid

> What is the use case for this? Is it causing a problem? Preventing offloading does not seem like an optimal solution to me, as it could easily cause resource starvation and enable denial-of-service attacks. So I recommend making the model timeout value configurable, but with a reasonable maximum, something on the order of 15-30 minutes.

To prevent resource starvation or DoS, we could set an upper limit on resources, force a reload when it is reached, and log it carefully so we can tune it for a good fit.

While on the topic, a health endpoint that shows memory usage, time loaded, and status (loading, idle, processing) would be nice too :)
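
For illustration, something like the following; the endpoint and field names here are hypothetical, not part of Ollama:

# Hypothetical /api/health endpoint, shown only to illustrate the idea.
curl http://localhost:11434/api/health
# e.g. { "model": "mixtral", "memory_bytes": 48000000000,
#        "loaded_at": "2024-01-19T22:10:00Z", "status": "idle" }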

Hansson0728 avatar Jan 19 '24 23:01 Hansson0728

Going to close this since #2146 has merged. You will be able to set it in 0.1.23.
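
For reference, a minimal sketch of setting this per request via the keep_alive field that #2146 adds to /api/generate (a duration string such as "10m", -1 to keep the model loaded indefinitely, 0 to unload it immediately):

# Keep mixtral resident indefinitely after this request.
curl http://localhost:11434/api/generate -d '{
  "model": "mixtral",
  "prompt": "Only say \"1\".",
  "stream": false,
  "keep_alive": -1
}'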

pdevine avatar Jan 28 '24 22:01 pdevine