My advice was based on that specific file. A different file will require different corrective measures.
I see this when the server is behind a proxy. The proxy disconnects connections that have been idle for a while (60 seconds in my case). I'm unable to influence...
What is inconsistent?
> Don't split models at all unless you need to, ollama already does this.
> and when you do need to split, split in this order of cards: 3060 1,...
If the model is being distributed across multiple devices, ollama thinks it doesn't fit in one GPU. Look at the logs for lines with `source=sched.go`; they will show the decisions...
Note that if you have set `OLLAMA_SCHED_SPREAD=1`, ollama will always try to spread the model.
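For reference, a quick way to check this on a systemd-based Linux install (assuming the default `ollama` service unit; adjust for your setup):

```shell
# Ask the scheduler to spread the model across all GPUs
# (set this in the server's environment, then restart it):
OLLAMA_SCHED_SPREAD=1 ollama serve

# Inspect the scheduler's placement decisions in the server log:
journalctl -u ollama | grep 'source=sched.go'
```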
ollama runs [needsReload](https://github.com/ollama/ollama/blob/e9e9bdb8d904f009e8b1e54af9f77624d481cfb2/server/sched.go#L574) before each request. It includes a check for changes in the parameters passed to the llama server, so if the rpc backends change, that should cause a model reload.
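As an illustration, two requests that differ only in a llama-server parameter (here `num_gpu`; the model name is just an example) should cause a reload between them:

```shell
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "hi", "options": {"num_gpu": 20}}'

# Changing num_gpu between requests changes the llama server's
# parameters, so needsReload should trigger a reload here:
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "hi", "options": {"num_gpu": 40}}'
```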
You can force ollama to use specific runners by setting `OLLAMA_LLM_LIBRARY` in the server environment, e.g. `OLLAMA_LLM_LIBRARY=cpu_avx2`.
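For example (the available library names depend on your build; the server log should list them at startup):

```shell
# Force the CPU AVX2 runner even if a GPU is detected:
OLLAMA_LLM_LIBRARY=cpu_avx2 ollama serve
```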
Not sure I understand the question. ollama starts a runner per model; the hardware available normally dictates which runner is used - if CUDA is available, the cuda runner is...
I understand: you want to maximize performance when ollama can't offload all layers to the GPU. I did some tests and I see what you mean; when the cuda runner...
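In the meantime, a rough sketch of a workaround, assuming your build honours the standard `num_gpu` and `num_thread` request options (the values here are placeholders to tune for your hardware):

```shell
# Pin the number of layers offloaded to the GPU and the CPU
# threads used for the layers that stay on the CPU:
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "hi", "options": {"num_gpu": 25, "num_thread": 8}}'
```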