frob

My advice was based on that specific file. A different file will require different corrective measures.

I see this when the server is behind a proxy. The proxy disconnects connections that have been idle for a while (60 seconds in my case). I'm unable to influence...

> Don't split models at all unless you need to, ollama already does this.
> and when you do need to split, split in this order of cards: 3060 1,...

If the model is being distributed across multiple devices, ollama thinks it doesn't fit on one GPU. Look at the logs for lines with `source=sched.go`; they will show the decisions...

Note that if you have set `OLLAMA_SCHED_SPREAD=1`, ollama will always try to spread the model.
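
To make the decision concrete, here is a minimal Go sketch of the placement logic described above; the `gpu` type and `pickPlacement` function are hypothetical illustrations for this comment, not ollama's actual scheduler code:

```go
package main

import (
	"fmt"
	"os"
)

type gpu struct {
	id      int
	freeMiB uint64
}

// pickPlacement mirrors the behavior described above: prefer a single GPU
// that fits the model, spread only when nothing fits, and always spread
// when OLLAMA_SCHED_SPREAD=1 is set.
func pickPlacement(gpus []gpu, modelMiB uint64) []gpu {
	if os.Getenv("OLLAMA_SCHED_SPREAD") != "1" {
		for _, g := range gpus {
			if g.freeMiB >= modelMiB {
				return []gpu{g} // fits on one device, no split
			}
		}
	}
	return gpus // spread across everything available
}

func main() {
	gpus := []gpu{{0, 12000}, {1, 8000}}
	placement := pickPlacement(gpus, 10000)
	fmt.Printf("loading on %d device(s)\n", len(placement))
}
```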

ollama runs [needsReload](https://github.com/ollama/ollama/blob/e9e9bdb8d904f009e8b1e54af9f77624d481cfb2/server/sched.go#L574) before each request. It includes a check for changes to the llama server's parameters, so if the RPC backends change, that should trigger a model reload.
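
As a rough illustration of that kind of check, the sketch below compares the options a running llama server was started with against an incoming request; the `runnerOpts` struct and its fields are assumptions for the example, not the actual types in `server/sched.go`:

```go
package main

import (
	"fmt"
	"reflect"
)

// runnerOpts is a stand-in for the parameters a llama server is started with.
type runnerOpts struct {
	NumGPU      int
	RPCBackends []string
}

// needsReload returns true when the options the running server was started
// with differ from what the new request asks for.
func needsReload(current, requested runnerOpts) bool {
	return !reflect.DeepEqual(current, requested)
}

func main() {
	running := runnerOpts{NumGPU: 33, RPCBackends: []string{"10.0.0.2:50052"}}
	incoming := runnerOpts{NumGPU: 33, RPCBackends: []string{"10.0.0.3:50052"}}
	fmt.Println("reload needed:", needsReload(running, incoming)) // true: backends changed
}
```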

You can force ollama to use specific runners by setting `OLLAMA_LLM_LIBRARY` in the server environment, e.g. `OLLAMA_LLM_LIBRARY=cpu_avx2`.
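
Here is a minimal sketch of how such an override can short-circuit runner auto-detection; `availableRunners` and `detectBest` are hypothetical stand-ins for ollama's internal hardware probing, not its real API:

```go
package main

import (
	"fmt"
	"os"
)

var availableRunners = []string{"cuda_v12", "cpu_avx2", "cpu_avx", "cpu"}

// detectBest stands in for hardware probing; assume CUDA was found.
func detectBest() string {
	return "cuda_v12"
}

// pickRunner honors an explicit OLLAMA_LLM_LIBRARY override when it names
// a known runner, and falls back to auto-detection otherwise.
func pickRunner() string {
	if lib := os.Getenv("OLLAMA_LLM_LIBRARY"); lib != "" {
		for _, r := range availableRunners {
			if r == lib {
				return r // explicit override wins
			}
		}
	}
	return detectBest()
}

func main() {
	fmt.Println("using runner:", pickRunner())
}
```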

Not sure I understand the question. ollama starts a runner per model; the available hardware normally dictates which runner is used: if CUDA is available, the cuda runner is...

I understand: you want to maximize performance when ollama can't offload all layers to the GPU. I did some tests and I see what you mean; when the cuda runner...
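
If it helps to experiment with partial offload, one option is to pin the offloaded layer count per request via the `num_gpu` option on the REST API. This is a hedged example, with `llama3` as a placeholder model name and the layer count chosen arbitrarily:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Build a /api/generate request that offloads only 20 layers to the GPU,
	// keeping the rest on the CPU.
	body, _ := json.Marshal(map[string]any{
		"model":  "llama3",
		"prompt": "hello",
		"stream": false,
		"options": map[string]any{
			"num_gpu": 20,
		},
	})
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```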