The prompt that's passed to the runner is missing the trailing newline that the model uses as a starting point in generating the response. I think it works in most...
Agreed, working on it.
You have flash attention enabled. ollama computes memory requirements, but it's llama.cpp that actually does the memory allocations. Flash attention uses VRAM more efficiently, so llama.cpp doesn't...
Yes, ollama will spill when it doesn't need to. Flash attention is a relatively recent addition to ollama and it doesn't work for some architectures (deepseek2), so it's not in...
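In case it's useful, flash attention is toggled with an environment variable on the serve side; a minimal sketch, assuming you start the server by hand:

```sh
# enable flash attention for this server instance (not on by default)
OLLAMA_FLASH_ATTENTION=1 ollama serve
```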
Note that the calculations that ollama does are only used to determine how many layers it asks llama.cpp to load into VRAM. You can override that with `num_gpu`. So if...
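As a sketch of the override, assuming the default local endpoint and a placeholder model name:

```sh
# ask llama.cpp to offload up to 999 layers (i.e. everything),
# regardless of ollama's own VRAM estimate
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "hello",
  "options": { "num_gpu": 999 }
}'
```

The same parameter can also be set interactively with `/set parameter num_gpu 999` or baked into a Modelfile with `PARAMETER num_gpu 999`.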
I think the only way to achieve multiple models with this overcommit method would be to run two servers, one loading the embedding model and the other loading the...
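A rough sketch of that setup, assuming the two instances are split across ports with `OLLAMA_HOST`:

```sh
# instance 1: keeps the embedding model loaded
OLLAMA_HOST=127.0.0.1:11434 ollama serve &
# instance 2: keeps the generation model loaded; point those clients at port 11435
OLLAMA_HOST=127.0.0.1:11435 ollama serve &
```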
Note that ollama (via `ollama ps`) is reporting GB (10^9 bytes) and nvtop is reporting GiB (1024^3 bytes). So in the no-FA case, ollama is fairly close (20.13 GiB = 20.13 *...
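For anyone checking the units, the conversion from nvtop's GiB to ollama's GB is just:

```sh
# 20.13 GiB expressed in GB (10^9 bytes); prints ~21.61
echo "20.13 * 1024^3 / 10^9" | bc -l
```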
"100%" means the model resides fully in VRAM, not that the VRAM is fully used. `nvidia-smi` will show how much memory the model is using in MiB.
The size of the KV allocation is proportional to the number of sequences that the model is asked to process, so the discrepancy will grow more or less linearly with...
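To make the scaling concrete, here's a back-of-the-envelope estimate with purely illustrative numbers (32 layers, 8 KV heads, head_dim 128, fp16 cache, 8192-token context per sequence):

```sh
# 2 (K and V) * layers * kv_heads * head_dim * 2 bytes * context * sequences, in GiB
echo "2 * 32 * 8 * 128 * 2 * 8192 * 4 / 1024^3" | bc -l   # 4 sequences -> 4 GiB
```

The sequence count here is what `OLLAMA_NUM_PARALLEL` controls, so the cache shrinks proportionally if you drop it to 1.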