The prompt that's passed to the runner is missing the trailing newline that the model uses as a starting point in generating the response. I think it works in most...
Agreed, working on it.
You have flash attention enabled. ollama computes memory requirements, but it's llama.cpp that actually does the memory allocations. Flash attention uses VRAM more efficiently, so llama.cpp doesn't...
Yes, ollama will spill when it doesn't need to. Flash attention is a relatively recent addition to ollama and it doesn't work for some architectures (deepseek2), so it's not in...
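In case it's useful, flash attention is toggled with an environment variable on the serve side; a minimal sketch, assuming you start the server by hand:

```sh
# enable flash attention for this server instance (not on by default)
OLLAMA_FLASH_ATTENTION=1 ollama serve
```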
Note that the calculations that ollama does are only used to determine how many layers it asks llama.cpp to load into VRAM. You can override that with `num_gpu`. So if...
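As a sketch of the override, assuming the default local endpoint and a placeholder model name:

```sh
# ask llama.cpp to offload up to 999 layers (i.e. everything),
# regardless of ollama's own VRAM estimate
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "hello",
  "options": { "num_gpu": 999 }
}'
```

The same parameter can also be set interactively with `/set parameter num_gpu 999` or baked into a Modelfile with `PARAMETER num_gpu 999`.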
I think the only way to achieve multiple models with this overcommit method would be to run two servers, one loading the embedding model and the other loading the...
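A rough sketch of that setup, assuming the two instances are split across ports with `OLLAMA_HOST`:

```sh
# instance 1: keeps the embedding model loaded
OLLAMA_HOST=127.0.0.1:11434 ollama serve &
# instance 2: keeps the generation model loaded; point those clients at port 11435
OLLAMA_HOST=127.0.0.1:11435 ollama serve &
```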
Note that ollama (via `ollama ps`) is reporting GB (10^9 bytes) and nvtop is reporting GiB (1024^3 bytes). So in the no-FA case, ollama is fairly close (20.13 GiB = 20.13 *...
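For anyone checking the units, the conversion from nvtop's GiB to ollama's GB is just:

```sh
# 20.13 GiB expressed in GB (10^9 bytes); prints ~21.61
echo "20.13 * 1024^3 / 10^9" | bc -l
```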
"100%" means the model resides fully in VRAM, not that the VRAM is fully used. `nvidia-smi` will show how much memory the model is using in MiB.
The size of the KV allocation is proportional to the number of sequences that the model is asked to process, so the discrepancy will grow more or less linearly with...
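To make the scaling concrete, here's a back-of-the-envelope estimate with purely illustrative numbers (32 layers, 8 KV heads, head_dim 128, fp16 cache, 8192-token context per sequence):

```sh
# 2 (K and V) * layers * kv_heads * head_dim * 2 bytes * context * sequences, in GiB
echo "2 * 32 * 8 * 128 * 2 * 8192 * 4 / 1024^3" | bc -l   # 4 sequences -> 4 GiB
```

The sequence count here is what `OLLAMA_NUM_PARALLEL` controls, so the cache shrinks proportionally if you drop it to 1.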