GPU utilization will vary with the efficiency of the model and external factors like power usage, etc. I don't know how this applies to vGPUs, as they are a shared...
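As a quick sanity check on a bare-metal NVIDIA GPU (an assumption; vGPU reporting may differ or be restricted by the hypervisor), you can watch utilization while a request is running:

```
# Poll GPU utilization and memory once per second (NVIDIA driver tooling).
watch -n 1 nvidia-smi
```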
In the server log there will be lines like:

```
ollama | time=2024-08-14T22:59:28.178Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=14 layers.split="" memory.available="[11.6 GiB]" memory.required.full="55.9 GiB" memory.required.partial="11.4 GiB" memory.required.kv="640.0 MiB"...
```
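In the example above, only 14 of the model's 81 layers fit in the 11.6 GiB available, since the full model would need 55.9 GiB. If the log is long, grepping for the offload report is a quick way to find these lines (assuming you've captured the log to a file; `server.log` here is a placeholder for your path):

```
# Locate the memory/offload report in a captured server log.
grep "offload to cuda" server.log
```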
What's in the server logs when it fails?
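On a Linux install managed by systemd (an assumption about your setup; see the troubleshooting doc for other platforms), the logs can be pulled with:

```
# Dump the ollama service log on systemd-based Linux installs.
journalctl -u ollama --no-pager
```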
This is probably a transient issue; there have been reports that Cloudflare has been a bit flaky.
What are the results of the following commands: `which ollama`, `ldd ollama`, `strace ollama -v`?
What's the result of these commands: `command -v ollama`, `ldd /usr/local/bin/ollama`?
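For reference, here's an annotated sketch of what those checks probe (the `/usr/local/bin/ollama` path is an assumption based on the default Linux install location):

```
command -v ollama           # which ollama binary is first on PATH
ldd /usr/local/bin/ollama   # shared libraries the binary is linked against
```

Together these tell you whether the binary being run is the one you think it is, and whether it resolves its library dependencies (e.g. CUDA/ROCm runtimes) correctly.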
[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) would help in debugging. What's the value of `$Modelname`?
If you add `OLLAMA_DEBUG=1` to the server environment, the runner will print slot processing, which may give insight into what's causing the long processing times. Just to verify, the...
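One way to enable it for a foreground run (for a systemd service you'd set the variable in the unit's environment instead):

```
# Start the server with debug logging enabled.
OLLAMA_DEBUG=1 ollama serve
```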
[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.
It's quite possible that the difference in build environment has an effect. Note, however, that you are not comparing the same model: llama.cpp is using gemma-2-2b-it-Q4_K_M.gguf and ollama is...
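One way to confirm which quantization ollama actually loaded (the `gemma2:2b` tag below is an assumption about which model you pulled; substitute your own):

```
# Print model metadata, including architecture, parameter count, and quantization.
ollama show gemma2:2b
```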