Ollama v0.1.18+ does not fully unload from GPU when idle
OS: Ubuntu 22.04
Environment: Docker/Nvidia container
Server: Dell PowerEdge R720
GPUs: Nvidia Tesla P40 24GB
GPU quantity: 2
Model: any (e.g. dolphin-mixtral:8x7b-v2.5-q6_K)
docker pull ollama/ollama:0.1.17
docker run -d --gpus=all -v ~/ollama:/root/.ollama -p 11434:11434 --name ollama17 ollama/ollama:0.1.17
docker exec -it ollama17 ollama run dolphin-mixtral:8x7b-v2.5-q6_K
Previous observation on Ollama v0.1.17: when a model is loaded, VRAM utilization is visible via nvidia-smi, and a pair of processes is also visible:
...p/gguf/build/cuda/bin/ollama-runner
Each process draws 50-150w per GPU while running inference, and 50-52w when idle with the model still loaded.
After a period of idle time, the model is unloaded. Both GPUs drop to 10-12w apiece with no visible process running.
docker pull ollama/ollama:0.1.18
docker run -d --gpus=all -v ~/ollama:/root/.ollama -p 11434:11434 --name ollama18 ollama/ollama:0.1.18
docker exec -it ollama18 ollama run dolphin-mixtral:8x7b-v2.5-q6_K
Observation on Ollama v0.1.18: when a model is loaded, VRAM utilization is visible via nvidia-smi, and a pair of processes is also visible, but under a different path:
/bin/ollama
Each process draws 50-150w per GPU while running inference, and 50-52w when idle with the model still loaded.
After a period of idle time, the model is unloaded, but the process is still running. Both GPUs pull the same wattage as when idle with the model loaded.
The server is powered on 24/7 and is tuned to pull 120w without the GPUs. Ollama is idle 95% of the time. Previously, under v0.1.17, the P40s added a combined 24w of additional power draw at idle. Now, with v0.1.18, the P40s add a combined 110w of additional power draw: an 86w difference.
Does /bin/ollama need to be running the entire time?
In 0.1.17 we leveraged a subprocess for the LLM runner accessing the GPU. After 5min of idle time, that subprocess was terminated, releasing all GPU allocations. In 0.1.18 we've transitioned to loading the LLM logic in-process, and while we're still unloading after 5min of idle, it looks like there's still some GPU memory allocation that isn't being freed up.
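To illustrate the difference with a minimal sketch (not our actual code, just the general shape of the problem):

// Minimal sketch of why the two designs behave differently at idle.
#include <cuda_runtime.h>

int main() {
    void *weights = nullptr;
    cudaMalloc(&weights, 1ull << 30);   // stands in for model weights / KV cache on the GPU

    // v0.1.17: this code ran in a separate ollama-runner process. Terminating that
    // process after 5 minutes of idle destroyed its CUDA context, so the driver
    // reclaimed every allocation whether or not it had been freed explicitly.

    // v0.1.18: the same logic runs inside the long-lived ollama process, so an
    // idle unload must free every allocation and library handle itself; anything
    // missed (a cuBLAS handle, a pooled buffer) stays resident and keeps the GPU
    // from dropping back to its low-power idle state.
    cudaFree(weights);
    return 0;
}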
In 0.1.17 we leveraged a subprocess for the LLM runner accessing the GPU. After 5min of idle time, that subprocess was terminated, releasing all GPU allocations. In 0.1.18 we've transitioned to loading the LLM logic in-process, and while we're still unloading after 5min of idle, it looks like there's still some GPU memory allocation that isn't being freed up
Yeah, I've noticed this: I can set num_gpu to a very tight value and it works fine when I load the model from a newly created Ollama instance (or one newly respawned after an OOM crash), but if I try to switch models then I get an OOM error. From looking at nvidia-smi, it's the wrapped llama.cpp server that isn't freeing all of its VRAM.
I tried adding a sleep after Ollama calls the "stop" command, and had a look to see if anything in the server.cpp code wasn't being called to free something, but no luck; for now I just have to accept an OOM crash when I change models.
Digging around a bit more, I believe this is the result of llama.cpp not completely freeing VRAM resources when the model is freed, e.g. https://github.com/ggerganov/llama.cpp/issues/3717
We'll take a look at it, and keep an eye on upstream as well.
Could be the cause of https://github.com/jmorganca/ollama/issues/1691
With a slight modification to server.cpp and ggml-cuda.cu, I was able to run the upstream server under the CUDA memory-leak checker and found 4 leaks.
compute-sanitizer --tool memcheck --leak-check full ./bin/server ...
========= Leaked 8,388,608 bytes at 0x7faf2c000000
========= Saved host backtrace up to driver entry point at allocation time
========= Host Frame: [0x2db39f]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame: [0xc33c3e]
========= in /usr/local/cuda/lib64/libcublas.so.12
========= Host Frame: [0xc00373]
========= in /usr/local/cuda/lib64/libcublas.so.12
========= Host Frame: [0xc422f5]
========= in /usr/local/cuda/lib64/libcublas.so.12
========= Host Frame: [0x8aa9bd]
========= in /usr/local/cuda/lib64/libcublas.so.12
========= Host Frame:cublasCreate_v2 [0x7f66f1]
========= in /usr/local/cuda/lib64/libcublas.so.12
========= Host Frame:ggml_init_cublas.part.0 in /home/daniel/code/llama.cpp/ggml-cuda.cu:8008 [0x199ee2]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:ggml_init in /home/daniel/code/llama.cpp/ggml.c:2428 [0x159070]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:llama_backend_init in /home/daniel/code/llama.cpp/llama.cpp:11191 [0xf1f8e]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:main in /home/daniel/code/llama.cpp/examples/server/server.cpp:2546 [0x25093]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:__libc_start_call_main in ../sysdeps/nptl/libc_start_call_main.h:58 [0x29d90]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:__libc_start_main in ../csu/libc-start.c:379 [0x29e40]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:_start [0x2e345]
========= in /home/daniel/code/llama.cpp/build/./bin/server
=========
========= Leaked 1,024 bytes at 0x7faf2dc00000
========= Saved host backtrace up to driver entry point at allocation time
========= Host Frame: [0x2db39f]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame: [0xc33c3e]
========= in /usr/local/cuda/lib64/libcublas.so.12
========= Host Frame: [0xc00373]
========= in /usr/local/cuda/lib64/libcublas.so.12
========= Host Frame: [0xc422f5]
========= in /usr/local/cuda/lib64/libcublas.so.12
========= Host Frame: [0x8aa9bd]
========= in /usr/local/cuda/lib64/libcublas.so.12
========= Host Frame: [0x8aa20b]
========= in /usr/local/cuda/lib64/libcublas.so.12
========= Host Frame:cublasCreate_v2 [0x7f66f1]
========= in /usr/local/cuda/lib64/libcublas.so.12
========= Host Frame:ggml_init_cublas.part.0 in /home/daniel/code/llama.cpp/ggml-cuda.cu:8008 [0x199ee2]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:ggml_init in /home/daniel/code/llama.cpp/ggml.c:2428 [0x159070]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:llama_backend_init in /home/daniel/code/llama.cpp/llama.cpp:11191 [0xf1f8e]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:main in /home/daniel/code/llama.cpp/examples/server/server.cpp:2546 [0x25093]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:__libc_start_call_main in ../sysdeps/nptl/libc_start_call_main.h:58 [0x29d90]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:__libc_start_main in ../csu/libc-start.c:379 [0x29e40]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:_start [0x2e345]
========= in /home/daniel/code/llama.cpp/build/./bin/server
=========
========= Leaked 131,072 bytes at 0x7faf2dc00400
========= Saved host backtrace up to driver entry point at allocation time
========= Host Frame: [0x2db39f]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame: [0xc33c3e]
========= in /usr/local/cuda/lib64/libcublas.so.12
========= Host Frame: [0xc00373]
========= in /usr/local/cuda/lib64/libcublas.so.12
========= Host Frame: [0xc422f5]
========= in /usr/local/cuda/lib64/libcublas.so.12
========= Host Frame: [0x8aa9bd]
========= in /usr/local/cuda/lib64/libcublas.so.12
========= Host Frame: [0x8aa22e]
========= in /usr/local/cuda/lib64/libcublas.so.12
========= Host Frame:cublasCreate_v2 [0x7f66f1]
========= in /usr/local/cuda/lib64/libcublas.so.12
========= Host Frame:ggml_init_cublas.part.0 in /home/daniel/code/llama.cpp/ggml-cuda.cu:8008 [0x199ee2]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:ggml_init in /home/daniel/code/llama.cpp/ggml.c:2428 [0x159070]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:llama_backend_init in /home/daniel/code/llama.cpp/llama.cpp:11191 [0xf1f8e]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:main in /home/daniel/code/llama.cpp/examples/server/server.cpp:2546 [0x25093]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:__libc_start_call_main in ../sysdeps/nptl/libc_start_call_main.h:58 [0x29d90]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:__libc_start_main in ../csu/libc-start.c:379 [0x29e40]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:_start [0x2e345]
========= in /home/daniel/code/llama.cpp/build/./bin/server
=========
========= Leaked 2,097,152 bytes at 0x4ea000000
========= Saved host backtrace up to driver entry point at allocation time
========= Host Frame: [0x2e90ad]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:ggml_cuda_pool_malloc_vmm(int, unsigned long, unsigned long*) in /home/daniel/code/llama.cpp/ggml-cuda.cu:7834 [0x1b2e12]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:ggml_cuda_op_mul_mat(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), bool) in /home/daniel/code/llama.cpp/ggml-cuda.cu:9398 [0x1b4004]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:ggml_cuda_compute_forward.part.0 in /home/daniel/code/llama.cpp/ggml-cuda.cu:10632 [0x19a3f5]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) in /home/daniel/code/llama.cpp/ggml-cuda.cu:11323 [0x19a862]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:ggml_backend_sched_graph_compute in /home/daniel/code/llama.cpp/ggml-backend.c:1583 [0x179330]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:llama_decode_internal(llama_context&, llama_batch) in /home/daniel/code/llama.cpp/llama.cpp:7722 [0xf8eed]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:llama_decode in /home/daniel/code/llama.cpp/llama.cpp:12287 [0xf9aa3]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:llama_init_from_gpt_params(gpt_params&) in /home/daniel/code/llama.cpp/common/common.cpp:1361 [0xd8e6d]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:llama_server_context::load_model(gpt_params const&) in /home/daniel/code/llama.cpp/examples/server/server.cpp:383 [0x8024d]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:main in /home/daniel/code/llama.cpp/examples/server/server.cpp:2669 [0x262d4]
========= in /home/daniel/code/llama.cpp/build/./bin/server
========= Host Frame:__libc_start_call_main in ../sysdeps/nptl/libc_start_call_main.h:58 [0x29d90]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:__libc_start_main in ../csu/libc-start.c:379 [0x29e40]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:_start [0x2e345]
========= in /home/daniel/code/llama.cpp/build/./bin/server
=========
========= LEAK SUMMARY: 10617856 bytes leaked in 4 allocations
========= ERROR SUMMARY: 4 errors
The first 3 are all from the same call site and the fix is pretty straightforward: we just need to add a call to cublasDestroy at shutdown of the server.
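For illustration, a sketch of what that could look like; the ggml_free_cublas name and the handle array below are assumptions for this example, not the actual patch:

// Hypothetical cleanup helper, mirroring the per-device handles that
// ggml_init_cublas() creates at startup; names are illustrative only.
#include <cublas_v2.h>

#define MAX_DEVICES 16
static cublasHandle_t g_cublas_handles[MAX_DEVICES] = {nullptr};

void ggml_free_cublas(int device_count) {
    for (int id = 0; id < device_count; ++id) {
        if (g_cublas_handles[id] != nullptr) {
            cublasDestroy(g_cublas_handles[id]);   // releases the workspace that cublasCreate allocated
            g_cublas_handles[id] = nullptr;
        }
    }
}

Something like this would then be called from server.cpp as the server shuts down, after the model itself has been freed.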
I haven't quite figured out the last one yet, though.
Hi @dhiltgen! I think this one might not be fully fixed as of version 0.1.27.
I am also running an Nvidia P40 on Linux and still see around 50w of GPU power draw and around 230 MB of GPU memory occupied after the chat session is stopped and the server is idle.
The only thing that helps fully unload the GPU is restarting the service manually by calling sudo service ollama restart.
Here is the nvidia-smi output after the session has been closed and the server was idle for a while (over 5 minutes):
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla P40 On | 00000000:01:00.0 Off | Off |
| N/A 57C P0 53W / 175W | 240MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 825 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 1670919 C /usr/local/bin/ollama 234MiB |
+---------------------------------------------------------------------------------------+
and here is another output after forcefully restarting the service:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla P40 On | 00000000:01:00.0 Off | Off |
| N/A 44C P8 10W / 175W | 4MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 825 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
OS: Debian 12
Environment: Bare metal
GPUs: 1x Nvidia Tesla P40 24GB
Other hardware: Intel 8th-gen B360 motherboard + i5-8600, 16GB DDR4
Model: any (e.g. miqu-1-70b.q2_K)
Please let me know if I can be of any help.
Due to the way llama.cpp allocates memory on VMM vs. non-VMM cards, the fix seems to only be complete on VMM cards. I've opened a new issue #2767 to track fixing this for non-VMM GPUs.
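Roughly, the difference looks like this; a simplified sketch with illustrative names and assumed behaviour based on the ggml-cuda pools at the time, not the actual code:

#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Non-VMM path: a look-aside cache of raw cudaMalloc buffers. Unless something
// walks this cache and frees it when the model is unloaded, the allocations stay
// pinned to the long-lived ollama process, which matches what the P40 users see.
struct LegacyCudaPool {
    std::vector<void*> cached;

    void *alloc(std::size_t size) {
        void *ptr = nullptr;
        cudaMalloc(&ptr, size);
        cached.push_back(ptr);
        return ptr;
    }

    void release_all() {            // must run at unload time for idle power to drop
        for (void *ptr : cached) {
            cudaFree(ptr);
        }
        cached.clear();
    }
};

// VMM path: the pool is instead backed by cuMemCreate/cuMemMap over a reserved
// virtual address range, and the earlier fix already unmaps and releases those
// handles at unload, which would explain why the fix appears complete on VMM cards.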
Actually, I don't think this fix is complete for VMM either. I tried this in Docker + Nvidia Container like the original poster did and I'm still getting idle GPU usage on 0.1.27:
docker pull ollama/ollama:0.1.27
docker run -d --gpus=all -v ~/ollama:/root/.ollama -p 11434:11434 --name ollama-27 ollama/ollama:0.1.27
docker exec -it ollama-27 ollama run dolphin-mixtral
When the model is loaded and active:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla P40 On | 00000000:01:00.0 Off | Off |
| N/A 48C P0 58W / 175W | 22688MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 828 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 177482 C /bin/ollama 22682MiB |
+---------------------------------------------------------------------------------------+
After 5 minutes of inactivity:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla P40 On | 00000000:01:00.0 Off | Off |
| N/A 54C P0 53W / 175W | 238MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 828 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 177482 C /bin/ollama 232MiB |
+---------------------------------------------------------------------------------------+
Meanwhile, the older v0.1.17 fully unloads when idle, consistent with @richginsberg's initial report.
This should be resolved by #3218
This should be resolved by #3218
Not fixed for me. Before updating, ollama didn't use any (significant, at least) memory on startup. Now, the instance mapped to my 1080 Ti (11 GiB) is using 136 MiB and the instances mapped to my 1070 Tis (8 GiB) are using 100 MiB each. This is before loading any models. Not too cool. Restarting flushes the memory, but it then refills.
This should be resolved by #3218
Just tested v0.1.30; the issue is still present.
Why wasn't this tested before release?
This should be resolved by #3218
Not fixed for me. Before updating, ollama didn't use any (significant, at least) memory on startup. Now, the instance mapped to my 1080 Ti (11 GiB) is using 136 MiB and the instances mapped to my 1070 Tis (8 GiB) are using 100 MiB each. This is before loading any models. Not too cool. Restarting flushes the memory, but it then refills.
To clarify, there is a new failure mode in v0.1.30. In previous versions, the ollama process would unload the model, and then the process persisted in GPU memory until the ollama process/docker container was restarted. In v0.1.30, ollama also loads itself into GPU memory immediately after a restart. This makes things more challenging, as in the case of a card like the Tesla P40 there is a continuous 50W draw from the card on an idle system.
Above I confirmed the issue persists in v0.1.30. To confirm it wasn't new in v0.1.30, I tried v0.1.29. Same issue.
docker run -d --gpus=all -v /home/username/ollama:/root/.ollama -p 11434:11434 --name ollama29 ollama/ollama:0.1.29
The only way for me to drop back to 10w per GPU is:
docker container stop xxxxxx
I'm drawing an additional 164w at idle on a quad Tesla P40 server. Reverting to ollama:0.1.17 for now.
I guess we'll all have to check the upcoming version 0.1.31, since the fix was only merged like an hour ago :)
I just tested with 0.1.31 and my P40s are still affected. With 0.1.17 the GPUs get released properly.
This really needs to get fixed. Currently, Ollama basically (please correct me if I'm wrong):
- runs at start without user's knowledge or permission
- is using GPU resources even when the user isn't interacting with it
- provides no obvious indication it's running and no obvious way of stopping it, temporarily or permanently (I don't consider interacting with systemctl obvious)
I don't mean to sound dramatic (or do I?), but doesn't this sound like malware as it currently exists? For a novice user, I'd argue it does. The Windows setup makes more sense to me, personally (taskbar icon).
Fixed for me in ollama/ollama:0.1.32. Thanks a bunch!