
Ollama v0.1.18+ does not fully unload from GPU when idle

Open richginsberg opened this issue 5 months ago • 10 comments

OS: Ubuntu 22.04
Environment: Docker / NVIDIA container
Server: Dell PowerEdge R720
GPUs: Nvidia Tesla P40 24GB
GPU quantity: 2
Model: any (e.g. dolphin-mixtral:8x7b-v2.5-q6_K)

 docker pull ollama/ollama:0.1.17
 docker run -d --gpus=all -v ~/ollama:/root/.ollama -p 11434:11434 --name ollama17 ollama/ollama:0.1.17
 docker exec -it ollama17 ollama run dolphin-mixtral:8x7b-v2.5-q6_K

Previous observation on Ollama v0.1.17: when the model is loaded, VRAM utilization is visible via nvidia-smi and a pair of processes is also visible: ...p/gguf/build/cuda/bin/ollama-runner. Each process uses 50-150W per GPU while running inference, and 50-52W when idle with the model still loaded.

[screenshot: ollama-0.1.17, model loaded]

After a period of idle time, the model is unloaded. Both GPUs drop to 10-12W apiece with no visible process running.

[screenshot: ollama-0.1.17, model unloaded]

 docker pull ollama/ollama:0.1.18
 docker run -d --gpus=all -v ~/ollama:/root/.ollama -p 11434:11434 --name ollama18 ollama/ollama:0.1.18
 docker exec -it ollama18 ollama run dolphin-mixtral:8x7b-v2.5-q6_K

Observation on Ollama v0.1.18: when the model is loaded, VRAM utilization is visible via nvidia-smi and a pair of processes is also visible, but under a different path: /bin/ollama. Each process uses 50-150W per GPU while running inference, and 50-52W when idle with the model still loaded.

[screenshot: ollama-0.1.18, model loaded]

After a period of idle time, the model is unloaded, but the process is still running. Both GPUs draw the same wattage as when idle with the model loaded.

[screenshot: ollama-0.1.18, model unloaded]

The server is powered on 24/7 and tuned to pull 120W without GPUs. Ollama is idle 95% of the time. Under v0.1.17, the P40s added a combined 24W of additional power draw at idle. Now with v0.1.18, they add a combined 110W, a difference of 86W.

Does /bin/ollama need to be running the entire time?

richginsberg avatar Jan 08 '24 00:01 richginsberg

In 0.1.17 we leveraged a subprocess for the LLM runner accessing the GPU. After 5min of idle time, that subprocess was terminated, releasing all GPU allocations. In 0.1.18 we've transitioned to loading the LLM logic in-process, and while we're still unloading after 5min of idle, it looks like there's still some GPU memory allocation that isn't being freed up.
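
For intuition, here is a minimal C++ sketch of why the subprocess approach is airtight (illustrative only; the actual server is written in Go and the runner is built from llama.cpp): when the runner is a separate process, the OS and driver reclaim every allocation it held the moment it exits, regardless of what the runner's own cleanup code missed.

#include <sys/wait.h>
#include <unistd.h>
#include <signal.h>
#include <cstdio>

int main() {
    // Parent process stands in for the ollama server; the child stands in
    // for the GPU runner subprocess used in 0.1.17.
    pid_t runner = fork();
    if (runner == 0) {
        // Child: placeholder for the ollama-runner / llama.cpp server.
        execlp("sleep", "sleep", "3600", (char *)nullptr);
        _exit(127);
    }
    // Parent: pretend the model has sat idle past the 5-minute timeout...
    sleep(2);
    // ...then terminate the runner. Everything it allocated, on the GPU or
    // on the host, is released when the process dies.
    kill(runner, SIGTERM);
    waitpid(runner, nullptr, 0);
    printf("runner terminated; driver reclaims all of its GPU memory\n");
    return 0;
}

With the in-process design there is no such backstop, so any allocation the backend forgets to free stays resident in the long-lived ollama process.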

dhiltgen avatar Jan 08 '24 21:01 dhiltgen

In 0.1.17 we leveraged a subprocess for the LLM runner accessing the GPU. After 5min of idle time, that subprocess was terminated, releasing all GPU allocations. In 0.1.18 we've transitioned to loading the LLM logic in-process, and while we're still unloading after 5min of idle, it looks like there's still some GPU memory allocation that isn't being freed up

Yeah, I've noticed this: I can set num_gpu to a very tight value and it works fine when I load the model from a newly created Ollama instance (or one newly respawned after an OOM crash), but if I try to switch models I get an OOM error. From looking at nvidia-smi, it's the wrapped llama.cpp server that isn't freeing all its VRAM.

I tried adding a sleep after Ollama calls the "stop" command and had a look to see if anything in the server.cpp code wasn't being called to free something, but no luck, so for now I just have to accept an OOM crash when I change models.

jukofyork avatar Jan 08 '24 22:01 jukofyork

Digging around a bit more, I believe this is the result of llama.cpp not completely freeing VRAM resources when the model is unloaded, e.g. https://github.com/ggerganov/llama.cpp/issues/3717

We'll take a look at it, and keep an eye on upstream as well.

dhiltgen avatar Jan 09 '24 04:01 dhiltgen

Could be the cause of https://github.com/jmorganca/ollama/issues/1691

iplayfast avatar Jan 10 '24 05:01 iplayfast

With a slight modification to server.cpp and ggml-cuda.cu, I was able to get the upstream server to run under the CUDA memory leak checker (compute-sanitizer) and found 4 leaks.

compute-sanitizer --tool memcheck --leak-check full ./bin/server ...

========= Leaked 8,388,608 bytes at 0x7faf2c000000
=========     Saved host backtrace up to driver entry point at allocation time
=========     Host Frame: [0x2db39f]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0xc33c3e]
=========                in /usr/local/cuda/lib64/libcublas.so.12
=========     Host Frame: [0xc00373]
=========                in /usr/local/cuda/lib64/libcublas.so.12
=========     Host Frame: [0xc422f5]
=========                in /usr/local/cuda/lib64/libcublas.so.12
=========     Host Frame: [0x8aa9bd]
=========                in /usr/local/cuda/lib64/libcublas.so.12
=========     Host Frame:cublasCreate_v2 [0x7f66f1]
=========                in /usr/local/cuda/lib64/libcublas.so.12
=========     Host Frame:ggml_init_cublas.part.0 in /home/daniel/code/llama.cpp/ggml-cuda.cu:8008 [0x199ee2]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:ggml_init in /home/daniel/code/llama.cpp/ggml.c:2428 [0x159070]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:llama_backend_init in /home/daniel/code/llama.cpp/llama.cpp:11191 [0xf1f8e]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:main in /home/daniel/code/llama.cpp/examples/server/server.cpp:2546 [0x25093]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:__libc_start_call_main in ../sysdeps/nptl/libc_start_call_main.h:58 [0x29d90]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:__libc_start_main in ../csu/libc-start.c:379 [0x29e40]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0x2e345]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
========= 
========= Leaked 1,024 bytes at 0x7faf2dc00000
=========     Saved host backtrace up to driver entry point at allocation time
=========     Host Frame: [0x2db39f]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0xc33c3e]
=========                in /usr/local/cuda/lib64/libcublas.so.12
=========     Host Frame: [0xc00373]
=========                in /usr/local/cuda/lib64/libcublas.so.12
=========     Host Frame: [0xc422f5]
=========                in /usr/local/cuda/lib64/libcublas.so.12
=========     Host Frame: [0x8aa9bd]
=========                in /usr/local/cuda/lib64/libcublas.so.12
=========     Host Frame: [0x8aa20b]
=========                in /usr/local/cuda/lib64/libcublas.so.12
=========     Host Frame:cublasCreate_v2 [0x7f66f1]
=========                in /usr/local/cuda/lib64/libcublas.so.12
=========     Host Frame:ggml_init_cublas.part.0 in /home/daniel/code/llama.cpp/ggml-cuda.cu:8008 [0x199ee2]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:ggml_init in /home/daniel/code/llama.cpp/ggml.c:2428 [0x159070]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:llama_backend_init in /home/daniel/code/llama.cpp/llama.cpp:11191 [0xf1f8e]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:main in /home/daniel/code/llama.cpp/examples/server/server.cpp:2546 [0x25093]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:__libc_start_call_main in ../sysdeps/nptl/libc_start_call_main.h:58 [0x29d90]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:__libc_start_main in ../csu/libc-start.c:379 [0x29e40]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0x2e345]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
========= 
========= Leaked 131,072 bytes at 0x7faf2dc00400
=========     Saved host backtrace up to driver entry point at allocation time
=========     Host Frame: [0x2db39f]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0xc33c3e]
=========                in /usr/local/cuda/lib64/libcublas.so.12
=========     Host Frame: [0xc00373]
=========                in /usr/local/cuda/lib64/libcublas.so.12
=========     Host Frame: [0xc422f5]
=========                in /usr/local/cuda/lib64/libcublas.so.12
=========     Host Frame: [0x8aa9bd]
=========                in /usr/local/cuda/lib64/libcublas.so.12
=========     Host Frame: [0x8aa22e]
=========                in /usr/local/cuda/lib64/libcublas.so.12
=========     Host Frame:cublasCreate_v2 [0x7f66f1]
=========                in /usr/local/cuda/lib64/libcublas.so.12
=========     Host Frame:ggml_init_cublas.part.0 in /home/daniel/code/llama.cpp/ggml-cuda.cu:8008 [0x199ee2]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:ggml_init in /home/daniel/code/llama.cpp/ggml.c:2428 [0x159070]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:llama_backend_init in /home/daniel/code/llama.cpp/llama.cpp:11191 [0xf1f8e]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:main in /home/daniel/code/llama.cpp/examples/server/server.cpp:2546 [0x25093]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:__libc_start_call_main in ../sysdeps/nptl/libc_start_call_main.h:58 [0x29d90]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:__libc_start_main in ../csu/libc-start.c:379 [0x29e40]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0x2e345]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
========= 
========= Leaked 2,097,152 bytes at 0x4ea000000
=========     Saved host backtrace up to driver entry point at allocation time
=========     Host Frame: [0x2e90ad]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:ggml_cuda_pool_malloc_vmm(int, unsigned long, unsigned long*) in /home/daniel/code/llama.cpp/ggml-cuda.cu:7834 [0x1b2e12]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:ggml_cuda_op_mul_mat(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), bool) in /home/daniel/code/llama.cpp/ggml-cuda.cu:9398 [0x1b4004]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:ggml_cuda_compute_forward.part.0 in /home/daniel/code/llama.cpp/ggml-cuda.cu:10632 [0x19a3f5]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) in /home/daniel/code/llama.cpp/ggml-cuda.cu:11323 [0x19a862]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:ggml_backend_sched_graph_compute in /home/daniel/code/llama.cpp/ggml-backend.c:1583 [0x179330]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:llama_decode_internal(llama_context&, llama_batch) in /home/daniel/code/llama.cpp/llama.cpp:7722 [0xf8eed]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:llama_decode in /home/daniel/code/llama.cpp/llama.cpp:12287 [0xf9aa3]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:llama_init_from_gpt_params(gpt_params&) in /home/daniel/code/llama.cpp/common/common.cpp:1361 [0xd8e6d]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:llama_server_context::load_model(gpt_params const&) in /home/daniel/code/llama.cpp/examples/server/server.cpp:383 [0x8024d]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:main in /home/daniel/code/llama.cpp/examples/server/server.cpp:2669 [0x262d4]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
=========     Host Frame:__libc_start_call_main in ../sysdeps/nptl/libc_start_call_main.h:58 [0x29d90]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:__libc_start_main in ../csu/libc-start.c:379 [0x29e40]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0x2e345]
=========                in /home/daniel/code/llama.cpp/build/./bin/server
========= 
========= LEAK SUMMARY: 10617856 bytes leaked in 4 allocations
========= ERROR SUMMARY: 4 errors

The first 3 are all the same call site and the fix is pretty straightforward: we just need to add a call to cublasDestroy at shutdown of the server.
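
A minimal sketch of that kind of fix (not the actual llama.cpp patch; the function names here are placeholders): pair the cublasCreate done at backend init with a cublasDestroy at server shutdown, so the handle's device-side workspace is released and compute-sanitizer no longer reports it.

// Illustrative only. Build with: nvcc cublas_shutdown.cpp -o cublas_shutdown -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

static cublasHandle_t g_cublas_handle = nullptr;  // stand-in for the backend's per-device handle

void backend_init() {
    // Mirrors the cublasCreate_v2 call in the backtraces above; the handle
    // allocates device memory that is only returned by cublasDestroy.
    if (cublasCreate(&g_cublas_handle) != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasCreate failed\n");
    }
}

void backend_free() {
    if (g_cublas_handle != nullptr) {
        cublasDestroy(g_cublas_handle);  // frees the cuBLAS workspace allocations
        g_cublas_handle = nullptr;
    }
}

int main() {
    backend_init();
    // ... load the model and serve requests ...
    backend_free();  // without this call, the first three leaks above remain
    return 0;
}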

I haven't quite figured out the last one yet though.

dhiltgen avatar Feb 18 '24 01:02 dhiltgen

Hi @dhiltgen! I think this one might not be fully fixed as of version 0.1.27.

I am also running an Nvidia P40 on Linux and still see around 50W of power draw and around 230MB of GPU memory occupied after the chat session is stopped and the server is idle.

The only thing that helps fully unload the GPU is restarting the service manually by calling sudo service ollama restart.

Here is the nvidia-smi output after the session has been closed and the server has been idle for a while (over 5 minutes):

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      On  | 00000000:01:00.0 Off |                  Off |
| N/A   57C    P0              53W / 175W |    240MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       825      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A   1670919      C   /usr/local/bin/ollama                       234MiB |
+---------------------------------------------------------------------------------------+

and here is another output after forcefully restarting the service:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      On  | 00000000:01:00.0 Off |                  Off |
| N/A   44C    P8              10W / 175W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       825      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

OS: Debian 12
Environment: Bare metal
GPUs: 1x Nvidia Tesla P40 24GB
Other hardware: Intel 8th-gen B360 mobo + i5 8600, 16GB DDR4
Model: any (e.g. miqu-1-70b.q2_K)

Please let me know if I can be of any help

fomalsd avatar Feb 24 '24 16:02 fomalsd

Due to the way llama.cpp allocates memory on VMM vs. non-VMM cards, the fix seems to only be complete on VMM cards. I've opened a new issue #2767 to track fixing this for non-VMM GPUs.
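
For anyone curious which path their card takes, here is a small standalone driver-API query (an illustration, similar in spirit to the check the CUDA backend makes; assumes CUDA 11.3+ headers) that reports whether each GPU supports virtual memory management:

// Build with: g++ vmm_check.cpp -o vmm_check -lcuda
#include <cuda.h>
#include <cstdio>

int main() {
    if (cuInit(0) != CUDA_SUCCESS) {
        fprintf(stderr, "cuInit failed\n");
        return 1;
    }
    int count = 0;
    cuDeviceGetCount(&count);
    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        cuDeviceGet(&dev, i);
        char name[256] = {0};
        cuDeviceGetName(name, sizeof(name), dev);
        int vmm = 0;
        cuDeviceGetAttribute(&vmm, CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED, dev);
        printf("GPU %d (%s): VMM %s\n", i, name, vmm ? "supported" : "not supported");
    }
    return 0;
}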

dhiltgen avatar Feb 26 '24 16:02 dhiltgen

Actually, I don't think this fix is complete for VMM either. I tried this in Docker + Nvidia container like the original poster did and am still getting idle GPU usage on 0.1.27:

docker pull ollama/ollama:0.1.27
docker run -d --gpus=all -v ~/ollama:/root/.ollama -p 11434:11434 --name ollama-27 ollama/ollama:0.1.27
docker exec -it ollama-27 ollama run dolphin-mixtral

When the model is loaded and active:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      On  | 00000000:01:00.0 Off |                  Off |
| N/A   48C    P0              58W / 175W |  22688MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       828      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A    177482      C   /bin/ollama                               22682MiB |
+---------------------------------------------------------------------------------------+

After 5 minutes of inactivity:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      On  | 00000000:01:00.0 Off |                  Off |
| N/A   54C    P0              53W / 175W |    238MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       828      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A    177482      C   /bin/ollama                                 232MiB |
+---------------------------------------------------------------------------------------+

Meanwhile, the older v0.1.17 fully unloads when idle, consistent with @richginsberg's initial report.

fomalsd avatar Feb 29 '24 09:02 fomalsd

This should be resolved by #3218

dhiltgen avatar Mar 20 '24 16:03 dhiltgen

This should be resolved by #3218

Not fixed for me. Before updating, ollama didn't use any (significant, at least) memory on startup. Now, the instance mapped to my 1080 Ti (11 GiB) is using 136 MiB and the instances mapped to my 1070 Ti's (8 GiB) are using 100 MiB each. This is before loading any models. Not too cool. Restarting flushes the memory, but it then refills.

oldmanjk avatar Mar 30 '24 04:03 oldmanjk

This should be resolved by #3218

Just tested v0.1.30, the issue is still present.

[screenshot: ollama-p40-issues-2]

richginsberg avatar Apr 01 '24 19:04 richginsberg

Why wasn't this tested before release?

oldmanjk avatar Apr 02 '24 02:04 oldmanjk

This should be resolved by #3218

Not fixed for me. Before updating, ollama didn't use any (significant, at least) memory on startup. Now, the instance mapped to my 1080 Ti (11 GiB) is using 136 MiB and the instances mapped to my 1070 Ti's (8 GiB) are using 100 MiB each. This is before loading any models. Not too cool. Restarting flushes the memory, but it then refills.

To clarify, there is a new failure mode in v0.1.30. In previous versions, the ollama process would unload the model but then persist in GPU memory until the ollama process/Docker container was restarted. In v0.1.30, ollama also loads itself into GPU memory immediately after a restart. This makes things more challenging, as with a card like the Tesla P40 there is a continuous 50W draw on an idle system.
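
One plausible but unconfirmed explanation for the new baseline: merely holding a CUDA context on a device reserves a few hundred MiB of VRAM and keeps the card in a higher power state (P0 rather than P8 in the nvidia-smi outputs above), even before any model is loaded. A tiny standalone demo of the effect (illustrative only, not Ollama code):

// Build with: nvcc context_hold.cpp -o context_hold
#include <cuda_runtime.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // cudaFree(0) forces lazy context creation without allocating anything else.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA context created; watch nvidia-smi for this PID's footprint\n");
    sleep(300);  // hold the context so the memory and power impact is observable
    return 0;
}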

chereszabor avatar Apr 02 '24 13:04 chereszabor

Above I confirmed the issue persists in v0.1.30. To confirm it wasn't new in v0.1.30, I also tried v0.1.29. Same issue.

docker run -d --gpus=all -v /home/username/ollama:/root/.ollama -p 11434:11434 --name ollama29 ollama/ollama:0.1.29

[screenshot: 2024-04-02 at 1:51 PM]

The only way for me to drop back to 10W per GPU is: docker container stop xxxxxx

I'm drawing an additional 164W at idle on a quad Tesla P40 server. Reverting back to ollama:0.1.17 for now.

richginsberg avatar Apr 02 '24 17:04 richginsberg

I guess we'd all have to check the future version 0.1.31 since the fix was only merged like an hour ago :)

fomalsd avatar Apr 02 '24 18:04 fomalsd

I just tested with 0.1.31 and my P40s are still affected. With 0.1.17 the GPUs get released properly.

raldone01 avatar Apr 13 '24 10:04 raldone01

This really needs to get fixed. Currently, Ollama basically (please correct me if I'm wrong):

  • runs at start without user's knowledge or permission
  • is using GPU resources even when the user isn't interacting with it
  • provides no obvious indication it's running and no obvious way of stopping it, temporarily or permanently (I don't consider interacting with systemctl obvious)

I don't mean to sound dramatic (or do I?), but doesn't this sound like malware as it currently exists? For a novice user, I'd argue it is. The Windows setup (taskbar icon) makes more sense to me, personally.

oldmanjk avatar Apr 14 '24 03:04 oldmanjk

Fixed for me in ollama/ollama:0.1.32. Thanks a bunch!

fomalsd avatar Apr 15 '24 09:04 fomalsd