llama.cpp
Misc. bug: llama-cli llama_backend_free may not free all the gpu memory
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
version: 0 (unknown)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
./bin/llama-cli -m qwen2.5-1.5b.gguf -f questions.txt
Problem description & steps to reproduce
Hello, I have compiled llama-cli and used it to run the qwen2.5-1.5b model. Before I ran llama-cli, my GPU memory usage looked like this:
While the model was running, the GPU memory usage looked like this:
However, after the program executes the following memory-freeing code:
common_sampler_free(smpl);
llama_backend_free();
ggml_threadpool_free_fn(threadpool);
ggml_threadpool_free_fn(threadpool_batch);
the GPU memory usage looks like this:
When llama-cli terminates, the GPU memory usage returns to its pre-execution state.
It seems that the GPU memory freeing code cannot free all of the GPU memory!
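A minimal sketch of how that usage could also be checked from inside the program rather than from a monitoring tool, assuming a CUDA build linked against `cudart`; `cudaMemGetInfo` is a standard CUDA runtime call, while the helper name `print_vram` is only illustrative:

```c
#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative helper: prints used/total device memory as seen by the CUDA runtime.
static void print_vram(const char * tag) {
    size_t free_b = 0, total_b = 0;
    if (cudaMemGetInfo(&free_b, &total_b) == cudaSuccess) {
        printf("%s: %zu MiB used / %zu MiB total\n", tag,
               (total_b - free_b) / (1024 * 1024), total_b / (1024 * 1024));
    }
}

// Usage around the teardown shown above:
//   print_vram("before free");
//   common_sampler_free(smpl);
//   llama_backend_free();
//   ...
//   print_vram("after free");
```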
First Bad Commit
I compiled llama-cli from the latest code.
Relevant log output
Some data loaded by the CUDA runtime, such as the kernels, may remain in memory. You can try the following:
- Build with `GGML_BACKEND_DL` enabled
- Use `ggml_backend_load` to load the CUDA backend before using llama.cpp
- When done, use `ggml_backend_unload` to unload the CUDA backend
That should free all resources allocated by the CUDA runtime, depending on how the driver handles this case.
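A minimal sketch of that sequence, assuming the project was configured with `-DGGML_BACKEND_DL=ON` and that the CUDA backend was built as a shared library named `libggml-cuda.so` (the exact name and path depend on your build layout):

```c
#include "ggml-backend.h"
#include "llama.h"

int main(void) {
    // Load the CUDA backend from its shared library before any llama.cpp use.
    // The library path is an assumption; adjust it to your build output.
    ggml_backend_reg_t cuda_reg = ggml_backend_load("./libggml-cuda.so");
    if (cuda_reg == NULL) {
        return 1; // backend library not found or failed to initialize
    }

    llama_backend_init();

    // ... load the model, create a context, run inference, free them ...

    llama_backend_free();

    // Unloading closes the shared library, which is what gives the CUDA
    // runtime a chance to release the memory it reserved for kernels, etc.
    ggml_backend_unload(cuda_reg);
    return 0;
}
```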
@slaren What if we use static libraries instead of shared/dynamic libs? How would that be handled? Btw, I experienced this on both the Vulkan and CUDA backends, on RTX 3090 24GB and A40 48GB cards.
I suppose you could try calling cudaDeviceReset yourself, but I am not sure if it will work.
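A rough sketch of that idea for a statically linked build; `cudaDeviceReset`, `cudaGetDeviceCount`, and `cudaSetDevice` are standard CUDA runtime calls, but as noted above there is no guarantee the driver actually returns all of the memory:

```c
#include <cuda_runtime.h>

// Sketch: after tearing down llama.cpp, destroy the primary CUDA context on
// every device so the runtime releases the state it owns (kernels, streams,
// allocations). Only safe once nothing else in the process still uses CUDA.
static void reset_cuda_devices(void) {
    int n_devices = 0;
    if (cudaGetDeviceCount(&n_devices) != cudaSuccess) {
        return;
    }
    for (int i = 0; i < n_devices; ++i) {
        cudaSetDevice(i);
        cudaDeviceReset();
    }
}
```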
This issue was closed because it has been inactive for 14 days since being marked as stale.