
Misc. bug: llama-cli llama_backend_free may not free all the gpu memory

Open GaoXiangYa opened this issue 9 months ago • 3 comments

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
version: 0 (unknown)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

./bin/llama-cli -m qwen2.5-1.5b.gguf -f questions.txt

Problem description & steps to reproduce

Hello, I compiled llama-cli and used it to run the qwen2.5-1.5b model. Before I ran llama-cli, my GPU memory usage looked like this:

[Image: GPU memory usage before running llama-cli]

While the model is running, the GPU memory usage looks like this:

[Image: GPU memory usage while the model is running]

However, after the program executes the following memory-freeing code:

common_sampler_free(smpl);
llama_backend_free();
ggml_threadpool_free_fn(threadpool);
ggml_threadpool_free_fn(threadpool_batch);

the GPU memory usage looks like this:

[Image: GPU memory usage after the cleanup code has run]

When llama-cli terminates, the GPU memory usage returns to its pre-execution state.

It seems that the memory-freeing code cannot free all of the GPU memory!

First Bad Commit

I compiled llama-cli from the latest code.

Relevant log output


GaoXiangYa · Feb 25 '25

Some data loaded by the CUDA runtime, such as the kernels, may remain in memory. You can try the following:

  • Build with GGML_BACKEND_DL enabled
  • Use ggml_backend_load to load the CUDA backend before using llama.cpp
  • When done, use ggml_backend_unload to unload the CUDA backend

That should free all resources allocated by the CUDA runtime, depending on how the driver handles this case.
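
For reference, here is a minimal sketch of that flow, assuming a build with -DGGML_BACKEND_DL=ON; the library filename and path are placeholders that depend on your build layout:

#include "ggml-backend.h"
#include "llama.h"
#include <cstdio>

int main() {
    // Load the dynamically built CUDA backend before doing any llama.cpp work.
    // "./libggml-cuda.so" is an assumed path; adjust it to your build output.
    ggml_backend_reg_t cuda_reg = ggml_backend_load("./libggml-cuda.so");
    if (cuda_reg == nullptr) {
        std::fprintf(stderr, "failed to load the CUDA backend\n");
        return 1;
    }

    // ... create the model and context, run inference, free them ...

    llama_backend_free();

    // Unloading the backend library gives the CUDA runtime a chance to release
    // its own allocations (kernels, context), subject to driver behavior.
    ggml_backend_unload(cuda_reg);
    return 0;
}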

slaren · Feb 25 '25

@slaren What if we use static libraries instead of shared/dynamic libs? How should that case be handled? By the way, I experienced this with both the Vulkan and CUDA backends, on RTX 3090 24GB and A40 48GB cards.

mtasic85 · Feb 26 '25

I suppose you could try calling cudaDeviceReset yourself, but I am not sure if it will work.
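
In case it helps, a minimal sketch of that approach: cudaDeviceReset and cudaGetErrorString are standard CUDA Runtime API calls, but calling them from application code after llama.cpp teardown is an assumption, and doing so resets the device for anything else using CUDA in the same process.

#include <cuda_runtime.h>
#include <cstdio>

// Call only after all llama.cpp/ggml cleanup (llama_backend_free, etc.) is done.
static void reset_cuda_device() {
    // Destroys the primary CUDA context on the current device so the runtime can
    // release what it still holds; whether the driver returns the memory to the
    // system immediately is up to the driver.
    cudaError_t err = cudaDeviceReset();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaDeviceReset failed: %s\n", cudaGetErrorString(err));
    }
}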

slaren · Feb 26 '25

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] · Apr 12 '25