llama-cpp-python
Models are not being properly unloaded and VRAM is not freed
Expected Behavior
From issue #302, I expected the model to be unloaded with the following function:
import gc
from llama_cpp.llama_cpp import llama_free_model  # low-level binding

def unload_model():
    global llm
    llama_free_model(llm)
    # Delete the model object
    del llm
    llm = None  # Ensure no reference remains
    # Explicitly invoke the garbage collector
    gc.collect()
    return {"message": "Model unloaded successfully"}
However, there are two problems here:
1 - Calling llama_free_model on the llm object (loaded the usual way) results in this:
Traceback (most recent call last):
File "/run/media/myserver/5dcc41df-7194-4e57-a28f-833dc5ce81bb/llamacpp/app.py", line 48, in <module>
llama_free_model(llm)
ctypes.ArgumentError: argument 1: TypeError: wrong type
'llm' is generated with this:
llm = Llama(
    model_path=model_path,
    chat_handler=chat_handler,
    n_gpu_layers=gpu_layers,
    n_ctx=n_ctx,
)
2 - Even after deleting the object, setting it to None, and invoking the garbage collector, the VRAM is still not freed. The VRAM only gets cleared after I kill the app along with all of its processes and threads.
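For reference, this is the kind of check I use to confirm from inside the running app that the memory has not been returned; the helper below is just a sketch of mine that shells out to nvidia-smi, it is not part of llama-cpp-python:
import subprocess

def gpu_memory_used_mib():
    # Used VRAM in MiB for each visible NVIDIA GPU (requires nvidia-smi on PATH).
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(line) for line in out.strip().splitlines()]

print(gpu_memory_used_mib())  # stays high even after del llm + gc.collect()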
Current Behavior
1 - llama_free_model does not work.
2 - Garbage collection does not free up VRAM.
Environment and Context
I tried this on both an Arch Linux setup with an RTX 3090 and a Windows laptop with an eGPU. This problem was consistent on those two different OSes and different hardware setups.
- Physical (or virtual) hardware: AMD Ryzen 7 2700 Eight-Core Processor, NVIDIA GeForce RTX 3090
- Operating System: Arch Linux 6.8.9-arch1-1, Windows 11
- Python 3.12.3
- GNU Make 4.4.1
- g++ (GCC) 13.2.1 20240417
Failure Information (for bugs)
Traceback (most recent call last):
File "/run/media/myserver/5dcc41df-7194-4e57-a28f-833dc5ce81bb/llamacpp/app.py", line 48, in <module>
llama_free_model(llm)
ctypes.ArgumentError: argument 1: TypeError: wrong type
Steps to Reproduce
- Perform a fresh install of llama-cpp-python with CUDA support
- Write a code snippet to load the model as usual
- Try to use llama_free_model to unload the model, or delete the model object and invoke garbage collection
- Make sure to keep the app running afterwards and check VRAM with nvidia-smi (a minimal sketch of these steps follows below)
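A minimal sketch of those steps, assuming a placeholder GGUF path and full GPU offload (adjust to your own model):
import gc
import time

from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.gguf",  # placeholder path
    n_gpu_layers=-1,                   # offload all layers so VRAM usage is visible
    n_ctx=2048,
)

# Attempted unload: drop the only reference and force a collection.
del llm
gc.collect()

# Keep the process alive; nvidia-smi in another terminal still shows
# the VRAM held by this process.
time.sleep(60)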
From what I can see, llama_free_model is expected to take a lower-level object instead of the Llama object. In Python, determining when the garbage collector actually deletes an object is not straightforward. Here is a workaround that forces the release of the loaded model:
from llama_cpp import Llama
llama_model = Llama(…)
# Explicitly delete the model's internal object
llama_model._model.__del__()
This approach has worked for me so far.
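For completeness, here is a sketch of how that workaround could be folded back into the unload_model function from the original report; note that _model is a private attribute, so this relies on implementation details that may change between versions:
import gc

def unload_model():
    global llm
    if llm is not None:
        # Free the underlying llama.cpp model via the private wrapper (implementation detail).
        llm._model.__del__()
        llm = None
    gc.collect()
    return {"message": "Model unloaded successfully"}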
In my experience, @jkawamoto's approach is a good one, because it frees RAM/CUDA/other memory even if the Llama object itself is stuck in memory.
I've tried calling del llama_model, but this is not guaranteed to actually call __del__ if there are still references to the object (and this can happen in several cases, for example from uncaught exceptions in interactive environments like JupyterLab).
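To illustrate that reference problem in isolation (the toy Model class below stands in for Llama and is not from the library): a saved exception traceback keeps the raising frame's locals alive, so del alone does not trigger __del__:
import gc
import sys

class Model:
    def __del__(self):
        print("Model.__del__ called")

def work():
    m = Model()
    raise RuntimeError("boom")

try:
    work()
except RuntimeError:
    saved = sys.exc_info()  # holding the traceback keeps work()'s locals alive

gc.collect()
print("model still alive while the traceback is referenced")
del saved      # dropping the traceback releases work()'s frame and frees the model
gc.collect()   # not strictly needed here; shown to mirror the original unload code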
Since calling a special method (__del__) of a private field is too ad hoc, I opened a PR #1513 that adds a close method to explicitly free the model.
I am running llama_model._model.__del__() per the above comment, and I am still seeing the process use CUDA RAM.
Has there been any movement on creating a proper close method?
The Llama class has a close method now, and the following code should free up RAM:
from llama_cpp import Llama
llama_model = Llama(…)
...
llama_model.close()
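As a follow-up, since close() exists, the standard-library contextlib.closing helper can be wrapped around the constructor so the model is released even if an exception is raised; the model path below is just a placeholder:
import contextlib

from llama_cpp import Llama

with contextlib.closing(Llama(model_path="./models/model.gguf")) as llm:
    out = llm("Q: Name a color. A:", max_tokens=8)
    print(out["choices"][0]["text"])
# llm.close() has run here, so the weights (and any VRAM) should be released.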
Thank you!!!