llama-cpp-python
Models are not being properly unloaded and VRAM is not freed
Expected Behavior
From issue #302, I expected the model to be unloaded with the following function:
import gc
from llama_cpp.llama_cpp import llama_free_model  # low-level binding

def unload_model():
    global llm
    llama_free_model(llm)
    # Delete the model object
    del llm
    llm = None  # Ensure no reference remains
    # Explicitly invoke the garbage collector
    gc.collect()
    return {"message": "Model unloaded successfully"}
However, there are two problems here:
1 - Calling llama_free_model on the llm object (loaded the usual way) results in this:
Traceback (most recent call last):
File "/run/media/myserver/5dcc41df-7194-4e57-a28f-833dc5ce81bb/llamacpp/app.py", line 48, in <module>
llama_free_model(llm)
ctypes.ArgumentError: argument 1: TypeError: wrong type
'llm' is generated with this:
llm = Llama(
    model_path=model_path,
    chat_handler=chat_handler,
    n_gpu_layers=gpu_layers,
    n_ctx=n_ctx,
)
2 - Even after deleting the object, setting it to None, and invoking the garbage collector, the VRAM is still not freed. The VRAM only gets cleared after I kill the app along with all of its processes and threads.
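For reference, this is the kind of check I use to confirm from inside the running app that the memory has not been returned; the helper below is just a sketch of mine that shells out to nvidia-smi, it is not part of llama-cpp-python:
import subprocess

def gpu_memory_used_mib():
    # Used VRAM in MiB for each visible NVIDIA GPU (requires nvidia-smi on PATH).
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(line) for line in out.strip().splitlines()]

print(gpu_memory_used_mib())  # stays high even after del llm + gc.collect()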
Current Behavior
1 - llama_free_model does not work.
2 - Garbage collection does not free up VRAM.
Environment and Context
I tried this on both an Arch Linux setup with an RTX 3090 and a Windows laptop with an eGPU. This problem was consistent on those two different OSes and different hardware setups.
- Physical (or virtual) hardware: AMD Ryzen 7 2700 Eight-Core Processor, NVIDIA GeForce RTX 3090
- Operating System: Arch Linux 6.8.9-arch1-1, Windows 11
- Python 3.12.3
- GNU Make 4.4.1
- g++ (GCC) 13.2.1 20240417
Failure Information (for bugs)
Traceback (most recent call last):
File "/run/media/myserver/5dcc41df-7194-4e57-a28f-833dc5ce81bb/llamacpp/app.py", line 48, in <module>
llama_free_model(llm)
ctypes.ArgumentError: argument 1: TypeError: wrong type
Steps to Reproduce
- Perform a fresh install of llama-cpp-python with CUDA support
- Write a code snippet to load the model as usual
- Try to use llama_free_model to unload the model, or delete the model object and invoke garbage collection
- Make sure to keep the app running afterwards and check VRAM with nvidia-smi (a minimal sketch of these steps follows below)
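A minimal sketch of those steps, assuming a placeholder GGUF path and full GPU offload (adjust to your own model):
import gc
import time

from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.gguf",  # placeholder path
    n_gpu_layers=-1,                   # offload all layers so VRAM usage is visible
    n_ctx=2048,
)

# Attempted unload: drop the only reference and force a collection.
del llm
gc.collect()

# Keep the process alive; nvidia-smi in another terminal still shows
# the VRAM held by this process.
time.sleep(60)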
From what I can see, llama_free_model is expected to take a lower-level object instead of the Llama object. In Python, determining when the garbage collector actually deletes an object is not straightforward. Here is a workaround that forces the release of the loaded model:
from llama_cpp import Llama
llama_model = Llama(…)
# Explicitly delete the model's internal object
llama_model._model.__del__()
This approach has worked for me so far.
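For completeness, here is a sketch of how that workaround could be folded back into the unload_model function from the original report; note that _model is a private attribute, so this relies on implementation details that may change between versions:
import gc

def unload_model():
    global llm
    if llm is not None:
        # Free the underlying llama.cpp model via the private wrapper (implementation detail).
        llm._model.__del__()
        llm = None
    gc.collect()
    return {"message": "Model unloaded successfully"}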
In my experience, @jkawamoto's approach is a good one, because it frees RAM/CUDA/other memory even if the Llama object itself is stuck in memory.
I've tried calling del llama_model, but this is not guaranteed to actually call __del__ if there are still references to the object (and this can happen in several cases, for example from uncaught exceptions in interactive environments like JupyterLab).
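To illustrate that reference problem in isolation (the toy Model class below stands in for Llama and is not from the library): a saved exception traceback keeps the raising frame's locals alive, so del alone does not trigger __del__:
import gc
import sys

class Model:
    def __del__(self):
        print("Model.__del__ called")

def work():
    m = Model()
    raise RuntimeError("boom")

try:
    work()
except RuntimeError:
    saved = sys.exc_info()  # holding the traceback keeps work()'s locals alive

gc.collect()
print("model still alive while the traceback is referenced")
del saved      # dropping the traceback releases work()'s frame and frees the model
gc.collect()   # not strictly needed here; shown to mirror the original unload code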
Since calling a special method (__del__) of a private field is too ad hoc, I opened a PR #1513 that adds a close method to explicitly free the model.
I am running llama_model._model.__del__() per the above comment, and I am still seeing the process use CUDA RAM.
Has there been any movement on creating a proper close method?
The Llama class has a close method now, and the following code should free up RAM:
from llama_cpp import Llama
llama_model = Llama(…)
...
llama_model.close()
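As a follow-up, since close() exists, the standard-library contextlib.closing helper can be wrapped around the constructor so the model is released even if an exception is raised; the model path below is just a placeholder:
import contextlib

from llama_cpp import Llama

with contextlib.closing(Llama(model_path="./models/model.gguf")) as llm:
    out = llm("Q: Name a color. A:", max_tokens=8)
    print(out["choices"][0]["text"])
# llm.close() has run here, so the weights (and any VRAM) should be released.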
Thank you!!!