
[BUG]: When using large models with the GPU the code crashes with cannot allocate kvcache

Open zsogitbe opened this issue 1 year ago • 13 comments

Description

Let us assume that we have two models, model A and model B. We load model A on the GPU and keep it active (not disposing it), then load model B on the GPU and keep it active as well. If we now try to run inference with model A, the library crashes, because model B has filled the rest of the GPU that model A is using, so model A cannot allocate memory for the KV cache (unless there is more than enough space for both models on one GPU).

Update: I have further investigated the issue and the library crashes even when using only one model which does not fully fit into GPU memory. The problem is the additional GPU memory allocation: when GPU memory is full, it crashes while trying to allocate the context during the first inference...

zsogitbe avatar May 28 '24 13:05 zsogitbe

What would be your expected behaviour here (given that it's not possible to keep both models loaded at once)?

martindevans avatar May 29 '24 12:05 martindevans

I would expect the model to keep its KV cache GPU memory and simply reset it, without needing to reallocate it. The model should not need to reallocate GPU memory after being loaded with a specific set of parameters (context size, ...), unless I am missing something.

The problem is that neither library (C++ or C#) is designed for more than one model; it is assumed that a model is loaded and then unloaded before a new model is loaded.
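For example, llama.cpp has llama_kv_cache_clear, which resets an existing cache in place. Something like the sketch below is what I would expect to be possible. This is only a sketch and assumes LLamaSharp exposes that binding; the exact name and location may differ between versions.

// Sketch only: reuse the context's existing KV cache instead of reallocating it.
// Assumes LLamaSharp exposes llama.cpp's llama_kv_cache_clear; name/placement may differ.
using LLama;
using LLama.Common;
using LLama.Native;

var parameters = new ModelParams("path-to-model") { ContextSize = 4096, GpuLayerCount = -1 };
using var weights = await LLamaWeights.LoadFromFileAsync(parameters);
using var context = weights.CreateContext(parameters);   // KV cache allocated once, here

// ... run inference with an executor bound to `context` ...

// Before the next run, clear the cache in place rather than paying a second allocation:
NativeApi.llama_kv_cache_clear(context.NativeHandle);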

zsogitbe avatar May 29 '24 13:05 zsogitbe

Are you keeping the context loaded in memory, or just the model weights (LLamaContext vs LLamaWeights)?

As far as I know, if you keep the context around and don't dispose it, you should be able to use it at the same time as another context for another model. Most things are allocated up front when the context is first created, so I wouldn't expect that to fail (assuming you have sufficient GPU memory, of course).

martindevans avatar May 29 '24 13:05 martindevans

Everything is kept after loading the model. The KV cache is being allocated every time I do inference. This is the problem.

You can test it by using two models which together do not fit in your GPU. Model A will fit, and then model B will only partially fit (this is the usual situation for most people). After loading model B, model A cannot do inference anymore, because it crashes when it tries to allocate the KV cache.

zsogitbe avatar May 29 '24 18:05 zsogitbe

My understanding is that the KV cache is allocated when the LLamaContext is created and never grows (i.e. nothing gets re-allocated later).

Can you show some code that demonstrates the issue, along with the exact error you get?

martindevans avatar May 29 '24 20:05 martindevans

I have created a test program and could narrow the problem down further. The crash occurs when you load model A, do not use it immediately, and then load model B. When you then try to use model A for the first time after loading model B, it always allocates extra GPU memory (KV cache), and because there is no GPU memory left on the GPU it is placed on, the code crashes.

In release mode you will see:

MODEL B:
=======
Write a short fairy tail with 50 words by starting with the following:
"Once upon a time many years ago there was a little girl"

Once upon a time many years ago there was a little girl named Rose who lived in a quaint cottage surrounded by 
enchanted woods. She longed to explore but her mother warned of dangers lurking within. 
One day, a kindly old fairy appeared and gifted Rose with magical boots that allowed safe passage through the forest.

The end.
>
MODEL A:
=======
Write a short fairy tail with 50 words by starting with the following:
"Once upon a time many years ago there was a little girl"

CUDA error: out of memory
  current device: 0, in function alloc at llama.cpp\ggml-cuda.cu:320
  cuMemSetAccess(pool_addr + pool_size, reserve_size, &access, 1)
GGML_ASSERT: llama.cpp\ggml-cuda.cu:60: !"CUDA error"

So, one workaround could be to do a dummy inference with model A to let the library allocate all the memory it needs. This could also be done inside the library, so that users get a "fully loaded" model.

In summary, the library always allocates extra GPU memory when the model is used for the first time (inference), so not all GPU memory is allocated when you load the weights and create the context.
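A minimal sketch of that workaround, using only calls already shown in this thread (the one-token warm-up prompt and MaxTokens value are arbitrary choices):

// Sketch of the "dummy inference" workaround: force the remaining GPU allocations
// right after the executor is created, so an out-of-memory failure shows up at load
// time rather than at the first real request.
// ILLamaExecutor lives in LLama.Abstractions, InferenceParams in LLama.Common.
static async Task WarmUpAsync(LLama.Abstractions.ILLamaExecutor executor)
{
    var warmUpParams = new LLama.Common.InferenceParams { MaxTokens = 1 };
    await foreach (var _ in executor.InferAsync(" ", warmUpParams))
    {
        // Discard the output; we only want the allocations to happen.
    }
}

Calling await WarmUpAsync(...) right after creating each executor would move any failure to load time instead of the first user request.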

zsogitbe avatar May 30 '24 13:05 zsogitbe

Can you show the test program?

At the moment it sounds to me like it's working exactly as expected: loading the model (i.e. LLamaWeights) allocates some GPU memory, and creating a context (usually done automatically when an executor is created) allocates some more GPU memory.

martindevans avatar May 30 '24 13:05 martindevans

No, you forgot to mention the last step, when it allocates GPU memory once more: when the model is used for the first time. This is not as expected. What is expected is that all allocation is done when the model is loaded and the context is created.

using LLama;
using LLama.Common;

string modelPath1 = @"path-to-model-A";
string modelPath2 = @"path-to-model-B";
var inferenceParams = new InferenceParams() { MaxTokens = -1 };
string prompt = 
    $$"""
    Write a short fairy tail with 50 words by starting with the following:
    "Once upon a time many years ago there was a little girl"
    """
    ;

//---MODEL A-----------------------
var parameters1 = new ModelParams(modelPath1)
{
    Seed = 1337,
    GpuLayerCount = -1,                
    ContextSize = 4098,
    //Embeddings = true,
};
using var model1 = await LLamaWeights.LoadFromFileAsync(parameters1);
using var context1 = model1.CreateContext(parameters1);
var executor1 = new InstructExecutor(context1);
//var executor1 = new StatelessExecutor(model1, parameters1);           

//---MODEL B-----------------------
var parameters2 = new ModelParams(modelPath2)
{
    Seed = 1337,
    GpuLayerCount = 20,
    ContextSize = 4098,
    SplitMode = Native.GPUSplitMode.Layer,
    //Embeddings = true,
};
using var model2 = await LLamaWeights.LoadFromFileAsync(parameters2);
using var context2 = model2.CreateContext(parameters2);
var executor2 = new InstructExecutor(context2);
//var executor2 = new StatelessExecutor(model2, parameters2);

{
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine(prompt);
    Console.ForegroundColor = ConsoleColor.White;
    await foreach (var text in executor2.InferAsync(prompt.Trim(), inferenceParams))
    {
        Console.Write(text);
    }
    Console.WriteLine("");
    Console.WriteLine("-------------");
}

//---MODEL A-----------------------
{
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine(prompt);
    Console.ForegroundColor = ConsoleColor.White;
    await foreach (var text in executor1.InferAsync(prompt.Trim(), inferenceParams))
    {
        Console.Write(text);
    }
    Console.WriteLine("");
    Console.WriteLine("-------------");
}

zsogitbe avatar May 30 '24 14:05 zsogitbe

Aha, that shows what I wasn't sure about: whether you're creating both contexts ahead of time or as part of the first inference.

Does this act differently if you use an InstructExecutor vs a StatelessExecutor? The StatelessExecutor is a bit weird - instead of using an existing context it creates a new one for every request.

Also, can you identify exactly which line it is crashing on? I assume in your example code it's the second executor1.InferAsync, but if you dig into that, what is the very last line of C# code on the stack before it crashes (llama_decode, I would guess)?

martindevans avatar May 30 '24 14:05 martindevans

ggml-cuda.cu crashes after this:

 	llama.dll!llama_graph_compute(llama_context & lctx, ggml_cgraph * gf, int n_threads) Line 11094	C++
 	llama.dll!llama_decode_internal(llama_context & lctx, llama_batch batch_all) Line 11336	C++
 	llama.dll!llama_decode(llama_context * ctx, llama_batch batch) Line 17009	C++
 	[Managed to Native Transition]	
=>	LLamaSharp.dll!LLama.Native.SafeLLamaContextHandle.Decode(LLama.Native.LLamaBatch batch) Line 376	C#
 	

The StatelessExecutor does not crash with this example, but the InstructExecutor crashes. It seems that the StatelessExecutor is architected better or uses less memory.

zsogitbe avatar May 30 '24 14:05 zsogitbe

StatelessExecutor does use less memory, sort of. Every call to infer allocates a new context, runs the inference loop and then disposes the context. While it's running it uses the same memory as any other executor, but once it's done that memory is immediately freed. Since your test isn't using the two stateless executors at the same time, only half as much memory is consumed by contexts at any given moment. The other executors, of course, use the context you pass in, and that context remains valid until you dispose it.
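To make that lifecycle concrete, here is a rough sketch of the same pattern done manually with the types from your test program (this is not the actual StatelessExecutor implementation, just the equivalent scoping):

// The context (and therefore its KV cache) only exists for the duration of one request,
// which is roughly the lifecycle StatelessExecutor gives you for every call.
async Task<string> InferOnceAsync(LLamaWeights weights, ModelParams parameters,
                                  string prompt, InferenceParams inferenceParams)
{
    using var context = weights.CreateContext(parameters);    // GPU memory allocated here
    var executor = new InstructExecutor(context);

    var sb = new System.Text.StringBuilder();
    await foreach (var text in executor.InferAsync(prompt, inferenceParams))
        sb.Append(text);

    return sb.ToString();                                      // context disposed on return, memory freed
}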

If the crash is in llama_decode(llama_context * ctx, llama_batch batch) I think you'll need to open an upstream issue to ask about this. There must be extra allocations going on inside the decode calls beyond what's allocated up-front in the context. Depending on what they say your idea of "warming up" the system with a single inference when the model is loaded might be a good idea.

martindevans avatar May 30 '24 14:05 martindevans

GPU memory handling is a sensitive issue at llama.cpp; I have not got an answer to my question about why my models use 20% more GPU memory today compared to approximately one month ago. So I do not think it will help to raise this there.

It is good that we know about it here, and if you want you can correct the issue by adding a dummy inference call to make sure that all memory is allocated. I think it would be a good idea to do this in LLamaSharp, or at least add an optional parameter to trigger it.
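As an illustration of what I mean (these names are hypothetical, not an existing LLamaSharp API), an opt-in warm-up along the lines of the dummy-inference sketch earlier could look roughly like this:

// Hypothetical shape of the proposed opt-in warm-up; none of these names exist in LLamaSharp today.
public static class WarmUpExtensions
{
    public static async Task<InstructExecutor> CreateInstructExecutorAsync(
        this LLamaContext context, bool warmUp = true)
    {
        var executor = new InstructExecutor(context);
        if (warmUp)
        {
            // A single throw-away token forces the remaining GPU allocations now.
            var p = new InferenceParams { MaxTokens = 1 };
            await foreach (var _ in executor.InferAsync(" ", p)) { }
        }
        return executor;
    }
}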

zsogitbe avatar May 31 '24 08:05 zsogitbe

I have further investigated the issue and the library crashes even when using only one model which does not fully fit into GPU memory. The problem is the additional GPU memory allocation: when GPU memory is full, it crashes while trying to allocate the context during the first inference...

I have reported this upstream here: https://github.com/ggerganov/llama.cpp/issues/7885

zsogitbe avatar Jun 13 '24 12:06 zsogitbe

This issue has been automatically marked as stale due to inactivity. If no further activity occurs, it will be closed in 7 days.

github-actions[bot] avatar May 02 '25 00:05 github-actions[bot]