[BUG]: GenerateAsync via the IEmbeddingGenerator interface throws ObjectDisposedException on LLama.Native.SafeLLamaContextHandle
Description
When creating embeddings via the Microsoft.Extensions.AI IEmbeddingGenerator interface, the underlying LLama.Native.SafeLLamaContextHandle throws an ObjectDisposedException.
This makes some sense when you look at how the LLamaEmbedder works: it creates a new context for each call to GetEmbeddingsWithTokenCount, which it then disposes before returning.
So the call here https://github.com/SciSharp/LLamaSharp/blob/master/LLama/LLamaEmbedder.EmbeddingGenerator.cs#L46 always hits a disposed context.
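For illustration, here is a minimal self-contained sketch of the failure mode, using hypothetical stand-in types (this is not the LLamaSharp source): the context is created and disposed inside the embedding call, and the IEmbeddingGenerator path then touches the stored handle after it has been disposed.

```csharp
using System;

// Hypothetical stand-ins, not the LLamaSharp source.
sealed class Context : IDisposable
{
    private bool _disposed;
    public void Dispose() => _disposed = true;

    public int Size => _disposed
        ? throw new ObjectDisposedException(nameof(Context))
        : 2048;

    public float[] Embed(string text) => new float[] { text.Length };
}

sealed class Embedder
{
    private Context _context = new();

    public float[] GetEmbeddingsWithTokenCount(string text)
    {
        // 0.25.0 behaviour: a fresh context per call, disposed on return.
        _context = new Context();
        var result = _context.Embed(text);
        _context.Dispose();
        return result;
    }

    // The IEmbeddingGenerator path calls the method above and then reads
    // the stored context again (e.g. for metadata), which always throws:
    public float[] GenerateAsyncStyle(string text)
    {
        var embedding = GetEmbeddingsWithTokenCount(text);
        _ = _context.Size; // ObjectDisposedException
        return embedding;
    }
}
```

Calling GetEmbeddingsWithTokenCount alone succeeds, while GenerateAsyncStyle always throws, which mirrors the repro below.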
Reproduction Steps
Create a LlamaEmbedder and generate an embedding through the IEmbeddingGenerator interface methods.
```csharp
ModelParams model = ... // model parameters, elided
var weights = await LLamaWeights.LoadFromFileAsync(model);
var embedder = new LLamaEmbedder(weights, model);

// Throws ObjectDisposedException via the IEmbeddingGenerator interface:
_ = await embedder.GenerateAsync("What is 1 + 1?");

// Works via the concrete method:
_ = await embedder.GetEmbeddings("What is 1 + 1?");
```
Environment & Configuration
- Operating system: Windows 11
- .NET runtime version: dotnet 9
- LLamaSharp version: 0.25.0
- CUDA version (if you are using cuda backend):
- CPU & GPU device: CPU
Known Workarounds
Downgrade to 0.24.0
You can try rolling back to version 0.24.0.
@misoinc thanks mate, 0.24.0 works! I have updated the Known Workarounds.
PR https://github.com/SciSharp/LLamaSharp/pull/1183 was more breaking than just the GetService method, as commented here.
@martindevans and @zsogitbe I don't have enough context to fully understand why creating and disposing the context on every call to GetEmbeddings is preferable to having the LLamaEmbedder own the lifetime of the context it previously created in the constructor.
I had assumed (possibly incorrectly) that creating the context was heavy and that you want as few of them as possible created. To that end a version of the constructor that took in a context might make sense, then the caller can manage the lifetime of the context externally.
I can provide some time to contribute to a fix around this, just wanted to get some more information around the change and understand the limitations and design considerations that went into them.
@bmazzarol-bunnings,
There are two main reasons we avoid creating and persisting context:
- GPU memory constraints - context data can easily reach sizes of 1GB depending on the application. Since GPU memory is both limited and expensive, retaining such large contexts can lead to inefficient resource usage. Creating the context does not take much time.
- Startup behavior of dependent libraries - many libraries used in combination with LLamaSharp (e.g., SemanticKernel) instantiate both the text generation and embedding services during startup. Persisting context in this scenario would unnecessarily consume GPU resources, potentially preventing other models from fitting into memory and increasing operational costs, especially in cloud environments.
The memory-efficient context handling PR specifically addresses these concerns by optimizing how context is managed. But, as mentioned, Microsoft.Extensions.AI.IEmbeddingGenerator would need more work (I have never used that interface, so I'm not sure if anyone needs it).
> Startup behavior of dependent libraries - many libraries used in combination with LLamaSharp (e.g., SemanticKernel) instantiate both the text generation and embedding services during startup. Persisting context in this scenario would unnecessarily consume GPU resources, potentially preventing other models from fitting into memory and increasing operational costs, especially in cloud environments.
But even when processing a document, the embedding method has to be called many times in a row, often hundreds of times. And for each such call the context is recreated - this slows the work down greatly!
And by the way, what prevents you from creating a new class for such a case, one that creates an embedder only when it executes? You can't design only for the services scenario! (A rough sketch of such a wrapper follows.)
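For illustration, a rough sketch of such a wrapper class (hypothetical, not part of LLamaSharp), assuming the LLamaEmbedder constructor used in the repro above and that GetEmbeddings returns a list of vectors as in recent versions:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

// Hypothetical wrapper, not part of LLamaSharp: the embedder (and its
// context) lives only for the duration of one batch of work, so GPU
// memory is held only while the batch runs.
public sealed class PerBatchEmbedder
{
    private readonly LLamaWeights _weights;
    private readonly ModelParams _modelParams;

    public PerBatchEmbedder(LLamaWeights weights, ModelParams modelParams)
    {
        _weights = weights;
        _modelParams = modelParams;
    }

    public async Task<List<IReadOnlyList<float[]>>> EmbedBatchAsync(IEnumerable<string> texts)
    {
        // One embedder for the whole batch, disposed when the batch is done.
        using var embedder = new LLamaEmbedder(_weights, _modelParams);
        var results = new List<IReadOnlyList<float[]>>();
        foreach (var text in texts)
            results.Add(await embedder.GetEmbeddings(text));
        return results;
    }
}
```

The resources are released between batches, so memory is saved, without paying the allocation cost on every single embedding call.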
@zsogitbe I would suggest this library maintain support for the abstractions as I would wager the community will be coding against them, not the concrete classes defined here. For more information see https://learn.microsoft.com/en-us/dotnet/ai/microsoft-extensions-ai and https://devblogs.microsoft.com/dotnet/introducing-microsoft-extensions-ai-preview/
I would also take note of what @aropb has outlined above: the way LLamaEmbedder is now, it is not going to work for our background RAG processing.
Ideally, you create a context once before each such batch, perform the consecutive method calls, and then delete the context. Clearly there is an architectural problem here!
As a result, I have to rewrite many classes: Reranker, Embedder, etc.
It would be good to get @stephentoub's take on this, as he is the preeminent performance, dotnet, and library-design expert, and this library was lucky enough to have him contribute the code that brought in the integration with the MS abstractions.
I know this comment is cheeky of me, but I am a huge fan, have read every word of his performance blogs, and respect him greatly.
> It would be good to get @stephentoub's take on this, as he is the preeminent performance, dotnet, and library-design expert, and this library was lucky enough to have him contribute the code that brought in the integration with the MS abstractions.
> I know this comment is cheeky of me, but I am a huge fan, have read every word of his performance blogs, and respect him greatly.
Hi. What is the question?
We are at an impasse here. The code as it is now means that for every call to GetEmbeddings a context is created and disposed. All wrapping code in the LLamaEmbedder that needs to access that context will now fail, which fundamentally breaks the methods exposed by the MS abstractions you added, @stephentoub.
I would like to make it work again in a non-breaking way.
My preference would be to revert the code back to how it was in 0.24.0 and add a new constructor that takes a context. That way, if you need to limit the number of contexts created in your application, you can, which I hope alleviates the issues with high GPU usage where multiple contexts exist for both embedding and text generation, as outlined by @zsogitbe.
Then, if your use case requires the context to be disposed of after each embedding generation, you can take ownership and do it yourself.
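To make that concrete, a minimal sketch of the constructor shape I'm proposing (hypothetical signatures, not the current API; LLamaContext and LLamaWeights.CreateContext are existing types):

```csharp
using System;
using LLama;
using LLama.Abstractions;

public sealed class EmbedderSketch : IDisposable
{
    private readonly LLamaContext _context;
    private readonly bool _ownsContext;

    // 0.24.0-style behaviour: the embedder creates and owns its context.
    public EmbedderSketch(LLamaWeights weights, IContextParams @params)
        : this(weights.CreateContext(@params), ownsContext: true)
    {
    }

    // Proposed overload: the caller supplies the context and controls its
    // lifetime, so it can be shared, reused, or disposed per call.
    public EmbedderSketch(LLamaContext context, bool ownsContext = false)
    {
        _context = context;
        _ownsContext = ownsContext;
    }

    public void Dispose()
    {
        if (_ownsContext)
            _context.Dispose();
    }
}
```

An ownsContext flag would keep the default behaviour non-breaking while letting callers opt in to external lifetime management.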
I can work on making these changes as soon as there is some agreement.
aropb & bmazzarol (answer also to: https://github.com/SciSharp/LLamaSharp/issues/1247#issuecomment-3343140018)
I disagree with many of your statements. In my view, context is created only when needed, and unused contexts are not retained unnecessarily, which avoids wasting GPU memory. Creating a context takes just a few milliseconds. I'm also not convinced that clearing a context is faster than creating a new one - you could easily measure that if you're curious. You may have a different problem in your code.
The only issue I see is the one I mentioned in this pull request: https://github.com/SciSharp/LLamaSharp/pull/1183, regarding incompatibility with Microsoft.Extensions.AI.IEmbeddingGenerator. Microsoft has no incentive to conserve GPU memory - they have vast resources in the cloud and tend to use as much memory as possible. That’s why they load everything they can. Wasting GPU memory has become mainstream, especially in Python libraries.
I'm also not a fan of the excessive use of interfaces they promote. When a library follows those patterns, it often ends up trapped in inefficient implementations.
Ok @zsogitbe, let's try to agree on one thing: the cost of creating a context.
One of the core use cases for the embedding generator is in RAG ingestion pipelines. We have to ingest batches of changes in a streaming way into a read replica, creating embeddings on each item for storage alongside the data for use in semantic search.
Here is the performance for batches of 1, 10 and 50, using the current 0.25.0.
// * Detailed results *
LlamaEmbedderBenchmarks.EmbedBatch1: DefaultJob
Runtime = .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2; GC = Concurrent Workstation
Mean = 22.028 ms, StdErr = 0.323 ms (1.46%), N = 95, StdDev = 3.145 ms
Min = 17.611 ms, Q1 = 19.729 ms, Median = 21.057 ms, Q3 = 24.076 ms, Max = 30.815 ms
IQR = 4.347 ms, LowerFence = 13.209 ms, UpperFence = 30.596 ms
ConfidenceInterval = [20.932 ms; 23.125 ms] (CI 99.9%), Margin = 1.096 ms (4.98% of Mean)
Skewness = 0.86, Kurtosis = 3.03, MValue = 2.71
-------------------- Histogram --------------------
[16.707 ms ; 17.887 ms) | @@
[17.887 ms ; 19.697 ms) | @@@@@@@@@@@@@@@@@@@@@@
[19.697 ms ; 20.448 ms) | @@@@@@@
[20.448 ms ; 22.257 ms) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[22.257 ms ; 24.344 ms) | @@@@@@@@
[24.344 ms ; 26.154 ms) | @@@@@@@@@@@@@@
[26.154 ms ; 28.011 ms) | @@@@@@
[28.011 ms ; 29.319 ms) |
[29.319 ms ; 31.129 ms) | @@@@
---------------------------------------------------
LlamaEmbedderBenchmarks.EmbedBatch10: DefaultJob
Runtime = .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2; GC = Concurrent Workstation
Mean = 236.827 ms, StdErr = 3.953 ms (1.67%), N = 96, StdDev = 38.728 ms
Min = 198.499 ms, Q1 = 209.291 ms, Median = 220.005 ms, Q3 = 246.183 ms, Max = 337.610 ms
IQR = 36.892 ms, LowerFence = 153.953 ms, UpperFence = 301.521 ms
ConfidenceInterval = [223.404 ms; 250.250 ms] (CI 99.9%), Margin = 13.423 ms (5.67% of Mean)
Skewness = 1.23, Kurtosis = 3.23, MValue = 2.41
-------------------- Histogram --------------------
[187.398 ms ; 199.949 ms) | @
[199.949 ms ; 222.152 ms) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[222.152 ms ; 247.093 ms) | @@@@@@@@@@@@@@@@@@@@@@
[247.093 ms ; 270.496 ms) | @@@
[270.496 ms ; 292.698 ms) | @@@@@@@@@
[292.698 ms ; 302.146 ms) |
[302.146 ms ; 324.348 ms) | @@@@@@@
[324.348 ms ; 348.711 ms) | @@@@
---------------------------------------------------
LlamaEmbedderBenchmarks.EmbedBatch50: DefaultJob
Runtime = .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2; GC = Concurrent Workstation
Mean = 632.893 ms, StdErr = 3.568 ms (0.56%), N = 57, StdDev = 26.935 ms
Min = 586.910 ms, Q1 = 615.480 ms, Median = 630.105 ms, Q3 = 652.848 ms, Max = 707.415 ms
IQR = 37.368 ms, LowerFence = 559.428 ms, UpperFence = 708.899 ms
ConfidenceInterval = [620.503 ms; 645.283 ms] (CI 99.9%), Margin = 12.390 ms (1.96% of Mean)
Skewness = 0.32, Kurtosis = 2.67, MValue = 3.11
-------------------- Histogram --------------------
[586.523 ms ; 604.895 ms) | @@@@@@@@@@@@@
[604.895 ms ; 631.978 ms) | @@@@@@@@@@@@@@@@
[631.978 ms ; 646.011 ms) | @@@@@@
[646.011 ms ; 664.382 ms) | @@@@@@@@@@@@@@@@@@
[664.382 ms ; 692.188 ms) | @@
[692.188 ms ; 710.559 ms) | @@
---------------------------------------------------
// * Summary *
BenchmarkDotNet v0.15.2, Windows 11 (10.0.26100.4652/24H2/2024Update/HudsonValley)
Intel Core i7-10850H CPU 2.70GHz (Max: 2.71GHz), 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100
[Host] : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2
DefaultJob : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2
| Method | Mean | Error | StdDev | Median | Allocated |
|------------- |----------:|----------:|----------:|----------:|-----------:|
| EmbedBatch1 | 22.03 ms | 1.096 ms | 3.145 ms | 21.06 ms | 22.5 KB |
| EmbedBatch10 | 236.83 ms | 13.423 ms | 38.728 ms | 220.00 ms | 223.62 KB |
| EmbedBatch50 | 632.89 ms | 12.390 ms | 26.935 ms | 630.10 ms | 1121.67 KB |
// * Warnings *
MultimodalDistribution
LlamaEmbedderBenchmarks.EmbedBatch50: Default -> It seems that the distribution can have several modes (mValue = 3.11)
// * Hints *
Outliers
LlamaEmbedderBenchmarks.EmbedBatch1: Default -> 5 outliers were removed (32.35 ms..36.15 ms)
LlamaEmbedderBenchmarks.EmbedBatch10: Default -> 4 outliers were removed (348.86 ms..396.41 ms)
LlamaEmbedderBenchmarks.EmbedBatch50: Default -> 16 outliers were removed (768.87 ms..1.02 s)
// * Legends *
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
Median : Value separating the higher half of all measurements (50th percentile)
Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
1 ms : 1 Millisecond (0.001 sec)
// * Diagnostic Output - MemoryDiagnoser *
// ***** BenchmarkRunner: End *****
Run time: 00:03:50 (230.98 sec), executed benchmarks: 3
Global total time: 00:04:17 (257.87 sec), executed benchmarks: 3
And here it is again if I roll back the code to 0.24.0,
// * Detailed results *
LlamaEmbedderBenchmarks.EmbedBatch1: DefaultJob
Runtime = .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2; GC = Concurrent Workstation
Mean = 8.688 ms, StdErr = 0.018 ms (0.20%), N = 14, StdDev = 0.066 ms
Min = 8.611 ms, Q1 = 8.644 ms, Median = 8.656 ms, Q3 = 8.742 ms, Max = 8.814 ms
IQR = 0.098 ms, LowerFence = 8.496 ms, UpperFence = 8.890 ms
ConfidenceInterval = [8.614 ms; 8.762 ms] (CI 99.9%), Margin = 0.074 ms (0.85% of Mean)
Skewness = 0.58, Kurtosis = 1.77, MValue = 2
-------------------- Histogram --------------------
[8.575 ms ; 8.850 ms) | @@@@@@@@@@@@@@
---------------------------------------------------
LlamaEmbedderBenchmarks.EmbedBatch10: DefaultJob
Runtime = .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2; GC = Concurrent Workstation
Mean = 75.220 ms, StdErr = 0.477 ms (0.63%), N = 96, StdDev = 4.675 ms
Min = 69.180 ms, Q1 = 71.887 ms, Median = 73.920 ms, Q3 = 77.623 ms, Max = 86.841 ms
IQR = 5.735 ms, LowerFence = 63.284 ms, UpperFence = 86.226 ms
ConfidenceInterval = [73.600 ms; 76.840 ms] (CI 99.9%), Margin = 1.620 ms (2.15% of Mean)
Skewness = 0.95, Kurtosis = 3.1, MValue = 2.62
-------------------- Histogram --------------------
[68.881 ms ; 71.758 ms) | @@@@@@@@@@@@@@@@@@@@@@
[71.758 ms ; 74.438 ms) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[74.438 ms ; 78.387 ms) | @@@@@@@@@@@@@@@@@@@@@
[78.387 ms ; 81.744 ms) | @@@@@@@@@@
[81.744 ms ; 84.194 ms) | @@
[84.194 ms ; 86.874 ms) | @@@@@@@@
---------------------------------------------------
LlamaEmbedderBenchmarks.EmbedBatch50: DefaultJob
Runtime = .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2; GC = Concurrent Workstation
Mean = 411.663 ms, StdErr = 1.775 ms (0.43%), N = 14, StdDev = 6.643 ms
Min = 400.598 ms, Q1 = 408.733 ms, Median = 413.265 ms, Q3 = 416.343 ms, Max = 421.615 ms
IQR = 7.610 ms, LowerFence = 397.318 ms, UpperFence = 427.757 ms
ConfidenceInterval = [404.170 ms; 419.157 ms] (CI 99.9%), Margin = 7.494 ms (1.82% of Mean)
Skewness = -0.48, Kurtosis = 1.91, MValue = 2
-------------------- Histogram --------------------
[396.980 ms ; 405.327 ms) | @@@
[405.327 ms ; 425.233 ms) | @@@@@@@@@@@
---------------------------------------------------
// * Summary *
BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.4652)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100
[Host] : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2
DefaultJob : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2
| Method | Mean | Error | StdDev | Median | Allocated |
|------------- |-----------:|----------:|----------:|-----------:|-----------:|
| EmbedBatch1 | 8.688 ms | 0.0741 ms | 0.0657 ms | 8.656 ms | 22.42 KB |
| EmbedBatch10 | 75.220 ms | 1.6202 ms | 4.6747 ms | 73.920 ms | 222.52 KB |
| EmbedBatch50 | 411.663 ms | 7.4940 ms | 6.6432 ms | 413.265 ms | 1117.98 KB |
// * Hints *
Outliers
LlamaEmbedderBenchmarks.EmbedBatch1: Default -> 1 outlier was removed (8.98 ms)
LlamaEmbedderBenchmarks.EmbedBatch10: Default -> 4 outliers were removed (87.85 ms..89.82 ms)
LlamaEmbedderBenchmarks.EmbedBatch50: Default -> 2 outliers were removed (435.35 ms, 444.51 ms)
// * Legends *
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
Median : Value separating the higher half of all measurements (50th percentile)
Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
1 ms : 1 Millisecond (0.001 sec)
// * Diagnostic Output - MemoryDiagnoser *
// ***** BenchmarkRunner: End *****
Run time: 00:01:48 (108.1 sec), executed benchmarks: 3
Global total time: 00:02:27 (147.95 sec), executed benchmarks: 3
Here they are side by side so you can compare:
// 0.25.0
| Method | Mean | Error | StdDev | Median | Allocated |
|------------- |----------:|----------:|----------:|----------:|-----------:|
| EmbedBatch1 | 22.03 ms | 1.096 ms | 3.145 ms | 21.06 ms | 22.5 KB |
| EmbedBatch10 | 236.83 ms | 13.423 ms | 38.728 ms | 220.00 ms | 223.62 KB |
| EmbedBatch50 | 632.89 ms | 12.390 ms | 26.935 ms | 630.10 ms | 1121.67 KB |
// 0.24.0
| Method | Mean | Error | StdDev | Median | Allocated |
|------------- |-----------:|----------:|----------:|-----------:|-----------:|
| EmbedBatch1 | 8.688 ms | 0.0741 ms | 0.0657 ms | 8.656 ms | 22.42 KB |
| EmbedBatch10 | 75.220 ms | 1.6202 ms | 4.6747 ms | 73.920 ms | 222.52 KB |
| EmbedBatch50 | 411.663 ms | 7.4940 ms | 6.6432 ms | 413.265 ms | 1117.98 KB |
- Batch 1: [((22.03 - 8.688) / 8.688) * 100 ≈ 153.6%] increase
- Batch 10: [((236.83 - 75.22) / 75.22) * 100 ≈ 214.8%] increase
- Batch 50: [((632.89 - 411.663) / 411.663) * 100 ≈ 53.7%] increase (the sample size is very low here; my PC was struggling)
Creating a context on each embed call increases mean execution time by ~154% (batch 1), ~215% (batch 10), and ~54% (batch 50) compared to creating it once in the constructor. That IMHO is not a small cost.
bmazzarol-bunnings, nice analysis! I have a few questions:
- Do you use GPU or CPU?
- It should not make any difference for EmbedBatch1 whether you pre-allocate the context or allocate it later, so those results should be the same. Why do you have 22 ms in 0.25.0 and 9 ms in 0.24.0? There is something wrong here!
- There is a very strange variation in your results that is not logical. The increase in processing time should be linear as you increase the batch size. Why these strange jumps in the numbers? I have the feeling that something is wrong here, but I do not think it is the allocation of the context.
> aropb & bmazzarol (answer also to: #1247 (comment))
> I disagree with many of your statements. In my view, context is created only when needed, and unused contexts are not retained unnecessarily, which avoids wasting GPU memory. Creating a context takes just a few milliseconds. I'm also not convinced that clearing a context is faster than creating a new one - you could easily measure that if you're curious. You may have a different problem in your code.
> The only issue I see is the one I mentioned in this pull request: #1183, regarding incompatibility with Microsoft.Extensions.AI.IEmbeddingGenerator. Microsoft has no incentive to conserve GPU memory - they have vast resources in the cloud and tend to use as much memory as possible. That's why they load everything they can. Wasting GPU memory has become mainstream, especially in Python libraries. I'm also not a fan of the excessive use of interfaces they promote. When a library follows those patterns, it often ends up trapped in inefficient implementations.
After upgrading to LLamaSharp 0.25.0, I did not immediately notice that GPU utilization during embedding generation dropped to 10%, versus 30-40% on version 0.24.0. At the same time, embedding creation time roughly doubled. I am using LLamaSharp directly, so I am sure the changes are related to the re-creation of the context. As soon as I moved the context creation to the constructor, everything returned to how it was. You are wrong to think that allocating 1-2 GB of VRAM 100 times in a row is a fast operation.
And what prevents you from checking all this, if you don't believe it?
In general, by recreating the context you violate the classic rules of writing efficient code: if an operation is performed many times in a row, do everything you can up front, once, before the loop - especially resource allocation.
Feel free to run it, @zsogitbe; I have linked to the source for both.
The batch-1 case is slower because of the context. The cost of constructing the embedder is not counted in the benchmark code. Feel free to move the construction of the embedder into the benchmark method (and don't forget to count the cost of disposal as well); then you will get around the same time for both versions.
But it changes very little: the claim that context construction is cheap is not borne out by the benchmark results. A sketch of the benchmark setup follows.
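For reference, a minimal sketch of roughly how the benchmarks above are structured (hypothetical model path; not the exact benchmark source):

```csharp
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using LLama;
using LLama.Common;

[MemoryDiagnoser]
public class LlamaEmbedderBenchmarks
{
    // Hypothetical path; substitute your own GGUF embedding model.
    private const string ModelPath = "path/to/embedding-model.gguf";

    private ModelParams _modelParams = null!;
    private LLamaWeights _weights = null!;
    private LLamaEmbedder _embedder = null!;

    [GlobalSetup]
    public async Task Setup()
    {
        _modelParams = new ModelParams(ModelPath);
        _weights = await LLamaWeights.LoadFromFileAsync(_modelParams);
        // Constructed once, outside the measured code - so with 0.24.0 the
        // context cost is excluded, while with 0.25.0 every measured call
        // pays for context creation and disposal internally.
        _embedder = new LLamaEmbedder(_weights, _modelParams);
    }

    [Benchmark]
    public async Task EmbedBatch10()
    {
        for (var i = 0; i < 10; i++)
            _ = await _embedder.GetEmbeddings($"sample text {i}");
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<LlamaEmbedderBenchmarks>();
}
```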
When the context is frequently recreated, the probability of a context allocation error on the GPU increases. Since this operation is in llama.cpp it doesn't seem to be multithreaded.
bmazzarol, let's assume your benchmark is accurate and it takes 22 ms to run EmbedBatch1. Of that time, the majority is likely spent generating the embedding itself, so we can estimate around 5 ms (0.005 sec) for context allocation - which is quite fast.
For typical applications, it's generally better not to pre-allocate the context, as this can lead to unnecessary GPU memory usage. However, in specialized scenarios - say, if your application generates millions of embeddings daily and does nothing else - it might be worth exploring ways to optimize embedding creation. In that case, my advice would be to approach context reuse with caution. Reusing context for subsequent embedding generations can introduce subtle issues, so it's important to validate that it behaves reliably under your workload.
It is important to make a comparison on the GPU, for a text of at least 1000 tokens, with a context of 2048 (batchsize=2048).
LLM: https://huggingface.co/Qwen/Qwen3-Embedding-0.6B-GGUF/blob/main/Qwen3-Embedding-0.6B-f16.gguf
It looks like the main disagreement here is a fundamental difference in caring about preserving memory or time, correct me if I'm wrong. Given that, it seems like the correct fix is to accept a context as a parameter in the constructor of the embedder. That way:
- If you want to save memory, you can dispose and recreate the context+embedder for every request.
- If you want to save time, you can create the context+embedder once, and re-use it for every request.
(I think aropb already suggested this somewhere else)
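For illustration, usage under that proposal might look like this; the context-accepting LLamaEmbedder constructor is hypothetical (not the current API):

```csharp
using LLama;
using LLama.Common;

ModelParams modelParams = new("path/to/model.gguf"); // hypothetical path
using var weights = await LLamaWeights.LoadFromFileAsync(modelParams);

// Save time: one context, created once and reused for every request.
using (var context = weights.CreateContext(modelParams))
{
    var embedder = new LLamaEmbedder(context); // hypothetical overload
    foreach (var text in new[] { "first", "second", "third" })
        _ = await embedder.GetEmbeddings(text);
}

// Save memory: context and embedder created and disposed per request.
foreach (var text in new[] { "first", "second", "third" })
{
    using var context = weights.CreateContext(modelParams);
    var embedder = new LLamaEmbedder(context); // hypothetical overload
    _ = await embedder.GetEmbeddings(text);
}
```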
I'm in favor of saving VRAM (this is really important), but not at the expense of performance. I want to control my own resources: if I know that I'm going to have 100 calls in a row, I create a context once beforehand. That saves memory too, and I don't lose performance.
I don't see the point in passing the context to the constructor. It's better to make it a general rule: create the context in the constructor, once, and clear it on each inference. When using libraries like KernelMemory, you create a new class that builds an Embedder on each call (if you really want to save memory). That's all.
> It looks like the main disagreement here is a fundamental difference in caring about preserving memory or time, correct me if I'm wrong. Given that, it seems like the correct fix is to accept a context as a parameter in the constructor of the embedder. That way:
> - If you want to save memory, you can dispose and recreate the context+embedder for every request.
> - If you want to save time, you can create the context+embedder once, and re-use it for every request.
> (I think aropb already suggested this somewhere else)
Yes, this is the solution I would like to see, and I would be happy to implement it.
> I don't see the point in passing the context to the constructor.
The idea is to allow you to share the context even more widely if you wish, leaving you in control of resources. For example, you could have a single context created at application startup that is used for everything; a sketch follows.
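A sketch of that startup pattern, again assuming the hypothetical context-accepting overload, wired into Microsoft.Extensions.DependencyInjection:

```csharp
using LLama;
using LLama.Common;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.DependencyInjection;

// One context for the whole application, created at startup.
ModelParams modelParams = new("path/to/model.gguf"); // hypothetical path
using var weights = LLamaWeights.LoadFromFile(modelParams);
using var context = weights.CreateContext(modelParams);

var services = new ServiceCollection();
// LLamaEmbedder implements IEmbeddingGenerator<string, Embedding<float>>;
// the context-accepting constructor is the hypothetical overload above.
services.AddSingleton<IEmbeddingGenerator<string, Embedding<float>>>(
    _ => new LLamaEmbedder(context));

await using var provider = services.BuildServiceProvider();
var generator = provider.GetRequiredService<IEmbeddingGenerator<string, Embedding<float>>>();
_ = await generator.GenerateAsync("What is 1 + 1?");
```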
I don't see any such scenarios for myself.