[BUG]: Drop in performance of creating embeddings for chunks
Description
Please keep in mind that embeddings for chunks are created many times for a single document. The code is written so that the context is constantly recreated, which is very inefficient. As a result, GPU utilization drops to 10%, performance falls sharply, and the time to create embeddings for document chunks roughly doubles. This is critical.
This logic is clearly not required and only hurts performance:
```csharp
Context = weights.CreateContext(@params, logger);
Context.Dispose();
```
Known Workarounds
LLamaSharp 0.25.0. The corrected version of the code:
```csharp
public class LLamaSharpEmbedder : IDisposable
{
    public int EmbeddingSize { get; private set; }

    public LLamaContext Context { get; private set; }

    //private readonly LLamaWeights _weights;
    //private readonly IContextParams _params;
    //private readonly ILogger _logger;

    /// <summary>
    /// Create a new embedder, using the given LLamaWeights
    /// </summary>
    /// <param name="weights"></param>
    /// <param name="params"></param>
    /// <param name="logger"></param>
    public LLamaSharpEmbedder(LLamaWeights weights, IContextParams @params, ILogger logger = null)
    {
        if (@params.UBatchSize != @params.BatchSize)
            throw new ArgumentException("For non-causal models, batch size must be equal to ubatch size", nameof(@params));
        if (weights.NativeHandle is { HasEncoder: true, HasDecoder: true })
            throw new NotSupportedException("Computing embeddings in encoder-decoder models is not supported");

        // Create the context once, for the lifetime of the embedder.
        Context = weights.CreateContext(@params, logger);
        NativeApi.llama_set_embeddings(Context.NativeHandle, true);
        EmbeddingSize = Context.EmbeddingSize;

        //Context.Dispose();
        //_weights = weights;
        //_params = @params;
        //_logger = logger;
    }

    /// <inheritdoc />
    public void Dispose()
    {
        Context.Dispose();
    }

    /// <summary>
    /// Get high dimensional embedding vectors for the given text. Depending on the pooling type used when constructing
    /// this <see cref="LLamaEmbedder"/> this may return an embedding vector per token, or one single embedding vector for the entire string.
    /// </summary>
    /// <remarks>Embedding vectors are not normalized, consider using one of the extensions in <see cref="SpanNormalizationExtensions"/>.</remarks>
    /// <param name="input"></param>
    /// <param name="cancellationToken"></param>
    /// <returns></returns>
    /// <exception cref="RuntimeError"></exception>
    /// <exception cref="NotSupportedException"></exception>
    public async Task<IReadOnlyList<float[]>> GetEmbeddings(string input, CancellationToken cancellationToken = default) =>
        (await GetEmbeddingsWithTokenCount(input, cancellationToken).ConfigureAwait(false)).Embeddings;

    private async Task<(IReadOnlyList<float[]> Embeddings, int Tokens)> GetEmbeddingsWithTokenCount(string input, CancellationToken cancellationToken = default)
    {
        // Ensure the context from last time is disposed (it always should be)
        //if (!Context.NativeHandle.IsClosed)
        //    Context.Dispose();
        //Context = _weights.CreateContext(_params, _logger);
        //NativeApi.llama_set_embeddings(Context.NativeHandle, true);

        // Reuse the existing context: just clear its memory (KV cache) between calls.
        Context.NativeHandle.MemoryClear();

        // Add all of the tokens to the batch
        var tokens = Context.Tokenize(input, addBos: true, special: true);
        if (tokens.Length > Context.ContextSize)
            throw new ArgumentException($"Embedding prompt is longer than the context window ({tokens.Length} > {Context.ContextSize})", nameof(input));

        // Check if we should cancel the work, just before doing anything expensive (encode/decode)
        cancellationToken.ThrowIfCancellationRequested();

        // Evaluate prompt in batch-size chunks
        var n_past = 0;
        var batch = new LLamaBatch();
        var batchSize = (int)Context.Params.BatchSize;
        for (var i = 0; i < tokens.Length; i += batchSize)
        {
            var n_eval = tokens.Length - i;
            if (n_eval > batchSize)
                n_eval = batchSize;

            batch.Clear();
            batch.AddRange(tokens.AsSpan(i, n_eval), n_past, LLamaSeqId.Zero, true);
            n_past += n_eval;

            // Run model
            switch (Context.NativeHandle.ModelHandle.HasEncoder, Context.NativeHandle.ModelHandle.HasDecoder)
            {
                case (true, false):
                {
                    var result = await Context.EncodeAsync(batch, cancellationToken);
                    if (result != EncodeResult.Ok)
                        throw new RuntimeError($"Failed to encode: {result}");
                    break;
                }
                case (false, true):
                {
                    var result = await Context.DecodeAsync(batch, cancellationToken);
                    if (result != DecodeResult.Ok)
                        throw new RuntimeError($"Failed to decode: {result}");
                    break;
                }
                default:
                    throw new NotSupportedException("Unsupported model type");
            }
        }

        // Extract results
        var poolingType = Context.NativeHandle.PoolingType;
        var resultsCount = poolingType == LLamaPoolingType.None ? tokens.Length : 1;
        var results = new List<float[]>(resultsCount);

        if (poolingType == LLamaPoolingType.None)
        {
            // GetLogitPositions() is not accessible here, so per-token (unpooled)
            // embeddings cannot be extracted in this workaround:
            /*
            var positions = batch.GetLogitPositions();
            foreach (var (_, pos) in positions)
                results.Add(Context.NativeHandle.GetEmbeddingsIth(pos).ToArray());
            */
        }
        else
        {
            results.Add(Context.NativeHandle.GetEmbeddingsSeq(LLamaSeqId.Zero).ToArray());
        }

        // Normalize the embeddings vector
        // https://github.com/ggerganov/llama.cpp/blob/2891c8aa9af17f4ff636ff3868bc34ff72b56e25/examples/embedding/embedding.cpp#L92
        foreach (var embedding in results)
        {
            embedding.EuclideanNormalization();
        }

        //Context.Dispose();
        return (results, tokens.Length);
    }
}
```
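For illustration, a minimal usage sketch of this class (the `weights`, `ctxParams`, and `chunks` variables are assumed to exist in the caller's scope): the embedder is constructed once and reused for every chunk, so the context is allocated exactly once.

```csharp
// Construct the embedder once, reuse it for every chunk of the document.
using var embedder = new LLamaSharpEmbedder(weights, ctxParams);

var allEmbeddings = new List<float[]>();
foreach (var chunk in chunks)
{
    // Each call only clears the KV cache; no context is created or destroyed.
    var vectors = await embedder.GetEmbeddings(chunk);
    // With pooling enabled there is exactly one vector per input string.
    allEmbeddings.Add(vectors[0]);
}
```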
@martindevans I suggest using the following approach wherever context is used:
- A class that implements IDisposable
- The context is always created in the constructor (once)
- InferAsync (or any other inference method) starts by executing Context.NativeHandle.MemoryClear();
Why this is important: the developer decides when to create the object, and when the class's methods are called many times in a row (for example, when summarizing or creating embeddings), the context is not constantly recreated. This gives everyone a significant speed gain (see the sketch below).
I rewrote all the inference methods according to this scheme, and everything works correctly and fast!
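For illustration, a minimal sketch of the same scheme applied to a generation class (the name `LLamaSharpGenerator` and the elided decode loop are hypothetical, not LLamaSharp API; only the context lifetime handling is the point):

```csharp
// Hypothetical sketch: the proposed lifetime pattern applied to an inference class.
public class LLamaSharpGenerator : IDisposable
{
    public LLamaContext Context { get; }

    public LLamaSharpGenerator(LLamaWeights weights, IContextParams @params, ILogger logger = null)
    {
        // Created exactly once, for the lifetime of the object.
        Context = weights.CreateContext(@params, logger);
    }

    public Task<string> InferAsync(string prompt, CancellationToken cancellationToken = default)
    {
        // Reset state instead of recreating the context.
        Context.NativeHandle.MemoryClear();

        // ... tokenize the prompt, run the usual decode/sampling loop on Context,
        // and return the generated text (elided in this sketch) ...
        throw new NotImplementedException("Decode loop elided in this sketch");
    }

    public void Dispose() => Context.Dispose();
}
```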
It would also be very good to add protection for multi-threaded context creation/deletion at the native level.
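Until such protection exists at the native level, a managed-side guard could look like this (a sketch; `GuardedContext` is a hypothetical helper, not part of LLamaSharp):

```csharp
// Sketch of a managed-side guard for context creation/disposal.
// A single shared lock serializes these calls into the native backend.
public static class GuardedContext
{
    private static readonly object NativeLock = new();

    public static LLamaContext Create(LLamaWeights weights, IContextParams @params, ILogger logger = null)
    {
        lock (NativeLock)
            return weights.CreateContext(@params, logger);
    }

    public static void Dispose(LLamaContext context)
    {
        lock (NativeLock)
            context.Dispose();
    }
}
```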
Please check out the comments in https://github.com/SciSharp/LLamaSharp/issues/1259. This is not a bug, but efficient context handling, and we really need it like this as standard behavior. You will need a different solution: a text embedding service that you instantiate with a context for the lifetime of the service.
LLamaSharp is a great product; it would be right if the rules for using the context were the same everywhere and let the developer choose between two options:
- creating the context in the constructor for the entire lifetime of the object (with the context cleared before each method call)
- creating the context directly inside each method call
It is important that the approach is the same everywhere, as a standard. What exists now only works well for one scenario!
I am sure that all the developers will support me.
Besides, this is clearly not required:
```csharp
Context = weights.CreateContext(@params, logger);
Context.Dispose();
```
@zsogitbe
On reflection, in my opinion we still need to settle on one basic behavior for the LLamaSharp base classes: create the context in the constructor, clear the context in the inference method. This is enough to implement all other usage scenarios. Services will need special classes that use the base classes however they need to; for example, they can create the LLamaSharp base class locally and do what they need with it.
It is important that all classes behave the same way. I consider the current behavior, where a context is always created during execution, to be a mistake!
I think I can say this after a year of practical use of LLamaSharp.
> Besides, this is clearly not required
It is unfortunate but it is required with the current design. The embedder has this property:
```csharp
public LLamaContext Context { get; private set; }
```
Which must be non-null once the constructor returns. If we got rid of that property, or made it nullable, the context creation/disposal in the constructor could be removed.
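For illustration, a sketch of what that alternative might look like (hypothetical, not the actual LLamaSharp implementation; it reuses the `_weights`/`_params`/`_logger` fields shown commented out in the workaround above):

```csharp
// Hypothetical sketch: make the property nullable and create the context lazily,
// so the constructor no longer needs the create-then-dispose dance.
public LLamaContext? Context { get; private set; }

private LLamaContext GetOrCreateContext()
{
    // Created on first use instead of in the constructor.
    return Context ??= _weights.CreateContext(_params, _logger);
}
```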
I used KernelMemory and there were a lot of performance and manageability issues, but now I have completely gotten rid of it and I am happy. I think LLamaSharp should not implement interfaces for various frameworks at the basic level; this creates problems. LLamaSharp should be as simple as possible (nothing superfluous), flexible (for various scenarios), manageable, and as performant as possible (both in memory and in output speed) at the basic level. Everything else is the next level.
```csharp
public LLamaContext Context { get; private set; }
```
For direct use of LLamaSharp this is completely unnecessary; I am fine without it. Moreover, accessing a disposed context will cause an error in any case.
> Besides, this is clearly not required
>
> It is unfortunate but it is required with the current design. The embedder has this property:
>
> `public LLamaContext Context { get; private set; }`
>
> Which must be non-null once the constructor returns. If we got rid of that property, or made it nullable, the context creation/disposal in the constructor could be removed.
By the way, if you always move context creation into the constructor, this hack will not be needed; during execution, only clear the context. For services and frameworks, you just need to create higher-level classes.