[BUG]: Slowdown after switching to the new version, LLamaSharp 0.16.0
Description
After upgrading from KM 0.61 and LLamaSharp 0.13 to the latest versions, ImportDocumentAsync/AskAsync processing speed has decreased significantly, almost 2x. All settings and hardware are the same. I can see that the GPU is now less loaded; previously CUDA utilization was above 80%.
Import:
Ask:
How do I get back to my previous speed? Thanks.
Reproduction Steps
Environment & Configuration
Operating system: Windows 11
.NET runtime version: 8.0.8
LLamaSharp version: 0.16.0
KernelMemory: 0.71...
CUDA version (if you are using cuda backend): 12
CPU & GPU device: Intel Core Ultra 9 & RTX 3090
Known Workarounds
No response
Launched Llama native
Apparently there was a similar problem: https://github.com/SciSharp/LLamaSharp/issues/601
GPU utilization is not more than 30%.
This is my test code:
ModelParams = new(Program.AppSettings.ModelPath)
{
Embeddings = false,
ContextSize = 8192,
Seed = 1337,
MainGpu = 0,
GpuLayerCount = 100,
SplitMode = GPUSplitMode.Layer,
BatchSize = 2048,
UBatchSize = 2048
};
Weights = LLamaWeights.LoadFromFile(ModelParams);
StatelessExecutor executor = new(Weights, ModelParams);
InferenceParams inferenceParams = new()
{
SamplingPipeline = new DefaultSamplingPipeline()
{
Temperature = 0.5f,
RepeatPenalty = 1.0f
},
MaxTokens = 2000,
AntiPrompts = ["\n"]
};
StringBuilder sb = new(2048);
await foreach (string text in executor.InferAsync(value, inferenceParams, cancellationToken))
{
sb.Append(text);
}
return sb.ToString();
Log:
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 2240.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 96.02 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
The compute buffer is large: CUDA0 compute buffer size = 2240.02 MiB.
This is because n_ubatch = 2048, but I can't set it any lower because LLamaSharp checks that n_batch == n_ubatch. Why?
public LLamaEmbedder(LLamaWeights weights, IContextParams @params, ILogger? logger = null)
{
if (@params.UBatchSize != @params.BatchSize)
throw new ArgumentException("For non-causal models, batch size must be equal to ubatch size", nameof(@params));
if (weights.NativeHandle is { HasEncoder: true, HasDecoder: true })
throw new NotSupportedException("Computing embeddings in encoder-decoder models is not supported");
Context = weights.CreateContext(@params, logger);
NativeApi.llama_set_embeddings(Context.NativeHandle, true);
}
And I need to set n_batch to 2048, otherwise it throws an error.
Does anyone have the same behavior?
What model are you using, how many tokens are you processing at once, and how are you measuring GPU utilisation?
Model: ggml-model-Q4_K_M.gguf (https://huggingface.co/ruslandev/llama-3-8b-gpt-4o-ru1.0-gguf/blob/main/ggml-model-Q4_K_M.gguf)
In the code above, the value passed as the prompt is: "come up with and write a poem about summer".
Measuring GPU utilisation: the graphs are above, and the most important thing is the 2x increase in response time and in embedding creation time. I'm looking at the CUDA usage chart.
The same model works correctly in version 0.13.0; GPU (CUDA) utilization is above 80%.
https://huggingface.co/lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf:
This problem only occurs on the GPU.
I found out why the performance drops sharply: it's the TypicalP parameter. All the graphs above were produced with a value of 0 (the default value in DefaultSamplingPipeline())!
If TypicalP = 1
This is already much better. But still, I only see a load of about 60%; how can I achieve more than 80%, as with LLama native? And why does this parameter have such an effect?
The default values in DefaultSamplingPipeline should be the same as the default sampling values in the llama.cpp main application. Typical-p is documented as having a default value of 1.0, so it looks like that's a bug in the default values for DefaultSamplingPipeline.
Would you like to submit a PR changing that default value? If so, and if you want to go put in some extra effort, checking all of the default values against the readme I linked above would be a good idea :)
> And why does this parameter have such an effect?
All the parameter is doing is enabling a new sampling stage in llama.cpp (it's disabled entirely when the value is 1.0), so presumably it's just that expensive to calculate typical-p.
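Until that default is changed, a minimal sketch of the caller-side workaround, reusing only the parameter names already shown in the test code earlier in the thread:

```csharp
using LLama.Common;
using LLama.Sampling;

// Workaround sketch: explicitly set TypicalP = 1.0f so the typical-p
// sampling stage is disabled entirely, matching the llama.cpp default.
InferenceParams inferenceParams = new()
{
    SamplingPipeline = new DefaultSamplingPipeline()
    {
        Temperature = 0.5f,
        RepeatPenalty = 1.0f,
        TypicalP = 1.0f   // 0 (the current default) enables an expensive extra sampling stage
    },
    MaxTokens = 2000,
    AntiPrompts = ["\n"]
};
```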
These two parameters are too high; you should decrease them to 512, which is the optimal value for speed and GPU memory use (see the sketch below):
BatchSize = 2048,
UBatchSize = 2048
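For reference, a minimal sketch of that change applied to the test configuration from earlier in the thread (plain text generation only):

```csharp
// Same setup as the test code above, with only the batch sizes reduced as suggested.
ModelParams = new(Program.AppSettings.ModelPath)
{
    Embeddings = false,
    ContextSize = 8192,
    Seed = 1337,
    MainGpu = 0,
    GpuLayerCount = 100,
    SplitMode = GPUSplitMode.Layer,
    BatchSize = 512,    // reduced from 2048
    UBatchSize = 512    // reduced from 2048
};
```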
It would be very easy for you to carefully read everything that I have written; I spent a lot of time on this, and a lot of questions have already piled up. The speed has increased and utilization is at 60%, but still not the 80% it should be. I set all the defaults to the values they should have; you should set them in the constructor for everyone else. I can't use BatchSize = 512 because that size is not enough for me, and for some reason BatchSize and UBatchSize are checked for equality. LLama native uses batchsize = 2048, ubatchsize = 512.
It is sometimes frustrating, but if you dig deeper and learn a bit more about the code, then you will be able to adjust it to your requirements.
From your remark above starting with "Launched Llama native", I can see in that output that n_batch = 2048 and n_ubatch = 512!
You should also rethink why a batch size of 512 is not enough for you! Maybe learn a bit more about the batch size; it should normally be "enough".
I use KM. When creating embeddings, a value less than BatchSize = 2048 does not work; it throws an error. But the problem is not even that, it is this:
public LLamaEmbedder(LLamaWeights weights, IContextParams @params, ILogger? logger = null)
{
    if (@params.UBatchSize != @params.BatchSize)
    {
        throw new ArgumentException("For non-causal models, batch size must be equal to ubatch size", "params");
    }
}
I am using batch size 512 with KM with different models having different vector sizes and I have never had a problem. 512 is the optimal value for speed and GPU memory use. Your problem will be somewhere else!
I'll check it again. Are you using 512/512? ContextSize?
The issue of GPU utilization is still relevant.
> I'll check it again. Are you using 512/512? ContextSize?
I get the error "Input contains more tokens than configured batch size (Parameter 'batch')". I only need to increase BatchSize, not UBatchSize.
Regarding the @params.UBatchSize != @params.BatchSize check: this is inherited from here.
Unfortunately the embedder can't split input into multiple batches at the moment, so if you are getting embeddings for huge chunks of text you will need an unusually large batch size. Again, this is inherited from that code linked above, which the new embedder is basically ported from.
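To make the constraint concrete, here is a minimal sketch of an embedder setup for large chunks. The model path is a placeholder, and a GetEmbeddings(string) overload returning a single vector is assumed for this version; the key point is that BatchSize must cover the longest chunk in tokens and must equal UBatchSize:

```csharp
using LLama;
using LLama.Common;

// Sketch only: the embedder processes each input in a single batch, so
// BatchSize must be at least the token count of the longest chunk, and the
// constructor check requires UBatchSize == BatchSize.
var embedParams = new ModelParams("path/to/model.gguf")   // placeholder path
{
    Embeddings = true,
    ContextSize = 8192,
    GpuLayerCount = 100,
    BatchSize = 2048,    // >= tokens in the longest chunk
    UBatchSize = 2048    // must equal BatchSize for LLamaEmbedder
};

using var weights = LLamaWeights.LoadFromFile(embedParams);
using var embedder = new LLamaEmbedder(weights, embedParams);

// Assumed overload: returns the embedding vector for one chunk of text.
float[] embedding = await embedder.GetEmbeddings("...chunk text...");
```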
> The issue of GPU utilization is still relevant.
Just a note about this: I think the measurements you're showing might be a little misleading. The typical-p thing gives us a clue - that was adding extra work on the CPU side (sampling is not done on the GPU). So it doesn't lower the actual peak utilisation, but it does increase the amount of idle GPU time between inference steps (while sampling is running). That's a slightly different thing: it just shows up as lower utilisation because the sample rate in Task Manager is too low.
But I'm comparing the LLama native and LLamaSharp graphs on the same machine, in the same configuration, with TypicalP = 1.
I was just pointing out that what you're measuring is mean GPU utilisation, not peak utilisation (because of the low sample rate of task manager). Slightly different things which have very different causes. e.g. the typical-p thing has nothing to do with the GPU, so if we were just looking for GPU related things we'd never have found it!
OK, but why does LLama native show more than 80%?
> Regarding the @params.UBatchSize != @params.BatchSize check: this is inherited from here.
> Unfortunately the embedder can't split input into multiple batches at the moment, so if you are getting embeddings for huge chunks of text you will need an unusually large batch size. Again, this is inherited from that code linked above, which the new embedder is basically ported from.
It does not make any sense to use huge chunks of text for embeddings. The optimal chunk size is between 100 and 500 words depending on the task (document retrieval is on the lower side and summarization on the higher side). So, a maximum of around 500 words. This should fit into a normal batch size of 512. If you decrease these numbers, your GPU utilization will increase.
I use a chunk size of 1000 words; experimentally, this is the best size for me. But the batch size question is not really the topic here.
The GPU load problem shows up with normal text generation, not with embedding. Embedding uses 80% of the GPU!!!
You can check everything yourself: take LLama native and look at the GPU load, then look at similar output via LLamaSharp. You're trying to convince me that there's no problem.
The last graph here shows normal text generation with batch size 512/512.
Embedding comes first (80%), then a break, then GenerateText (60%):
Embedding and text generation aren't really comparable workloads. Embedding processes one chunk of text all at once (it's similar to prompt processing). Whereas text generation has to constantly go back and forth between GPU (generating logits) and CPU (sampling logits to produce tokens). You'd expect GPU load to be lower in that case.
It was just additional information. You persistently ignore the important things that I write :) I am comparing the utilization of LLama native and LLamaSharp doing the same job.
I'm not ignoring anything, I'm just pointing out potentially useful information to help you debug the issue.
I'm unlikely to find the reason myself, because I don't have deep enough knowledge of LLMs and llama native. I thought my information would help the experts understand where the performance problem might be. Can you compare GPU utilization on your PC with LLama native & LLamaSharp?
I'm unlikely to be able to find time in September for anything more than issue triage and PR reviews unfortunately. I'll be happy to dig into perf stuff in October though (probably after doing a 0.17 update).
Thank you very much for your work! I am ready to participate in improving LLamaSharp as much as my knowledge allows.
We also need to move everything over to using the SamplingPipeline. For example, StatelessExecutor.InferAsync() still contains:

var repeat_last_n = Math.Max(0, inferenceParams.RepeatLastTokensCount < 0 ? _weights.ContextSize : inferenceParams.RepeatLastTokensCount);

but inferenceParams.RepeatLastTokensCount is marked obsolete!
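To illustrate the inconsistency, a hedged sketch of what a caller currently ends up doing: the penalty strength lives on the sampling pipeline, while the executor still reads the window size from the obsolete property (64 is just an example value):

```csharp
using LLama.Common;
using LLama.Sampling;

// Sketch only: RepeatPenalty is configured on the pipeline, but
// StatelessExecutor.InferAsync still reads the obsolete
// InferenceParams.RepeatLastTokensCount to size the penalty window.
InferenceParams inferenceParams = new()
{
    SamplingPipeline = new DefaultSamplingPipeline()
    {
        RepeatPenalty = 1.1f             // penalty strength lives on the pipeline
    },
#pragma warning disable CS0612, CS0618   // RepeatLastTokensCount is marked obsolete
    RepeatLastTokensCount = 64           // example value; still consumed by the executor
#pragma warning restore CS0612, CS0618
};
```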