LLamaSharp
IndexOutOfRangeException when calling IKernelMemory.AskAsync()
While running this example, my program crashes with the following error:
Generating answer...
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: freq_base = 1000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 288.00 MiB
llama_new_context_with_model: KV self size = 288.00 MiB, K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 18.57 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 217.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1.50 MiB
llama_new_context_with_model: graph splits (measure): 2
Unhandled exception. System.IndexOutOfRangeException: Index was outside the bounds of the array.
at LLama.LLamaContext.ApplyPenalty(Int32 logits_i, IEnumerable`1 lastTokens, Dictionary`2 logitBias, Int32 repeatLastTokensCount, Single repeatPenalty, Single alphaFrequency, Single alphaPresence, Boolean penalizeNL) in ~/LLamaSharp/LLama/LLamaContext.cs:line 361
at LLama.StatelessExecutor.InferAsync(String prompt, IInferenceParams inferenceParams, CancellationToken cancellationToken)+MoveNext() in ~/LLamaSharp/LLama/LLamaStatelessExecutor.cs:line 109
at LLama.StatelessExecutor.InferAsync(String prompt, IInferenceParams inferenceParams, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
at Microsoft.KernelMemory.Search.SearchClient.AskAsync(String index, String question, ICollection`1 filters, Double minRelevance, CancellationToken cancellationToken)
at Microsoft.KernelMemory.Search.SearchClient.AskAsync(String index, String question, ICollection`1 filters, Double minRelevance, CancellationToken cancellationToken)
at ProgramHelper.AnswerQuestion(IKernelMemory memory, String question) in ~/MLBackend/ProgramHelper.cs:line 110
at Program.<Main>$(String[] args) in ~/MLBackend/Program.cs:line 32
at Program.<Main>(String[] args)
I don't believe this was an issue when I was using Mistral, but it started happening when I switched over to the embedding model, specifically the F32 variant.
Could you try running this:
var parameters = new ModelParams("your_model_path");
using var model = LLamaWeights.LoadFromFile(parameters);
Console.WriteLine(model.NewlineToken);
The code that's crashing is this:
var nl_token = model.NewlineToken;
var nl_logit = logits[(int)nl_token];
So it seems like your model is probably returning something unexpected for the newline token.
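For illustration, with a made-up vocab size, this is exactly the failing pattern (indexing an array with -1 throws IndexOutOfRangeException):
var logits = new float[32000];           // made-up vocab size, for illustration only
var nl_token = -1;                       // what your model appears to report for the newline token
var nl_logit = logits[(int)nl_token];    // System.IndexOutOfRangeException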
I see, it's returning -1, which explains the IndexOutOfRange. Is this an issue with the model itself?
I'm not certain, but it doesn't seem correct for any model to return -1 for the newline token. That would mean the model has no concept of newlines, which is pretty bizarre! If other quantizations of the same model return other values and it's just the F32 one that returns -1, I would say that's certainly an error in the F32 file.
I'm not sure about this yet, but not having a newline token seems to be common for embedding models. For nomic I tested F32, F16, and Q2_K; I then also tried this model, and they all return -1 for their newline token.
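A loop like this reproduces the check (just a sketch; the file names are placeholders and the loading code matches the snippet above):
foreach (var path in new[] { "nomic-embed-f32.gguf", "nomic-embed-f16.gguf", "nomic-embed-q2_k.gguf" })
{
    using var model = LLamaWeights.LoadFromFile(new ModelParams(path));
    Console.WriteLine($"{path}: newline token = {(int)model.NewlineToken}");   // -1 for every one of these
}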
If multiple models are showing the same thing, I guess that must be normal. Very weird! In that case I think NewlineToken should be updated to return LLamaToken? instead of LLamaToken, and all call sites fixed to handle it sometimes being null.
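Roughly the shape I have in mind, as a standalone sketch (the types here are stand-ins for illustration, not the actual LLamaSharp internals):
using System;

class NewlineTokenSketch
{
    // Stand-in for what the model reports: -1 means "the vocab has no newline token".
    static int rawNewlineToken = -1;

    // Proposed shape: expose the token as nullable so callers must handle the missing case.
    static int? NewlineToken => rawNewlineToken < 0 ? (int?)null : rawNewlineToken;

    static void Main()
    {
        var logits = new float[32000];

        // Guarded call site: only index into logits when the token actually exists.
        var nl_token = NewlineToken;
        float? nl_logit = nl_token.HasValue ? logits[nl_token.Value] : (float?)null;

        Console.WriteLine(nl_logit.HasValue ? $"nl logit = {nl_logit}" : "no newline token in this vocab");
    }
}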
It could be that this is intended behaviour. The models I've been testing are for generating embeddings, so it makes sense that they don't have a newline token, since they are never expected to generate text. Doing this
var memory = new KernelMemoryBuilder()
    .WithLLamaSharpTextGeneration(llamaGenerationConfig)
    .WithLLamaSharpTextEmbeddingGeneration(llamaEmbeddingConfig)
resolves the issue, using the embedding model to generate the embeddings and a regular model to generate the output.
WithLLamaSharpDefaults assumes a regular model that is capable of both.
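For completeness, the full setup looks roughly like this (a sketch filling in the pieces around the snippet above; the model paths are placeholders, and LLamaSharpConfig and Build<MemoryServerless>() are the usual LLamaSharp.KernelMemory and Kernel Memory builder pieces):
using LLamaSharp.KernelMemory;
using Microsoft.KernelMemory;

// One config per model: a regular chat model for generation, the embedding model for embeddings.
// The paths are placeholders.
var llamaGenerationConfig = new LLamaSharpConfig("path/to/mistral-7b.Q4_K_M.gguf");
var llamaEmbeddingConfig = new LLamaSharpConfig("path/to/nomic-embed-text.f32.gguf");

var memory = new KernelMemoryBuilder()
    .WithLLamaSharpTextGeneration(llamaGenerationConfig)
    .WithLLamaSharpTextEmbeddingGeneration(llamaEmbeddingConfig)
    .Build<MemoryServerless>();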
In the #662 PR I've modified how tokens are returned from the LLamaSharp API so that it returns nullable tokens, and fixed all of the call sites to handle this. I think your approach there is the right one, though.
https://github.com/SciSharp/LLamaSharp/commit/c325ac91279428e8d88719357b93e43cd9c5ed52#commitcomment-141108660
The same issue happens when using the SemanticKernel integration's ITextGenerationService with an embedding model (nomic).
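Roughly the path that hits it (a sketch, not my exact code; the adapter class name and model path are assumptions about the LLamaSharp.SemanticKernel package):
using LLama;
using LLama.Common;
using LLamaSharp.SemanticKernel.TextCompletion;
using Microsoft.SemanticKernel.TextGeneration;

// Wrap a stateless executor as an ITextGenerationService and prompt it with an embedding-only model.
var parameters = new ModelParams("path/to/nomic-embed-text.f32.gguf");
using var weights = LLamaWeights.LoadFromFile(parameters);
var executor = new StatelessExecutor(weights, parameters);

ITextGenerationService service = new LLamaSharpTextCompletion(executor);
var result = await service.GetTextContentsAsync("Hello");   // fails with the same IndexOutOfRangeException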