LLamaSharp
IndexOutOfRangeException when calling IKernelMemory.AskAsync()
While running this example, my program crashes with the following error:
Generating answer...
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: freq_base = 1000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 288.00 MiB
llama_new_context_with_model: KV self size = 288.00 MiB, K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 18.57 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 217.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1.50 MiB
llama_new_context_with_model: graph splits (measure): 2
Unhandled exception. System.IndexOutOfRangeException: Index was outside the bounds of the array.
at LLama.LLamaContext.ApplyPenalty(Int32 logits_i, IEnumerable`1 lastTokens, Dictionary`2 logitBias, Int32 repeatLastTokensCount, Single repeatPenalty, Single alphaFrequency, Single alphaPresence, Boolean penalizeNL) in ~/LLamaSharp/LLama/LLamaContext.cs:line 361
at LLama.StatelessExecutor.InferAsync(String prompt, IInferenceParams inferenceParams, CancellationToken cancellationToken)+MoveNext() in ~/LLamaSharp/LLama/LLamaStatelessExecutor.cs:line 109
at LLama.StatelessExecutor.InferAsync(String prompt, IInferenceParams inferenceParams, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
at Microsoft.KernelMemory.Search.SearchClient.AskAsync(String index, String question, ICollection`1 filters, Double minRelevance, CancellationToken cancellationToken)
at Microsoft.KernelMemory.Search.SearchClient.AskAsync(String index, String question, ICollection`1 filters, Double minRelevance, CancellationToken cancellationToken)
at ProgramHelper.AnswerQuestion(IKernelMemory memory, String question) in ~/MLBackend/ProgramHelper.cs:line 110
at Program.<Main>$(String[] args) in ~/MLBackend/Program.cs:line 32
at Program.<Main>(String[] args)
I don't believe this was an issue when I was using Mistral, but it started happening when I switched over to the embedding model, specifically the F32 variant.
Could you try running this:
var parameters = new ModelParams("your_model_path");
using var model = LLamaWeights.LoadFromFile(parameters);
Console.WriteLine(model.NewlineToken);
The code that's crashing is this:
var nl_token = model.NewlineToken;
var nl_logit = logits[(int)nl_token];
So it seems like your model is probably returning something unexpected for the newline token.
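For illustration, with a made-up vocab size, this is exactly the failing pattern (indexing an array with -1 throws IndexOutOfRangeException):
var logits = new float[32000];           // made-up vocab size, for illustration only
var nl_token = -1;                       // what your model appears to report for the newline token
var nl_logit = logits[(int)nl_token];    // System.IndexOutOfRangeException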
I see, it's returning -1, which explains the IndexOutOfRange. Is this an issue with the model itself?
I'm not certain, but it doesn't seem correct for any model to return -1 for the newline token. That would mean the model has no concept of newlines, which is pretty bizarre! If other quantizations of the same model return other values and it's just the F32 one that returns -1, I would say that's certainly an error in the F32 file.
I'm not sure about this yet, but not having a newline token seems to be common for embedding models. For nomic I tested F32, F16, and Q2_K; I then also tried this model, and they all return -1 for their newline token.
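A loop like this reproduces the check (just a sketch; the file names are placeholders and the loading code matches the snippet above):
foreach (var path in new[] { "nomic-embed-f32.gguf", "nomic-embed-f16.gguf", "nomic-embed-q2_k.gguf" })
{
    using var model = LLamaWeights.LoadFromFile(new ModelParams(path));
    Console.WriteLine($"{path}: newline token = {(int)model.NewlineToken}");   // -1 for every one of these
}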
If multiple models are showing the same thing, I guess that must be normal. Very weird! In that case I think NewlineToken should be updated to return LLamaToken? instead of LLamaToken, and all call sites fixed to handle it sometimes being null.
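Roughly the shape I have in mind, as a standalone sketch (the types here are stand-ins for illustration, not the actual LLamaSharp internals):
using System;

class NewlineTokenSketch
{
    // Stand-in for what the model reports: -1 means "the vocab has no newline token".
    static int rawNewlineToken = -1;

    // Proposed shape: expose the token as nullable so callers must handle the missing case.
    static int? NewlineToken => rawNewlineToken < 0 ? (int?)null : rawNewlineToken;

    static void Main()
    {
        var logits = new float[32000];

        // Guarded call site: only index into logits when the token actually exists.
        var nl_token = NewlineToken;
        float? nl_logit = nl_token.HasValue ? logits[nl_token.Value] : (float?)null;

        Console.WriteLine(nl_logit.HasValue ? $"nl logit = {nl_logit}" : "no newline token in this vocab");
    }
}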
It could be that this is intended behaviour. The models I've been testing are for generating embeddings, so it makes sense that they don't have a newline token, since they are never expected to generate text. Doing this
var memory = new KernelMemoryBuilder()
    .WithLLamaSharpTextGeneration(llamaGenerationConfig)
    .WithLLamaSharpTextEmbeddingGeneration(llamaEmbeddingConfig)
resolves the issue, using the embedding model to generate the embeddings and a regular model to generate the output.
WithLLamaSharpDefaults assumes a regular model that is capable of both.
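For completeness, the full setup looks roughly like this (a sketch filling in the pieces around the snippet above; the model paths are placeholders, and LLamaSharpConfig and Build<MemoryServerless>() are the usual LLamaSharp.KernelMemory and Kernel Memory builder pieces):
using LLamaSharp.KernelMemory;
using Microsoft.KernelMemory;

// One config per model: a regular chat model for generation, the embedding model for embeddings.
// The paths are placeholders.
var llamaGenerationConfig = new LLamaSharpConfig("path/to/mistral-7b.Q4_K_M.gguf");
var llamaEmbeddingConfig = new LLamaSharpConfig("path/to/nomic-embed-text.f32.gguf");

var memory = new KernelMemoryBuilder()
    .WithLLamaSharpTextGeneration(llamaGenerationConfig)
    .WithLLamaSharpTextEmbeddingGeneration(llamaEmbeddingConfig)
    .Build<MemoryServerless>();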
In the #662 PR I've modified how tokens are returned from the LLamaSharp API so that it returns nullable tokens, and fixed all of the call sites to handle this. I think your approach there is the right one, though.
https://github.com/SciSharp/LLamaSharp/commit/c325ac91279428e8d88719357b93e43cd9c5ed52#commitcomment-141108660
The same issue happens when using the SemanticKernel integration's ITextGenerationService with an embedding model (nomic).
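Roughly the path that hits it (a sketch, not my exact code; the adapter class name and model path are assumptions about the LLamaSharp.SemanticKernel package):
using LLama;
using LLama.Common;
using LLamaSharp.SemanticKernel.TextCompletion;
using Microsoft.SemanticKernel.TextGeneration;

// Wrap a stateless executor as an ITextGenerationService and prompt it with an embedding-only model.
var parameters = new ModelParams("path/to/nomic-embed-text.f32.gguf");
using var weights = LLamaWeights.LoadFromFile(parameters);
var executor = new StatelessExecutor(weights, parameters);

ITextGenerationService service = new LLamaSharpTextCompletion(executor);
var result = await service.GetTextContentsAsync("Hello");   // fails with the same IndexOutOfRangeException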