Embeddings Change In April Update
See comments starting here for previous discussion: https://github.com/SciSharp/LLamaSharp/commit/c325ac91279428e8d88719357b93e43cd9c5ed52#commitcomment-141108660. Ping @zsogitbe.
I've been looking into the embeddings issue discovered by @zsogitbe. In the April update there was a change to the embeddings API which has broken things, and I'm wondering if we need a new API entirely.
There are now two methods for getting embeddings in llama.cpp:
- llama_get_embeddings_ith will get the embedding of the ith token which had logits = true in the input sequence. It seems that this gets the embedding of a single token and is intended for use with generative models.
- llama_get_embeddings_seq will get the embedding of an entire sequence. Intended for use with sequence embedding models.
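For orientation, here is a minimal sketch of how those two entry points look when called through LLamaSharp's NativeApi (the calls and Context/LLamaSeqId names are borrowed from the embedder code quoted further down; this is illustrative only, not the actual embedder implementation):

```csharp
unsafe
{
    // Per-token embedding: the index runs over the tokens that had logits = true in the batch
    var i = 0;
    var tokenEmbedding = NativeApi.llama_get_embeddings_ith(Context.NativeHandle, i);

    // Whole-sequence embedding: only returns non-null when pooling_type is not NONE
    var sequenceEmbedding = NativeApi.llama_get_embeddings_seq(Context.NativeHandle, LLamaSeqId.Zero);
}
```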
Our current LLamaEmbedder class is designed to work with the second type - you input an entire sequence and you get back a single embedding for that sequence. I think that's probably what most people want from an embedding system.
However, in the current state this means that using a generative model with the LLamaEmbedder is now broken. Although I suspect it was actually always broken - before it simply returned unexpected results (a single token embedding) and now it returns nothing.
Should we add a new type of embedder (e.g. LLamaTokenEmbedder) which is compatible with generative models and allows you to get something like Dictionary<LLamaToken, float[]> from a sequence?
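Purely as an illustration of that idea (none of these names exist in LLamaSharp today - the interface and method are made up for this sketch), the shape could be something like:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using LLama.Native; // assumed location of LLamaToken

// Hypothetical API sketch only - not part of LLamaSharp.
public interface ILLamaTokenEmbedder
{
    /// <summary>
    /// Run a generative model over the text with logits = true for every token,
    /// and return one embedding vector per input token.
    /// </summary>
    Task<IReadOnlyDictionary<LLamaToken, float[]>> GetTokenEmbeddings(
        string text,
        CancellationToken cancellationToken = default);
}
```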
First of all, we definitely need embedding generation with normal/generative models! I can also confirm that the old code was not broken and worked well for me with normal/generative models. It is also the main philosophy behind embeddings that they represent what generative models 'see' in the input text, so it is logical to use generative models (preferably the same model that is used for generating content).
It is possible that there are some specialized model types that are not generative models (like all-MiniLM-L12-v2.Q8_0.gguf, which generates an embedding for a sentence) and that llama.cpp may support these models somehow. Allowing embedding generation only with these non-generative models would be a bad approach; supporting both is a good approach.
Note: It is well known that llama models are not very good at generating well-formed embeddings, but llama.cpp/LLamaSharp support other model types as well.
llama_get_embeddings_seq will always return NULL if pooling_type is NONE. Sequence embeddings are supported only when pooling_type is not NONE! So, this could be one of the reasons for the NULL embedding with the new code.
We probably need to do the same as in the C++ code (which takes into account whether there is pooling):
// try to get sequence embeddings - supported only when pooling_type is not NONE
const float * embd = llama_get_embeddings_seq(ctx, batch.seq_id[i][0]);
if (embd == NULL) {
    embd = llama_get_embeddings_ith(ctx, i);
    if (embd == NULL) {
        fprintf(stderr, "%s: failed to get embeddings for token %d\n", __func__, i);
        continue;
    }
}
I don't think it's possible to support generative models with the current llama.cpp API, unless I'm misunderstanding something. I hope so, because I agree being able to use generative models for embedding would be useful!
The docs on llama_get_embeddings say:
When pooling_type == LLAMA_POOLING_TYPE_NONE or when using a generative model, the embeddings for which llama_batch.logits[i] != 0 are stored contiguously in the order they have appeared in the batch.
i.e. a generative model won't combine the embeddings into one single vector; instead it seems you can just get the embedding for each individual token. The current embedder only sets logits = true for the very last token, so presumably you'll just be getting the embedding for that single token?
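To make that concrete, here is a rough sketch (not existing LLamaSharp code; tokens, Context and the dictionary shape are assumptions) of how per-token embeddings could be read out of that contiguous buffer if logits were set to true for every token in the batch:

```csharp
unsafe
{
    // Assumes the batch set logits = true for every token, so the output buffer holds
    // tokens.Length embeddings of Context.EmbeddingSize floats, back to back.
    var perToken = new Dictionary<LLamaToken, float[]>();
    var embd = NativeApi.llama_get_embeddings(Context.NativeHandle);
    if (embd == null)
        return perToken;

    for (var i = 0; i < tokens.Length; i++)
    {
        // The i-th output starts at offset i * EmbeddingSize in the contiguous buffer
        perToken[tokens[i]] = new Span<float>(embd + i * Context.EmbeddingSize, Context.EmbeddingSize).ToArray();
    }
    return perToken;
}
```

(A dictionary keyed by token would collapse repeated tokens; a per-position list might be a better fit, but this mirrors the Dictionary<LLamaToken, float[]> idea above.)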
I tried using pooling=mean with llama-2-7b-chat but that just crashes.
C++ code...
Inspired by that same C++ code I modified the embedder to look like this earlier:
unsafe
{
    // Get the embedding for the whole sequence (for embedding models with pooling)
    var embeddings = NativeApi.llama_get_embeddings_seq(Context.NativeHandle, LLamaSeqId.Zero);

    // Get the embedding for the 0th item, we only set logits=true for one item so this seems like it should work.
    // **Always** returns null no matter how I test it!
    if (embeddings == null)
        embeddings = NativeApi.llama_get_embeddings_ith(Context.NativeHandle, 0);

    // This one gets me something, but as far as I can tell it should be identical to the above.
    if (embeddings == null)
        embeddings = NativeApi.llama_get_embeddings(Context.NativeHandle);

    if (embeddings == null)
        return Array.Empty<float>();

    return new Span<float>(embeddings, Context.EmbeddingSize).ToArray();
}
So that's a confusing result!
Probably relevant: https://github.com/ggerganov/llama.cpp/pull/6753
I have checked the C++ code and you need to change the index to -1! So, this is the correct code:
embeddings = NativeApi.llama_get_embeddings_ith(Context.NativeHandle, -1);
This should give the same result as llama_get_embeddings.
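If so, a quick sanity check (purely a sketch, reusing the names from the snippet above) would be to fetch the embedding both ways and compare:

```csharp
unsafe
{
    // Sketch only: the last-output embedding fetched via index -1 should match
    // the start of the buffer returned by llama_get_embeddings, since only one
    // token had logits = true.
    var viaIth = NativeApi.llama_get_embeddings_ith(Context.NativeHandle, -1);
    var direct = NativeApi.llama_get_embeddings(Context.NativeHandle);

    if (viaIth != null && direct != null)
    {
        var a = new Span<float>(viaIth, Context.EmbeddingSize);
        var b = new Span<float>(direct, Context.EmbeddingSize);
        System.Diagnostics.Debug.Assert(a.SequenceEqual(b));
    }
}
```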
I would also change the code to optimize for speed. So, this is what I would do:
- check if there is pooling
- check if there are several sequences
- if no pooling and only 1 sequence, then use llama_get_embeddings
- if there is pooling, use llama_get_embeddings_seq
- if there is more than one sequence, use llama_get_embeddings_ith
llama_get_embeddings_ith is the same as llama_get_embeddings but with overhead! So, if there is only 1 sequence, then the best is to use llama_get_embeddings.
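Putting that decision logic together, a rough sketch of the fast path could look like this (LLamaPoolingType, poolingType, sequenceCount and outputIndex are assumptions about what the surrounding embedder code would have available, not existing LLamaSharp fields):

```csharp
unsafe
{
    float* embeddings;

    if (poolingType != LLamaPoolingType.None)
    {
        // Pooling enabled: one pooled embedding per sequence
        embeddings = NativeApi.llama_get_embeddings_seq(Context.NativeHandle, LLamaSeqId.Zero);
    }
    else if (sequenceCount == 1)
    {
        // No pooling, single sequence: cheapest call, no index lookup overhead
        embeddings = NativeApi.llama_get_embeddings(Context.NativeHandle);
    }
    else
    {
        // No pooling, multiple sequences: index the specific output
        embeddings = NativeApi.llama_get_embeddings_ith(Context.NativeHandle, outputIndex);
    }

    if (embeddings == null)
        return Array.Empty<float>();

    return new Span<float>(embeddings, Context.EmbeddingSize).ToArray();
}
```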
If a new function becomes available in the future which can give us a pooled embedding (for example, llama_get_embeddings_mean_pooled), then we can incorporate that into the above logic.
I have added a PR with the correction which supports all types of models (handling of no pooling and several sequences is still needed).
I don't think it's possible to support generative models with the current llama.cpp API, unless I'm misunderstanding something. I hope so, because I agree being able to use generative models for embedding would be useful!
Martin, the most important use case for embeddings is memory (KernelMemory). A generative model embeds information in a vector database which can be recalled later (long term memory). The generative model regenerates (recalls) the information from long term memory.
As far as I know, the small model you use for testing is not a generative model and can only generate an embedding for a short sentence. That embedding can then be used, for example, to check similarity between sentences. Beyond that, you cannot really do much with these embeddings.
So, supporting embeddings with generative models is the main aim here and thus cannot be removed.
you need to change the index to -1!
Aha, thanks for looking into that!
Since a parameter of -N means "N from the end" that kind of makes sense - we're only setting logits=true for the very last token. I had thought 0 would work here because indexing is only over tokens where logits=true, but I guess not.
if there is more than one sequence use llama_get_embeddings_ith.
All of this discussion with the embedder has made me realise there's a lot of potential for improvements here, most importantly batched embeddings with multiple sequences. Is that something you might be interested in working on?
I have added a PR
I just reviewed it :)
So, supporting embeddings with generative models is the main aim here and thus cannot be removed.
Just to re-iterate, I'm not trying to remove anything here or make a value judgement on which models should and shouldn't be used. Just exposing whatever capabilities llama.cpp is exposing 😅
I think this embedding stuff was resolved a while ago so I'll close this issue now.