llm
Embedding extraction
Implements #56.
I ported the llama.cpp code to allow extracting word embeddings and logits from a call to `evaluate`. I validated this using an ad-hoc test (currently hard-coded in `main`) and the results seem to make sense: the dot product between two embeddings is higher the more similar the two words are, which is exactly how embeddings should work.
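The sanity check described above can be sketched as follows. This is a toy illustration with made-up 4-dimensional vectors, not actual 4096-dimensional LLaMA embeddings:

```rust
/// Dot product of two embedding vectors. A higher value for
/// semantically related words is the sanity check described above.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    // Toy "embeddings"; real LLaMA embeddings are 4096-dimensional.
    let cat: [f32; 4] = [0.9, 0.1, 0.0, 0.2];
    let dog: [f32; 4] = [0.8, 0.2, 0.1, 0.1];
    let car: [f32; 4] = [0.0, 0.9, 0.1, 0.0];
    // "cat" should be closer to "dog" than to "car".
    assert!(dot(&cat, &dog) > dot(&cat, &car));
    println!("cat·dog = {}, cat·car = {}", dot(&cat, &dog), dot(&cat, &car));
}
```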
This serves as a proof of concept, but we need to discuss the API before we can merge. Currently, I added an `EvaluateOutputRequest` struct so we can expand this in the future, allowing retrieval of other interesting bits of the inference process. However, these values are not easily obtainable using the regular APIs (i.e. `feed_prompt`, `infer_next_token`). I'm not sure if that's a problem: are we OK with users having to drop down to the lower-level `evaluate` function when they need to retrieve this kind of information?
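For discussion purposes, here is a rough sketch of what such an output-request struct could look like. The field names below are illustrative assumptions, not the exact definition from this PR:

```rust
/// Hypothetical sketch of an output-request struct. Fields default to
/// `None`, so adding new fields later stays backwards-compatible for
/// callers that use `..Default::default()`.
#[derive(Default)]
struct EvaluateOutputRequest {
    /// Filled with the embeddings for the evaluated tokens, if requested.
    embeddings: Option<Vec<f32>>,
    /// Filled with the logits for every evaluated token, if requested.
    all_logits: Option<Vec<f32>>,
}

fn main() {
    // A caller opts into embeddings only; a call like
    // `session.evaluate(&params, &tokens, &mut request)` would then
    // fill in the requested fields.
    let request = EvaluateOutputRequest {
        embeddings: Some(Vec::new()),
        ..Default::default()
    };
    assert!(request.embeddings.is_some());
    assert!(request.all_logits.is_none());
}
```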
On a different note, I would really like for someone with a bit of understanding to validate that the results here are correct. Perhaps @hlhr202 can shed some light there?
Finally, should we consider exposing this in `llama-cli` at all?
LGTM once the other review feedback's sorted out.
For exposing it from the CLI, I'm not sure... people might use it as a process in a CLI pipeline (getting the embeddings of two texts and then comparing them), but I'm not sure what that would look like or how people would do that. (What output format would we use?)
Unless someone can suggest a "standard" output format for this, I'd suggest leaving it out for now and figuring it out later.
It would be nice for me to have such a get-embedding function exposed in the library crate; I don't care much about CLI exposure. From what I've seen, llama.cpp provides an `--embedding` parameter for output purposes, but they still haven't found a way to expose it, which is why I still can't get the embeddings from their CLI. I've only tested a few cases comparing against OpenAI's embeddings. There is some difference, but I think that's caused by the different model.
@hlhr202 The CLI is just a consumer of the library crate, so when using the library you'll be able to get the embeddings.
Yes, absolutely. Since I'm porting llama-rs to llama-node, I just need the public library function exposed. It doesn't make sense to expose embeddings in the CLI anyway.
I already addressed the review feedback and removed the ad-hoc test code. So I take it a good plan now would be to merge this as-is and have embedding extraction as a low-level feature of llama-rs, but simply not expose it to the CLI?
Since I added the `--dump-prompt-tokens` option, you can probably guess I like exposing information. :) I know people have asked about being able to show the embeddings with llama-cli, so it does seem like there's some demand for it in a CLI.
If there's demand, I'm happy to do so - just not sure what the output format should be. JSON array or newline-delimited floats?
Is it a lot of data? You could probably just print it in the normal Rust debug format, which should look like a comma-separated list if it's in a `Vec` or something. That should be pretty easy to transform to other formats without needing to write extra code or pull in dependencies.
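For illustration, the `Debug` format of a `Vec<f32>` is indeed a bracketed, comma-separated list that is easy to post-process with standard shell tools:

```rust
fn main() {
    let embedding: Vec<f32> = vec![0.25, -1.5, 3.0];
    // `{:?}` is the Debug format; for a Vec it prints a bracketed,
    // comma-separated list of the elements.
    let printed = format!("{:?}", embedding);
    assert_eq!(printed, "[0.25, -1.5, 3.0]");
    println!("{}", printed);
}
```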
This is the related issue: https://github.com/ggerganov/llama.cpp/issues/224 (there was actually only one person who wanted it as an option)
> Is it a lot of data?
It is quite a lot of data for comfortably printing to stdout. It's 4096 floats per token. Not that it wouldn't work, but it's a bit uncomfortable.
Ahh, then seems like it probably isn't worth even bothering to add to the CLI right now unless someone comes here and requests it. Or they could probably just write their own little utility to load a model, feed a prompt and print out embeddings however they wanted.
I would love that in the CLI! Perhaps with a parameter that specifies an output file. I need the embeddings to build a vector database based on some local files. Any chance you could take a look? It has been many years since I programmed C/C++.
The vector is around 4096 elements long for one token, so it's not very suitable for printing nicely in a CLI. I guess you need to call it through the Rust API.
I'm open to adding a way for the CLI to output embeddings if people find this is an interesting use case. The main blocker here is that the use case is not clear to me and thus I can't figure out the right API and output format.
What we need here is someone who understands how embeddings in an LLM like LLaMA work, has a clear use case for extracting them, and can tell us how they would expect an API like this to work. If anyone wants to open an issue with a clear description of what we need to provide, I'd be happy to add an implementation :slightly_smiling_face:
@setzer22 I have made a new embedding extraction example; you can check it here: https://github.com/hlhr202/llama-node/blob/develop/packages/core/example/semantic-compare/compare.py I noticed that llama.cpp uses "\n" as the end token, so I do the same. The results are quite close to OpenAI's text-embedding-ada-002.
I'm working on a large dense-vector embedding database (about 2 million data points from books), which is currently using OpenAI's Ada embeddings (~1600 dimensions). I can do a comparison of performance between those and the 4k LLaMa embeds if needed.
> has a clear use case for extracting them and can tell us how they would expect an API like this to work
From an ops perspective, ideally one could provide a batch input and get a batch output (just like OpenAI's API) via CLI. The format doesn't matter much - it can be JSONL or a binary format. I'd personally recommend sticking to those two, since that is supported by most VSS databases (e.g. Redis RediSearch).
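To make the suggestion concrete, here is a minimal sketch of emitting a batch of embeddings as JSONL, one object per input. The `id`/`embedding` field names are just an assumption, and a real implementation would likely use serde_json rather than hand-writing JSON as done here to stay dependency-free:

```rust
/// Serialize one embedding as a JSONL record. Hand-written JSON to
/// avoid dependencies; serde_json would be the realistic choice.
fn to_jsonl_line(id: usize, embedding: &[f32]) -> String {
    let values: Vec<String> = embedding.iter().map(|v| v.to_string()).collect();
    format!("{{\"id\":{},\"embedding\":[{}]}}", id, values.join(","))
}

fn main() {
    // One (tiny, toy) embedding per input text in the batch.
    let batch = vec![vec![0.1_f32, 0.2], vec![0.3, 0.4]];
    for (i, emb) in batch.iter().enumerate() {
        // Each line is a self-contained JSON object, ready for bulk
        // loading into a vector store.
        println!("{}", to_jsonl_line(i, emb));
    }
}
```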
My use case here is: if you have a set of documents and you can get the embeddings of those documents, then whenever a new question comes in, you can embed the question and find the most relevant documents to send along with your prompt. So basically you can have a natural Q&A chat bot based on your own data.
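That retrieval flow can be sketched as a simple cosine-similarity search over document embeddings. The file names and 3-dimensional vectors below are toy placeholders; a real index would hold one 4096-dimensional vector per chunk of your documents:

```rust
/// Cosine similarity between two embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn main() {
    // Toy document embeddings (placeholder names and vectors).
    let docs = vec![
        ("intro.txt", vec![0.9_f32, 0.1, 0.0]),
        ("faq.txt", vec![0.1_f32, 0.9, 0.2]),
    ];
    // Embedding of the incoming question.
    let question = vec![0.2_f32, 0.8, 0.1];
    // Pick the document most similar to the question; its text would
    // then be sent along with the prompt.
    let best = docs
        .iter()
        .max_by(|a, b| {
            cosine(&a.1, &question)
                .partial_cmp(&cosine(&b.1, &question))
                .unwrap()
        })
        .unwrap();
    assert_eq!(best.0, "faq.txt");
    println!("most relevant: {}", best.0);
}
```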