llm
Embedding extraction
Implements #56.
I ported the llama.cpp code to allow extracting word embeddings and logits from a call to `evaluate`. I validated this using an ad-hoc test (currently hard-coded in `main`) and the results seem to make sense: the dot product between two embeddings is higher the more similar the two words are, which is exactly how embeddings should work.
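The sanity check described above can be sketched as follows. This is a toy illustration with made-up 4-dimensional vectors, not actual 4096-dimensional LLaMA embeddings:

```rust
/// Dot product of two embedding vectors. A higher value for
/// semantically related words is the sanity check described above.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    // Toy "embeddings"; real LLaMA embeddings are 4096-dimensional.
    let cat: [f32; 4] = [0.9, 0.1, 0.0, 0.2];
    let dog: [f32; 4] = [0.8, 0.2, 0.1, 0.1];
    let car: [f32; 4] = [0.0, 0.9, 0.1, 0.0];
    // "cat" should be closer to "dog" than to "car".
    assert!(dot(&cat, &dog) > dot(&cat, &car));
    println!("cat·dog = {}, cat·car = {}", dot(&cat, &dog), dot(&cat, &car));
}
```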
This serves as a proof of concept, but we need to discuss the API before we can merge. Currently, I added an `EvaluateOutputRequest` struct so we can expand this in the future, allowing retrieval of other interesting bits of the inference process. However, these values are not easily obtainable using the regular APIs (i.e. `feed_prompt`, `infer_next_token`). I'm not sure if that's a problem: are we OK with users having to drop down to the lower-level `evaluate` function when they need to retrieve this kind of information?
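For discussion purposes, here is a rough sketch of what such an output-request struct could look like. The field names below are illustrative assumptions, not the exact definition from this PR:

```rust
/// Hypothetical sketch of an output-request struct. Fields default to
/// `None`, so adding new fields later stays backwards-compatible for
/// callers that use `..Default::default()`.
#[derive(Default)]
struct EvaluateOutputRequest {
    /// Filled with the embeddings for the evaluated tokens, if requested.
    embeddings: Option<Vec<f32>>,
    /// Filled with the logits for every evaluated token, if requested.
    all_logits: Option<Vec<f32>>,
}

fn main() {
    // A caller opts into embeddings only; a call like
    // `session.evaluate(&params, &tokens, &mut request)` would then
    // fill in the requested fields.
    let request = EvaluateOutputRequest {
        embeddings: Some(Vec::new()),
        ..Default::default()
    };
    assert!(request.embeddings.is_some());
    assert!(request.all_logits.is_none());
}
```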
On a different note, I would really like for someone with a bit of understanding to validate that the results here are correct. Perhaps @hlhr202 can shed some light there?
Finally, should we consider exposing this in `llama-cli` at all?
LGTM once the other review feedback's sorted out.
For exposing it from the CLI, I'm not sure... people might use it as a process in a CLI pipeline (getting the embeddings of two texts and then comparing them), but I'm not sure what that would look like or how people would do that. (What output format would we use?)
Unless someone can suggest a "standard" output format for this, I'd suggest leaving it out for now and figuring it out later.
It would be nice for me to have such a get-embedding function exposed in the library crate; I don't care much about CLI exposure. From what I've seen, llama.cpp provides an `--embedding` parameter for output purposes, but they still haven't found a way to expose it, which is why I still can't get the embeddings from their CLI. I've only tested a few cases comparing against OpenAI's embeddings. There is some difference, but I think that's caused by the different model.
@hlhr202 The CLI is just a consumer of the library crate, so when using the library you'll be able to get the embeddings.
Yes, absolutely. Since I'm porting llama-rs to llama-node, I just need the public library function exposed. It doesn't make sense to expose embeddings in the CLI anyway.
I already addressed the review feedback and removed the ad-hoc test code. So I take it a good plan now would be to merge this as-is and have embedding extraction as a low-level feature of llama-rs, but simply not expose it to the CLI?
Since I added the `--dump-prompt-tokens` option, you can probably guess I like exposing information. :) I know people have asked about being able to show the embeddings with llama-cli, so it does seem like there's some demand for it in a CLI.
If there's demand, I'm happy to do so - just not sure what the output format should be. JSON array or newline-delimited floats?
Is it a lot of data? You could probably just print it in the normal Rust debug format, which should look like a comma-separated list if it's in a `Vec` or something. That should be pretty easy to transform to other formats without needing to write extra code or pull in dependencies.
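For illustration, the `Debug` format of a `Vec<f32>` is indeed a bracketed, comma-separated list that is easy to post-process with standard shell tools:

```rust
fn main() {
    let embedding: Vec<f32> = vec![0.25, -1.5, 3.0];
    // `{:?}` is the Debug format; for a Vec it prints a bracketed,
    // comma-separated list of the elements.
    let printed = format!("{:?}", embedding);
    assert_eq!(printed, "[0.25, -1.5, 3.0]");
    println!("{}", printed);
}
```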
This is the related issue: https://github.com/ggerganov/llama.cpp/issues/224 (there was actually only one person who wanted it as an option)
> Is it a lot of data?
It is quite a lot of data for comfortably printing to stdout. It's 4096 floats per token. Not that it wouldn't work, but it's a bit uncomfortable.
Ahh, then seems like it probably isn't worth even bothering to add to the CLI right now unless someone comes here and requests it. Or they could probably just write their own little utility to load a model, feed a prompt and print out embeddings however they wanted.
I would love that in the CLI! Perhaps with a parameter that specifies an output file. I need the embeddings to build a vector database based on some local files. Any chance you could take a look? It has been many years since I programmed C/C++.
The vector is around 4096 elements long for one token, so it's not very suitable for printing nicely in a CLI. I guess you need to call it through the Rust API.
I'm open to adding a way for the CLI to output embeddings if people find this is an interesting use case. The main blocker here is that the use case is not clear to me and thus I can't figure out the right API and output format.
What we need here is someone who understands how embeddings in an LLM like LLaMA work, has a clear use case for extracting them, and can tell us how they would expect an API like this to work. If anyone wants to open an issue with a clear description of what we need to provide, I'd be happy to add an implementation :slightly_smiling_face:
@setzer22 I have made a new embedding extraction example; you can check it here: https://github.com/hlhr202/llama-node/blob/develop/packages/core/example/semantic-compare/compare.py I noticed that llama.cpp uses "\n" as the end token, so I do the same. The results are quite close to OpenAI's text-embedding-ada-002.
I'm working on a large dense-vector embedding database (about 2 million data points from books), which is currently using OpenAI's Ada embeddings (~1600 dimensions). I can do a comparison of performance between those and the 4k LLaMa embeds if needed.
> has a clear use case for extracting them and can tell us how they would expect an API like this to work
From an ops perspective, ideally one could provide a batch input and get a batch output (just like OpenAI's API) via CLI. The format doesn't matter much - it can be JSONL or a binary format. I'd personally recommend sticking to those two, since that is supported by most VSS databases (e.g. Redis RediSearch).
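To make the suggestion concrete, here is a minimal sketch of emitting a batch of embeddings as JSONL, one object per input. The `id`/`embedding` field names are just an assumption, and a real implementation would likely use serde_json rather than hand-writing JSON as done here to stay dependency-free:

```rust
/// Serialize one embedding as a JSONL record. Hand-written JSON to
/// avoid dependencies; serde_json would be the realistic choice.
fn to_jsonl_line(id: usize, embedding: &[f32]) -> String {
    let values: Vec<String> = embedding.iter().map(|v| v.to_string()).collect();
    format!("{{\"id\":{},\"embedding\":[{}]}}", id, values.join(","))
}

fn main() {
    // One (tiny, toy) embedding per input text in the batch.
    let batch = vec![vec![0.1_f32, 0.2], vec![0.3, 0.4]];
    for (i, emb) in batch.iter().enumerate() {
        // Each line is a self-contained JSON object, ready for bulk
        // loading into a vector store.
        println!("{}", to_jsonl_line(i, emb));
    }
}
```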
My use case here is: if you have a set of documents and you can get the embeddings of those documents, then whenever a new question comes in, you can embed the question and find the most relevant documents to send along with your prompt. So basically you can have a natural Q&A chat bot based on your own data.
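That retrieval flow can be sketched as a simple cosine-similarity search over document embeddings. The file names and 3-dimensional vectors below are toy placeholders; a real index would hold one 4096-dimensional vector per chunk of your documents:

```rust
/// Cosine similarity between two embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn main() {
    // Toy document embeddings (placeholder names and vectors).
    let docs = vec![
        ("intro.txt", vec![0.9_f32, 0.1, 0.0]),
        ("faq.txt", vec![0.1_f32, 0.9, 0.2]),
    ];
    // Embedding of the incoming question.
    let question = vec![0.2_f32, 0.8, 0.1];
    // Pick the document most similar to the question; its text would
    // then be sent along with the prompt.
    let best = docs
        .iter()
        .max_by(|a, b| {
            cosine(&a.1, &question)
                .partial_cmp(&cosine(&b.1, &question))
                .unwrap()
        })
        .unwrap();
    assert_eq!(best.0, "faq.txt");
    println!("most relevant: {}", best.0);
}
```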