
[Feature] Return embeddings

Open darth-veitcher opened this issue 2 years ago • 32 comments

As the title indicates, I'd be interested in understanding whether this is just for text generation or whether it could also be used to expose the embedding function?

darth-veitcher avatar Apr 18 '23 20:04 darth-veitcher

For now it does not return the embeddings but this could be added in the future.

OlivierDehaene avatar Apr 19 '23 07:04 OlivierDehaene

Ah great. Thanks for the response @OlivierDehaene. The embeddings would be of interest for indexing content and subsequently using a vector store.

darth-veitcher avatar Apr 19 '23 09:04 darth-veitcher

Do you return an embedding for each token? I am not the most familiar with this use case.

OlivierDehaene avatar Apr 19 '23 09:04 OlivierDehaene

I’m specifically looking at the use case of indexing content and storing in something like pinecone or OpenSearch for subsequent querying and retrieval.

LangChain has a good overview in their indexes documentation, but essentially:

  • for each file;
  • split into chunks;
  • calculate embeddings for chunks;
  • save to vectorstore

As a result I'd need to have an embedding function available both for the initial calculation and storage, and then at a later point to embed the query the same way.

I think this is quite a common use case and pattern but I could be wrong.
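For anyone less familiar with the pattern, here is a minimal sketch of that pipeline, assuming sentence-transformers runs locally and using a plain Python list where Pinecone/OpenSearch would sit in practice (the model name and chunking are placeholders):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# placeholder embedding model; any sentence-transformers checkpoint works
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def chunk(text: str, size: int = 500) -> list[str]:
    # naive fixed-size character chunks; real splitters respect sentence/token boundaries
    return [text[i:i + size] for i in range(0, len(text), size)]

def index_document(doc_id: str, text: str, store: list) -> None:
    chunks = chunk(text)
    vectors = model.encode(chunks)  # one embedding vector per chunk
    for i, (c, v) in enumerate(zip(chunks, vectors)):
        store.append({"id": f"{doc_id}-{i}", "text": c, "vector": v})

def query(question: str, store: list, k: int = 3) -> list[dict]:
    q = model.encode([question])[0]
    # rank stored chunks by cosine similarity to the query embedding
    def score(r):
        return float(np.dot(q, r["vector"]) / (np.linalg.norm(q) * np.linalg.norm(r["vector"])))
    return sorted(store, key=score, reverse=True)[:k]
```

The same `model.encode` call serves both the indexing path and the query path, which is why having the embedding function exposed by the server matters.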

darth-veitcher avatar Apr 19 '23 10:04 darth-veitcher

Really looking forward to this feature.

sonsai123 avatar Apr 19 '23 21:04 sonsai123

Any update on this in terms of priority, effort, or timeline, @OlivierDehaene? Appreciate all the work so far and can see there have been a lot of commits since this was originally raised!

darth-veitcher avatar May 02 '23 07:05 darth-veitcher

Not sure if you all were looking for the return of embeddings from a decoder model (hidden state) or a dedicated implementation for things like sentence-transformers, but I started a fork of this repo to work with sentence-transformers.

It doesn't have model sharding or NCCL comms right now since none of the models in sentence-transformers are that large but hopefully we will support that some day!

https://github.com/Gage-Technologies/embedding-server

sam-ulrich1 avatar May 22 '23 18:05 sam-ulrich1

Would love to see an /embeddings endpoint for use with vector DBs like Pinecone, Weaviate, Faiss, Milvus, etc.

Hopefully a gentle bump and inspiration helps :)

Here are a couple of references for inspiration: https://milvus.io/docs/integrate_with_hugging-face.md https://platform.openai.com/docs/api-reference/embeddings
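For concreteness, this is roughly the request/response shape being asked for, modelled loosely on the OpenAI embeddings API linked above; the URL and response layout are assumptions, not an existing text-generation-inference route:

```python
import requests

# hypothetical endpoint; TGI does not expose this route today
resp = requests.post(
    "http://localhost:8080/embeddings",
    json={"input": ["first chunk of text", "second chunk of text"]},
    timeout=30,
)
resp.raise_for_status()
vectors = [item["embedding"] for item in resp.json()["data"]]
print(len(vectors), len(vectors[0]))  # number of inputs, embedding dimension
```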

M-Chris avatar Jul 08 '23 16:07 M-Chris

Hello, I am interested in implementing this feature. Any tips on the best pathway would be appreciated.

The focus would be around serving transformer-based dense embeddings.

jon-chuang avatar Jul 12 '23 04:07 jon-chuang

AFAIK, embeddings usually use very different models, and have very different properties. Including something here therefore doesn't make a whole lot of sense.

sentence-transformers https://www.sbert.net/ is the basic way to go, no? (With open-source models, unlike the OpenAI embedding models, which we can't ever serve.)

There might be ways to create optimized serving, but launching a simple Flask server in front of sentence-transformers should be enough, no? Those models are usually tiny compared to LLMs.
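To illustrate how little is needed for these small models, here is a minimal sketch of that kind of thin server in front of sentence-transformers, written with FastAPI rather than Flask; the model name and route are arbitrary, and there is no batching or queuing:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
# placeholder model; swap in whatever suits your use case
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

class EmbedRequest(BaseModel):
    inputs: list[str]

@app.post("/embed")
def embed(req: EmbedRequest):
    # encode() handles tokenization and pooling internally
    vectors = model.encode(req.inputs)
    return {"embeddings": [v.tolist() for v in vectors]}
```

Run it with `uvicorn app:app` (assuming the file is named `app.py`) and POST a JSON body like `{"inputs": ["hello world"]}` to `/embed`.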

Narsil avatar Jul 12 '23 09:07 Narsil

Hi @Narsil, I suppose you may be right that this is not necessarily the best framework. I was hoping for an out-of-the-box experience with:

  1. built in http endpoints
  2. queuing
  3. batched inference
  4. optimized concurrent serving (including choosing the right concurrency, choosing the right serving runtime e.g. ONNX)
  5. huggingface (& sentence-bert) integration.

There is an article by Vespa.ai on optimizing concurrent serving. Any tips on the right framework for serving embeddings (especially one integrated with huggingface) would be appreciated.

jon-chuang avatar Jul 12 '23 10:07 jon-chuang

That being said, if no framework exists which fits these requirements, it doesn't sound far-fetched that one could build upon the work in this repo. Serving sentence-bert models would be necessary.

jon-chuang avatar Jul 12 '23 10:07 jon-chuang

That being said, if no framework exists which fits these requirements, it doesn't sound far-fetched that one could build upon the work in this repo.

If you look up in the thread you'll see a link to a project specifically for what you're asking. It's an embedding server derived from this repo. We'd love your help improving it. Right now it does get maintenance, but on an as-needed basis. With that said, it does work for any model that can be used with the sentence-transformers library. What we really need is to finish the actions pipeline to roll out Docker images. If you manually build the Docker image, it will run.

sam-ulrich1 avatar Jul 12 '23 11:07 sam-ulrich1

@sam-ulrich1 I am simply afraid that since it is currently maintained for a single company, there is not enough visibility and long-term support to be worth investing in it and recommending it to users of LLM application frameworks (such as LlamaIndex).

It would be great if you could break down what you have managed to achieve with your fork and whether there might exist a pathway to merging it into this repo. Of course @Narsil and @OlivierDehaene would have to agree that it is a useful enough feature.

It does seem that quite a handful of users are interested in it. If the pathway is not complex, it seems like a win for the community.

jon-chuang avatar Jul 12 '23 11:07 jon-chuang

Hate to say it but there's no chance it would get merged. It's a hard fork. With that said, the easiest way to make sure it stays supported is to help out!

sam-ulrich1 avatar Jul 12 '23 11:07 sam-ulrich1

Hate to say it but there's no chance it would get merged

What I mean of course is to extract out the key changes and to contribute PRs to this repo.

jon-chuang avatar Jul 12 '23 11:07 jon-chuang

Given the breadth of changes from causal language modeling to encoder models, I don't think it's likely that the team here at text-generation-inference would accept a PR for it (I don't speak for them, just speculating).

Embedding generation works very differently from the intended use case of this repo. With that said, you (or anyone else) are more than welcome to look over our repo and make a PR.

sam-ulrich1 avatar Jul 12 '23 13:07 sam-ulrich1

Code complexity for something related to embeddings should be... MUCH smaller (there's no decode, no past key values, no paged attention).

I think flash attention would be the main asset and classic dynamic batching should work great.
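As a rough illustration of what "classic dynamic batching" means for embeddings (a sketch under stated assumptions, not how TGI or any shipped server implements it): incoming requests wait in a queue for a few milliseconds and are encoded together in one forward pass.

```python
import asyncio
from sentence_transformers import SentenceTransformer

# placeholder model; in a real server the encode call would run in a thread/executor
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
queue: asyncio.Queue = asyncio.Queue()

async def batch_worker(max_batch: int = 32, max_wait_s: float = 0.01):
    loop = asyncio.get_running_loop()
    while True:
        # block until the first request arrives, then briefly gather more
        batch = [await queue.get()]
        deadline = loop.time() + max_wait_s
        while len(batch) < max_batch:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        # one forward pass for the whole batch; there is no decode step
        vectors = model.encode([text for text, _ in batch])
        for (_, fut), vec in zip(batch, vectors):
            fut.set_result(vec)

async def embed(text: str):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((text, fut))
    return await fut
```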

Narsil avatar Jul 12 '23 13:07 Narsil

Ok, thanks folks. I will look into simpler solutions and look out for flash attention and dynamic batching.

jon-chuang avatar Jul 12 '23 18:07 jon-chuang

@jon-chuang I'm not sure if this is what you were thinking, but it's probably easiest to add embeddings mostly in parallel (rather than deeply built in) to tgi.

  • use a model fine-tuned or prompted for function calling, specifically with a function called search_vector_database, which would have an input argument that would be the user's message.
  • write a search_vector_database function on the server side so that it vectorises (probably with a simple tokenizer like sentencepiece) the user's message and then does a cosine similarity search of whatever pre-vectorised docs you want (which would also have to be server side); see the sketch after this list
  • modify the tgi code (or, probably easier, the chat-ui or other UI code) so that it checks the assistant response for a function call with search_vector_database in it and, if so, makes a call to that function. The result from the cosine similarity search should then automatically be fed back in (along with the user's query) to the language model. Lastly, the LLM's response would be directed to the UI for the user to see it.

It may actually be better to handle all of this logic in the UI code, for example in chat-ui, and just use tgi as an API for feeding in input text.
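A hypothetical version of the server-side `search_vector_database` function from the outline above, using an embedding model for the vectorisation step rather than a raw tokenizer; the model name and the pre-computed document vectors are assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# placeholder model; docs and doc_vectors are assumed to be pre-computed server side
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def search_vector_database(user_message: str, docs: list[str],
                           doc_vectors: np.ndarray, k: int = 3) -> list[str]:
    q = model.encode([user_message])[0]
    # cosine similarity between the query vector and every pre-computed doc vector
    sims = (doc_vectors @ q) / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(-sims)[:k]]
```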

RonanKMcGovern avatar Aug 28 '23 17:08 RonanKMcGovern

In the coming weeks we will ship a new serving container that only does text embeddings, focused on serverless deployment and dynamic batching, and using our new Candle library.

OlivierDehaene avatar Sep 06 '23 13:09 OlivierDehaene

Hi @OlivierDehaene, this seems really interesting. Do you have a target release date for this text-embeddings server? We would be glad to beta test it at Credit Mutuel Arkea :)

Benvii avatar Sep 15 '23 09:09 Benvii

btw @OlivierDehaene what layers are you using for the embeddings? Just the first layer?

RonanKMcGovern avatar Sep 15 '23 12:09 RonanKMcGovern

@RonanKMcGovern embeddings are done through dedicated models; here is a leaderboard we have for these: https://huggingface.co/spaces/mteb/leaderboard (Always take leaderboards and benchmarks with a pinch of salt; your use case is rarely the benchmark under test.)

@Benvii nice!

It's coming along nicely so far!

Narsil avatar Sep 20 '23 15:09 Narsil

Ok, thanks @Narsil. I naively just tested by using the first layer of Llama and it works pretty OK, but yeah, I imagine specialised models are better.
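For reference, the kind of experiment described here can be reproduced with something like the sketch below: pull the hidden states out of a decoder-only model and mean-pool the first layer's output into a single vector. The model name is a placeholder, and dedicated embedding models from the MTEB leaderboard will generally do better.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # hypothetical; any causal LM works, gated models need auth
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[0] is the token embeddings, hidden_states[1] the first layer's output;
    # mean-pool over the sequence dimension to get a single vector
    return out.hidden_states[1].mean(dim=1).squeeze(0)
```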

RonanKMcGovern avatar Sep 20 '23 15:09 RonanKMcGovern

Hi @OlivierDehaene,

We are planning the deployment of self-hosted models using text-generation-inference, but we would love to have the new text-embeddings service that you just mentioned. Our complete use case requires RAG with open-source LLMs and embeddings. Here at Adyen we would also like to be early adopters or beta testers, and we would like to contribute back to the project if there is an opportunity for it.

rahermur avatar Oct 04 '23 13:10 rahermur

@rahermur @OlivierDehaene is finishing it up, but we're seeing quite nice performance atm, and we're leveraging Candle for maximum performance (embedding models tend to be small, so the CPU bottleneck is even more noticeable than with LLMs).

Narsil avatar Oct 05 '23 14:10 Narsil

FYI, I just created a small project called infinity. It's a lightweight async implementation using fastapi and pydantic for input validation, with torch and ctranslate2 under the hood; it performs dynamic batching via async and is under the MIT licence.

michaelfeil avatar Oct 11 '23 18:10 michaelfeil

FYI, I just created a small project called infinity. It's a lightweight async implementation using fastapi and pydantic for input validation, with torch and ctranslate2 under the hood; it performs dynamic batching via async and is under the MIT licence.

Hi,

Great stuff. Looks like this could be the solution. I haven't gotten it working yet, though. Two questions:

  • When I run a curl from Swagger I get an oddly huge reply. What's wrong there, please? [screenshot attached]

  • How do I find out which (mini) model to select? How do I see from e.g. https://huggingface.co/h2oai/h2ogpt-4096-llama2-7b-chat or https://huggingface.co/h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b which embedding was used? I guess I will then be able to search for a small model to run 'infinity' with. (Yes, this is a general question. It would be great if it were answered here anyway.)

BTW this is my docker-compose snippet:

  infinity:
    image: michaelf34/infinity:latest 
    ports:
      - 8081:8080
    volumes:
      - ./torch:/app/.cache/torch/
    command: --model-name-or-path sentence-transformers/all-MiniLM-L6-v2 --port 8080 --engine ctranslate2
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: ["gpu"]

and these are a few lines from my langchain app. I used the OllamaEmbeddings. (Therefore, I also need a slight HTTP rewrite rule, which is why the port doesn't match the docker-compose.)

import { OllamaEmbeddings } from "langchain/embeddings/ollama";
const embeddings = new OllamaEmbeddings({
  model: "all-MiniLM-L6-v2", // default value
  baseUrl: "http://" + process.env.LLM_ADDRESS + ":80", // default value
});

Does someone have a better solution in mind? Not urgent; it looks like this is working, though not too nicely.

ludwigprager avatar Oct 12 '23 17:10 ludwigprager

@ludwigprager I did not want to "hijack" this issue; in case you have questions -> https://github.com/michaelfeil/infinity/issues

tl;dr: Thanks for the docker-compose snippet. Sorry, wrong usage. Infinity is a drop-in replacement for e.g. OpenAI embeddings: https://platform.openai.com/docs/guides/embeddings/what-are-embeddings. In short, for text embeddings you want to deploy any of these models (not falcon/llama!!!): https://huggingface.co/spaces/mteb/leaderboard. The results you see are vectors, not auto-regressive text.

michaelfeil avatar Oct 12 '23 18:10 michaelfeil