
/v1/embeddings please

Open · yuhai-china opened this issue on Jun 21, 2023

When will the /v1/embeddings API be available? Thank you!
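
For reference, I mean the OpenAI-style endpoint. A minimal sketch of the request shape (the URL and model name are just examples; this route does not exist in vLLM yet):

```python
# Hypothetical request, matching the OpenAI /v1/embeddings API shape.
# The URL and model name are examples; vLLM does not serve this route today.
import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "text-embedding-ada-002", "input": "Hello world"},
)
# The OpenAI response format nests vectors under data[i].embedding.
print(resp.json()["data"][0]["embedding"][:8])
```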

yuhai-china avatar Jun 21 '23 08:06 yuhai-china

Hi! Adding support for returning embeddings is definitely on our roadmap. I also believe the modifications needed to support embeddings should not be very complicated, which makes this a very good first issue. If you're interested, feel free to contribute!

zhuohan123 avatar Jun 21 '23 10:06 zhuohan123

I looked into it, hoping to pick it up as a "good first issue", but did not find it straightforward to implement. I'm afraid any changes I made would just be hacks. If you have any pointers on how and where I could best add it, I'd be happy to give it a second look.

Vinno97 avatar Jun 26 '23 16:06 Vinno97

@zhuohan123 and @yuhai-china, are you talking about a multilingual or a monolingual model? I'm keen to contribute and would be interested in the sentence-transformers models from the HF Zoo.

bm777 avatar Jun 27 '23 22:06 bm777

@Vinno97, are you still working on it? I would love to help, because I'm interested in using it too.

bm777 avatar Jun 29 '23 07:06 bm777

No, I haven't come back to it. I had hoped I could just create a new endpoint that hooked into the model and returned the last hidden state, but I found that the LLMEngine was built so tightly around text generation that I didn't see a way to add embeddings easily and cleanly.

But do give it a try! I must admit I spent less than an hour looking into it.
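
For anyone who picks this up: outside of vLLM, the usual recipe is to mean-pool the last hidden state over the non-padding tokens. A minimal sketch with plain Hugging Face transformers (the model name is just an example):

```python
# Minimal sketch of turning a model's last hidden state into a sentence
# embedding via mean pooling, using plain Hugging Face transformers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

inputs = tokenizer(["Hello world"], padding=True, return_tensors="pt")
with torch.no_grad():
    last_hidden = model(**inputs).last_hidden_state  # (batch, seq, hidden)

# Mean-pool over real tokens only, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, seq, 1)
embedding = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, 384])
```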

Vinno97 avatar Jun 29 '23 13:06 Vinno97

@yuhai-china @Vinno97 @bm777 Thanks for your interest in this. I previously misunderstood this API as returning the hidden states of the generated sequence, which would be easy. However, it turns out that this API is for a completely different set of models (i.e., BERT-like embedding models).

vLLM currently focuses on autoregressive generation. For embedding models, neither paged attention nor continuous batching helps performance, so I think it's better to use other libraries for embeddings for now. In the future, when we extend the scope of vLLM, we will look into this again.
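
For example, the sentence-transformers library already covers this use case well. A minimal sketch (the model name is just an example):

```python
# Minimal sketch using the sentence-transformers library, as an alternative
# until vLLM supports embeddings.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["Hello world", "How are you?"])
print(embeddings.shape)  # (2, 384)
```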

zhuohan123 avatar Jun 29 '23 14:06 zhuohan123

Moving this issue to Discussions, as it's more of a longer-term plan.

zhuohan123 avatar Jun 29 '23 14:06 zhuohan123