
What is the fastest way to run inference with an LLM-based embedder?

yanfan0531 opened this issue 8 months ago

Hi, I was looking at the BAAI/bge-multilingual-gemma2 model.

When I ran inference on a GPU via transformers, I found it very slow: it takes several seconds to encode a single sentence. Is that normal? How long should it typically take to get an embedding?
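(For context, one common reason single-sentence encoding feels slow is that each forward pass pays a fixed overhead, so batching sentences amortizes it. The numbers below are purely hypothetical, for illustration only, not measurements of bge-multilingual-gemma2:)

```python
# Toy cost model: a large LLM-based embedder pays a fixed overhead per
# forward pass (kernel launches, tokenization, host<->device transfers)
# plus a marginal per-sentence compute cost within the batch.
# These constants are made up for illustration.
FIXED_OVERHEAD_S = 0.5   # assumed fixed cost per forward pass
PER_SENTENCE_S = 0.05    # assumed marginal cost per sentence in a batch

def encode_cost(num_sentences: int, batch_size: int) -> float:
    """Estimated wall-clock seconds to embed num_sentences sentences."""
    full_batches, remainder = divmod(num_sentences, batch_size)
    calls = full_batches + (1 if remainder else 0)
    return calls * FIXED_OVERHEAD_S + num_sentences * PER_SENTENCE_S

# Encoding 64 sentences one at a time vs. in batches of 32:
one_by_one = encode_cost(64, batch_size=1)   # 64 forward passes
batched = encode_cost(64, batch_size=32)     # 2 forward passes
print(f"one-by-one: {one_by_one:.1f}s, batched: {batched:.1f}s")
# → one-by-one: 35.2s, batched: 4.2s
```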

I noticed that FlagEmbedding and sentence_transformers can also be used for inference. Which one is fastest?

Would vLLM help here?

yanfan0531, Apr 21 '25