
First working PoC for bge-m3 sparse embeddings

Open maxdebayser opened this issue 8 months ago • 7 comments

FIX: https://github.com/vllm-project/vllm/issues/13609 FIX #15384

Here I'm loading the extra sparse_linear.pt file using the secondary_weights loading introduced in the ultravox model when I detect that the model name is BAAI/bge-m3. It's a bit ugly but I don't know if there is a more generic way to do this.

Currently, since the only permissible pooling return type is torch.Tensor, I'm just returning the token weights tensor directly. If the user wants to match tokens to the weights, they have to call tokenize and remove the BOS and EOS tokens; then the indices of both vectors should match.

To request sparse vectors, the user has to pass "additional_data": {"sparse_embeddings": true} in the request. This means that all sequences in that request will be treated as sparse. If the user wants to mix embedding types, separate calls have to be made for each type.

The FlagEmbedding API allows returning more than one type of embedding at the same time, but currently, due to the limitation of the pooling return type, we can only return a single tensor per sequence.

To show that this PoC is already returning the correct results, consider the code below:

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3',  use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BGE M3?", "Defination of BM25"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
print(model.convert_id_to_token(output_1['lexical_weights']))

This code prints

[{'What': 0.08344, 'is': 0.08136, 'B': 0.1295, 'GE': 0.252, 'M': 0.1702, '3': 0.2695, '?': 0.04086}, {'De': 0.05023, 'fin': 0.1368, 'ation': 0.0452, 'of': 0.0635, 'BM': 0.2515, '25': 0.3337}]

With vLLM we get the following:

$ curl -s http://localhost:8000/v1/embeddings    -H "Content-Type: application/json"    -d '{
     "model": "BAAI/bge-m3",
     "input": ["What is BGE M3?", "Defination of BM25"],
     "additional_data": {"sparse_embeddings": true}
}' | jq
{
  "id": "embd-38ce076880b94d41b206ae99caae7b19",
  "object": "list",
  "created": 1741555561,
  "model": "BAAI/bge-m3",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        0.0836181640625,
        0.08148193359375,
        0.1295166015625,
        0.251708984375,
        0.1700439453125,
        0.269775390625,
        0.040924072265625
      ]
    },
    {
      "index": 1,
      "object": "embedding",
      "embedding": [
        0.050201416015625,
        0.136962890625,
        0.04510498046875,
        0.0633544921875,
        0.25146484375,
        0.333740234375
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "total_tokens": 17,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}
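
To map the returned weights back to tokens on the client side, something like the sketch below should work (an illustrative example, not part of this PR; it assumes the Hugging Face tokenizer for BAAI/bge-m3 and the requests library, and note that the raw token strings may carry the tokenizer's word-boundary prefix, unlike FlagEmbedding's convert_id_to_token output):

import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
sentences = ["What is BGE M3?", "Defination of BM25"]

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "BAAI/bge-m3",
        "input": sentences,
        "additional_data": {"sparse_embeddings": True},
    },
).json()

for sentence, item in zip(sentences, resp["data"]):
    # Tokenize locally and drop the BOS/EOS tokens so the token list
    # lines up positionally with the returned weight vector.
    token_ids = tokenizer(sentence)["input_ids"][1:-1]
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    print(dict(zip(tokens, item["embedding"])))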

maxdebayser avatar Mar 09 '25 21:03 maxdebayser

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Mar 09 '25 21:03 github-actions[bot]

To support sparse+dense together, we need to actually implement #12249. I still don't have time to implement this though.

DarkLight1337 avatar Mar 10 '25 07:03 DarkLight1337

I've changed the implementation so that now the user has to add --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}' to the command line to activate this mode. But I agree that we need to implement https://github.com/vllm-project/vllm/issues/12249 to properly support this and other models like ibm-granite/granite-embedding-30m-sparse. Let's keep this PR in draft state for now.

maxdebayser avatar Mar 13 '25 14:03 maxdebayser

This is great, and I'm looking forward to the launch of this feature. How long will it take for it to become available?

243006306 avatar Mar 20 '25 10:03 243006306

+1, waiting for this feature.

IllyaPysarchuk avatar Mar 26 '25 13:03 IllyaPysarchuk

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Apr 01 '25 06:04 mergify[bot]

+1

arjunasuresh300 avatar Apr 30 '25 08:04 arjunasuresh300

any update?

Sam120204 avatar Jun 17 '25 17:06 Sam120204

The V1 embedding PR is already approved but is now blocked by unrelated test failures: https://github.com/vllm-project/vllm/pull/16188. The next step will be to add support for encoder models, as they were left out of the embedding model PR to keep it simpler.

maxdebayser avatar Jun 17 '25 17:06 maxdebayser

It's still not supported yet?

fufenghua avatar Jul 08 '25 03:07 fufenghua

I think this should be possible now that we support multiple poolers

DarkLight1337 avatar Jul 24 '25 03:07 DarkLight1337

I think this should be possible now that we support multiple poolers

We can select the embedding types per request, right? But can we have multiple pooling strategies applied to the same request? Anyway, I'll revive this PR so that it already works for one pooling type per request.

maxdebayser avatar Jul 24 '25 17:07 maxdebayser

We can support a different task per request in the model runner, but this isn't exposed in the API server yet.

DarkLight1337 avatar Jul 24 '25 17:07 DarkLight1337

@DarkLight1337, I've updated the PR now that we have V1 embeddings and the new task refactoring. The new request form is:

curl -s http://localhost:8000/pooling    -H "Content-Type: application/json"    -d '{
     "model": "BAAI/bge-m3",
     "task": "embed-sparse",
     "input": ["What is BGE M3?", "Defination of BM25"]
}' | jq
{
  "id": "pool-f3ea25d3e28d4b40b686092badd99f91",
  "object": "list",
  "created": 1755018267,
  "model": "BAAI/bge-m3",
  "data": [
    {
      "index": 0,
      "object": "pooling",
      "data": [
        0.08349609375,
        0.0814208984375,
        0.1295166015625,
        0.251708984375,
        0.1700439453125,
        0.26953125,
        0.04083251953125
      ]
    },
    {
      "index": 1,
      "object": "pooling",
      "data": [
        0.05010986328125,
        0.136962890625,
        0.045013427734375,
        0.06341552734375,
        0.25146484375,
        0.33349609375
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "total_tokens": 17,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}

As a PoC, I created a new task "embed-sparse", but I'm not 100% happy with it: I don't think it will scale if we have to add many different new tasks. Maybe we should add model-defined sub-tasks that the dispatcher can use to route the requests.

Another point is that the output is not very expressive. To get the tokens, the user would have to call tokenize and match the tokens with the embeddings by position. I think we should make the PoolingResponse more generic to add task-specific outputs. This is related to the discussion in #21621

Finally, I'm not sure what the best way to test this model is. We could test it against the outputs of the FlagEmbedding library, but that means we would have to add yet another dependency, and I think we already have too many of those. Maybe we could just test a request against a known output.

maxdebayser avatar Aug 12 '25 17:08 maxdebayser

I'm not 100% happy with it, I don't think it will scale if we have to add many different new tasks

Agreed. Currently we allow the Pooler to define its own list of supported tasks, but in order for those tasks to work, we also have to update the PoolingParams checking and request dispatching, which could be quite complicated. Having sub-tasks would allow us to keep using the existing logic for the base task.

DarkLight1337 avatar Aug 13 '25 00:08 DarkLight1337

Another point is that the output is not very expressive. To get the tokens, the user would have to call tokenize and match the tokens with the embeddings by position. I think we should make the PoolingResponse more generic to add task-specific outputs.

Yeah, I now see the need for a registry for each task to override how the response is transformed. This would greatly improve the user experience when using the encode method.

DarkLight1337 avatar Aug 13 '25 00:08 DarkLight1337

Finally, I'm not sure what the best way to test this model is.

We can generate the ground truth locally using FlagEmbedding (with a helper function set up so it is easy for us to update the results when versions change), and then inside the CI we compare our implementation to those generated results.
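
A minimal sketch of that idea (the file and function names here are hypothetical, and FlagEmbedding would only be needed when regenerating the reference file, not in CI):

import json

SENTENCES = ["What is BGE M3?", "Defination of BM25"]
REFERENCE_FILE = "bge_m3_sparse_reference.json"  # hypothetical test-data path

def regenerate_reference() -> None:
    # Run manually (not in CI) whenever the FlagEmbedding version changes.
    from FlagEmbedding import BGEM3FlagModel

    model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
    out = model.encode(SENTENCES, return_dense=False, return_sparse=True)
    # 'lexical_weights' is one {token_id: weight} dict per sentence; keep only
    # the weights, in token order, so the stored file stays small and stable.
    reference = [[float(w) for w in weights.values()] for weights in out["lexical_weights"]]
    with open(REFERENCE_FILE, "w") as f:
        json.dump(reference, f, indent=2)

def check_against_reference(vllm_weights: list[list[float]], atol: float = 1e-2) -> None:
    # Compare the per-token weights produced by vLLM with the stored reference.
    with open(REFERENCE_FILE) as f:
        reference = json.load(f)
    assert len(vllm_weights) == len(reference)
    for got, expected in zip(vllm_weights, reference):
        assert len(got) == len(expected)
        assert all(abs(g - e) <= atol for g, e in zip(got, expected))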

DarkLight1337 avatar Aug 13 '25 00:08 DarkLight1337

Now that @noooop has added support for multi-vector retrieval with the token_embed and token_classify tasks, I've refactored this PR in terms of these tasks.

To start the server, the architecture has to be overridden, because otherwise the extra weight file for sparse embeddings (lexical weights) won't be loaded:

vllm serve BAAI/bge-m3 --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'

With this setting, the server supports regular dense embedding, token_embed and token_classify:

# token_classify returns the lexical weights
curl -s http://localhost:8000/pooling    -H "Content-Type: application/json"    -d '{
     "model": "BAAI/bge-m3",
     "task": "token_classify",
     "input": ["What is BGE M3?", "Defination of BM25"]
}'
curl -s http://localhost:8000/pooling    -H "Content-Type: application/json"    -d '{
     "model": "BAAI/bge-m3",
     "task": "token_embed",
     "input": ["What is BGE M3?", "Defination of BM25"]
}'

Please note that the token_classify request will return an array of scores, not a dict mapping decoded tokens to their scores. The API currently doesn't support rich formats like that.

The lexical weights can also be retrieved with the offline API:

from vllm import LLM

prompts = ["What is BGE M3?", "Defination of BM25"]

llm = LLM(
    model="BAAI/bge-m3",
    runner="pooling",
    enforce_eager=True,
    hf_overrides={"architectures": ["BgeM3EmbeddingModel"]})

outputs = llm.encode(prompts, pooling_task="token_classify")
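
To recover a token-to-weight mapping similar to FlagEmbedding's convert_id_to_token, the weights can be zipped with the locally tokenized prompts (a rough sketch; it assumes each result exposes its tensor as output.outputs.data and that BOS/EOS are already excluded from the returned weights, as in the examples above):

tokenizer = llm.get_tokenizer()
for prompt, output in zip(prompts, outputs):
    # Drop BOS/EOS so the token list lines up with the per-token weights.
    token_ids = tokenizer(prompt)["input_ids"][1:-1]
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    weights = output.outputs.data.flatten().tolist()
    print(dict(zip(tokens, weights)))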

cc: @DarkLight1337

maxdebayser avatar Oct 16 '25 20:10 maxdebayser

Hi, vLLM team! Is there any update or plan to merge this into the main branch? If there is anything I can do to help with supporting this feature in vLLM, I'd love to contribute and collaborate!

staugust avatar Nov 27 '25 07:11 staugust

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Nov 27 '25 07:11 mergify[bot]