feature: support bm42 embeddings
What does this PR do?
This PR adds support for Qdrant/all_miniLM_L6_v2_with_attentions. I followed https://github.com/Anush008/bm42-rs as a reference and added a new `bm42` pooling method.
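For reviewers unfamiliar with BM42, the rough idea behind the pooling step is: take the attention the [CLS] token pays to each input token (averaged over heads), merge WordPiece subwords back into whole words, and key a sparse vector by a hash of each word. A minimal sketch of that idea (the function and its signature are made up for illustration, and std's `DefaultHasher` only stands in for the murmur3 hash so the snippet compiles without extra crates; the full BM42 pipeline additionally filters stopwords and punctuation):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Sketch of BM42-style pooling: `cls_attention[i]` is the attention the
/// [CLS] token pays to `tokens[i]`, already averaged over heads.
fn bm42_pool(tokens: &[String], cls_attention: &[f32]) -> HashMap<u64, f32> {
    // Merge WordPiece continuation tokens ("##...") back into whole words,
    // summing their attention weights.
    let mut words: Vec<(String, f32)> = Vec::new();
    for (token, &weight) in tokens.iter().zip(cls_attention) {
        if matches!(token.as_str(), "[CLS]" | "[SEP]" | "[PAD]") {
            continue;
        }
        match token.strip_prefix("##") {
            Some(rest) if !words.is_empty() => {
                let last = words.last_mut().unwrap();
                last.0.push_str(rest);
                last.1 += weight;
            }
            _ => words.push((token.clone(), weight)),
        }
    }

    // Key the sparse vector by a hash of the word. The real model uses
    // murmur3; DefaultHasher is only a stand-in here.
    let mut sparse = HashMap::new();
    for (word, weight) in words {
        let mut hasher = DefaultHasher::new();
        word.hash(&mut hasher);
        *sparse.entry(hasher.finish()).or_insert(0.0f32) += weight;
    }
    sparse
}
```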
Tested by running the following command:
HF_TOKEN=hf_********************************** cargo run --no-default-features -F http -F ort -- --model-id Qdrant/all_miniLM_L6_v2_with_attentions --pooling bm42 --port 5888
Fixes #337
Before submitting
- [X] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case. Issue #337
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
BM42 is super controversial. See https://x.com/Nils_Reimers/status/1809334134088622217. I'm on the side of waiting a bit to see how this evolves.
To me, Qdrant came across as a pretty shady actor here, so I would take everything they publish with a big grain of salt.
> To me, Qdrant came across as a pretty shady actor here, so I would take everything they publish with a big grain of salt.
Fair. We really just want support for it so we can test on our own datasets and verify how good or bad it actually is. I agree a lot of that launch was shady, especially the Quickwit bench portion.
Any feedback on the actual implementation? Because the router function `async fn embed_sparse` expects a dense vector and then sparsifies it, it is super expensive to serve a lot of sparse embedding models. In the case of this model, it uses a murmur3 hash, which has a very large key space.
Which means that Infer has to create and send a very large dense vector across a channel. https://github.com/devflowinc/text-embeddings-inference/blob/dc62b26f04394c44aa3469c10ba8880d3076e087/backends/ort/src/bm42.rs#L293-L296
https://github.com/devflowinc/text-embeddings-inference/blob/dc62b26f04394c44aa3469c10ba8880d3076e087/core/src/infer.rs#L193-L204
This will affect adding support for other sparse models as well as this one. It would be a lot better if `Infer::embed_sparse` either:
- doesn't call the `embed` function, and each backend instead provides its own `embed_sparse` function, or
- sparsifies the vector before it gets sent back across the channel.
I don't have enough context on what the best approach for this is, but it would open the opportunity to support other sparse embedding models that have larger key spaces.
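To make those options concrete, here is a rough sketch of what a sparse path could look like at the backend trait level. Every type name and signature below is a placeholder rather than the actual crate's API; the point is just that a backend can hand back (index, value) pairs, so the full dense vector never has to be materialized or sent across the channel:

```rust
// Placeholder types so the sketch is self-contained; the real crate's
// batch and error types would be used instead.
pub struct Batch;
#[derive(Debug)]
pub struct BackendError;

/// A sparse embedding as parallel (index, value) arrays, so nothing
/// proportional to the hash key space ever has to be allocated.
pub struct SparseEmbedding {
    pub indices: Vec<u32>,
    pub values: Vec<f32>,
}

/// Hypothetical shape of a backend trait with a sparse entry point:
/// natively sparse models (BM42, SPLADE-style, ...) override
/// `embed_sparse`, while dense backends keep the default, which
/// sparsifies before anything is sent back across the channel.
pub trait Backend {
    fn embed(&self, batch: Batch) -> Result<Vec<Vec<f32>>, BackendError>;

    fn embed_sparse(&self, batch: Batch) -> Result<Vec<SparseEmbedding>, BackendError> {
        let dense = self.embed(batch)?;
        Ok(dense
            .into_iter()
            .map(|embedding| {
                // Keep only the non-zero entries of the dense output.
                let (indices, values) = embedding
                    .into_iter()
                    .enumerate()
                    .filter(|(_, v)| *v != 0.0)
                    .map(|(i, v)| (i as u32, v))
                    .unzip();
                SparseEmbedding { indices, values }
            })
            .collect())
    }
}
```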
> Which means that Infer has to create and send a very large dense vector across a channel.
Why is that a problem? Moving a `Vec` shouldn't be a concern; it's not expensive as far as I know.
But yes, having a dedicated `embed_sparse` function in the backend trait could be interesting.
Creating a vector with up to 9223372036854776000 `f32`s is pretty expensive in this case, even if they are all zero-filled.
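For scale: even a 32-bit key space would mean 2^32 × 4 bytes ≈ 16 GiB for a single dense f32 vector, and a key space on the order of 2^63, like the number above, works out to roughly 32 EiB, which cannot be allocated at all.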
> Creating a vector with up to 9223372036854776000 `f32`s is pretty expensive in this case, even if they are all zero-filled.
Ok, so when you say "Which means that Infer has to create and send a very large dense vector across a channel", you are only referring to the creation part? As I said, the moving part ("send across a channel") is not an issue.