
Support for BAAI/bge-m3 model

Open xusenlinzy opened this issue 1 year ago • 28 comments

Feature request

BGE-M3 is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.

  • Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval (see the usage sketch after this list).

  • Multi-Linguality: It can support more than 100 working languages.

  • Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
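
For reference, this is roughly how the three outputs are exposed by the FlagEmbedding package (a sketch only; argument and key names are from memory and may differ slightly):

from FlagEmbedding import BGEM3FlagModel

# Sketch of the reference implementation's API (names from memory, may differ).
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
output = model.encode(
    ["What is BGE M3?"],
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=True,
)
print(output["dense_vecs"])       # dense embeddings
print(output["lexical_weights"])  # sparse token weights
print(output["colbert_vecs"])     # multi-vector (ColBERT-style) embeddings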

Motivation

Support for BAAI/bge-m3 model

Your contribution

Support for BAAI/bge-m3 model

xusenlinzy avatar Jan 30 '24 06:01 xusenlinzy

This seems to work already

jonmach avatar Feb 04 '24 21:02 jonmach

This seems to work already

really? Is it supported now?

spttt avatar Feb 05 '24 05:02 spttt

I didn't realise it wasn't supported. I just loaded it up and it worked. See, for example, this response from localhost:8080/info:

A test of a vector encoding also comes back fine.

{
  "model_id": "BAAI/bge-m3",
  "model_sha": null,
  "model_dtype": "float16",
  "model_type": {
    "embedding": {
      "pooling": "cls"
    }
  },
  "max_concurrent_requests": 512,
  "max_input_length": 8192,
  "max_batch_tokens": 16384,
  "max_batch_requests": null,
  "max_client_batch_size": 100,
  "tokenization_workers": 12,
  "version": "0.6.0",
  "sha": "6395a7a29624bb7199acf58df086005bb250f35e",
  "docker_label": null
}

jonmach avatar Feb 05 '24 08:02 jonmach

Generating normal dense embeddings works fine because bge-m3 is just a regular XLM-Roberta model.

The problem is there's no way to use the sparse or colbert features of this model because they need different linear heads on the model's unpooled output, and right now, it seems like there's no way to get TEI to give back the last_hidden_state of the model, which you need to use those heads.

Maybe TEI could be expanded with an endpoint/option that returns the raw output of an embedding model, and maybe also the masks, as those are needed if we want to manually process the outputs on the client side. That way the bulk of the processing can happen on the TEI side and only the last linear layers need to be applied on the client side.
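
To illustrate what such an option would enable: given the raw last_hidden_state (and masks), the bge-m3 heads could be applied client side along the lines of the FlagEmbedding code. The following is only a sketch with random head weights and made-up shapes; the real linear heads would have to be loaded from the weights distributed alongside the model.

import torch

hidden_size = 1024  # bge-m3 is based on XLM-RoBERTa-large

# Random weights for illustration only; load the real heads from the model's files.
sparse_linear = torch.nn.Linear(hidden_size, 1)
colbert_linear = torch.nn.Linear(hidden_size, hidden_size)

last_hidden_state = torch.randn(1, 7, hidden_size)  # what TEI would need to return
attention_mask = torch.ones(1, 7)

# Sparse/lexical retrieval: one non-negative weight per token.
token_weights = torch.relu(sparse_linear(last_hidden_state)).squeeze(-1) * attention_mask

# ColBERT-style multi-vector retrieval: one vector per token, CLS dropped,
# padding masked out (mirroring the colbert_embedding snippet quoted further down).
colbert_vecs = colbert_linear(last_hidden_state[:, 1:]) * attention_mask[:, 1:, None]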

LLukas22 avatar Feb 05 '24 10:02 LLukas22

@jonmach Hi, could you please share how you start TEI with bge-m3? I'm having trouble starting the service.

TEI doesn't seem to download the pooling config file (1_Pooling/config.json), which appears in the model repo.

$ text-embeddings-router --model-id BAAI/bge-m3
2024-02-07T07:36:02.033059Z  INFO text_embeddings_router: router/src/main.rs:112: Args { model_id: "BAA*/**e-m3", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, hf_api_token: None, hostname: "tei-0", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), json_output: false, otlp_endpoint: None }
2024-02-07T07:36:02.033206Z  INFO hf_hub: /usr/local/cargo/git/checkouts/hf-hub-1aadb4c6e2cbe1ba/b167f69/src/lib.rs:55: Token file not found "/home/jovyan/.cache/huggingface/token"    
2024-02-07T07:36:12.132887Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:9: Starting download
2024-02-07T07:36:12.136874Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:26: Model artifacts downloaded in 3.991845ms
Error: The `--pooling` arg is not set and we could not find a pooling configuration (`1_Pooling/config.json`) for this model.

Caused by:
    No such file or directory (os error 2)

edwardzjl avatar Feb 07 '24 07:02 edwardzjl

and maybe the masks

@LLukas22 what masks are you talking about?

OlivierDehaene avatar Feb 07 '24 10:02 OlivierDehaene

@jonmach Hi, could you please share how you start TEI with bge-m3? I'm having trouble starting the service.

Sure - this is all I do, which gives me the option of starting various models.

Unless I'm mistaken, I can only run a single model at once and not choose what I want at the time of requesting an embedding.

#model=sentence-transformers/all-MiniLM-L6-v2
#model=sentence-transformers/multi-qa-MiniLM-L6-cos-v1
#model=sentence-transformers/all-distilroberta-v1
#model=intfloat/e5-base-v2
#model=BAAI/bge-large-en-v1.5
#model=intfloat/e5-large-v2
model=BAAI/bge-m3

text-embeddings-router --model-id $model --port 8080 --max-client-batch-size 100

Here is the output:

2024-02-07T10:24:43.548034Z  INFO text_embeddings_router: router/src/main.rs:112: Args { model_id: "BAA*/**e-m3", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 100, hf_api_token: None, hostname: "0.0.0.0", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, json_output: false, otlp_endpoint: None }
2024-02-07T10:24:43.551340Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:9: Starting download
2024-02-07T10:24:43.551457Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:26: Model artifacts downloaded in 121.583µs
2024-02-07T10:24:43.786911Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:23: Starting 12 tokenization workers
2024-02-07T10:24:44.851695Z  INFO text_embeddings_router: router/src/lib.rs:239: Starting model backend
2024-02-07T10:24:44.858840Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:93: Starting Bert model on Metal(MetalDevice(4294968302))
2024-02-07T10:24:45.913866Z  INFO text_embeddings_router: router/src/lib.rs:310: Ready

jonmach avatar Feb 07 '24 10:02 jonmach

@LLukas22 what masks are you talking about?

@OlivierDehaene , I was referring to the attention masks for batched input, as reconstructing these on the client side could be challenging if I prefer not to run the tokenizer there. However, this also depends on how TEI would return the last_hidden_states for a batched input. For example, if TEI were to return a single array of size [batch_size, longest_sequence_length, d_model] and I input a batch with varying sequence lengths, then I would need the attention masks if I want to, for instance, manually perform mean pooling correctly.
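
For instance, a correct masked mean pooling over such a padded array would look roughly like this (a minimal sketch with made-up shapes):

import torch

last_hidden_state = torch.randn(2, 10, 1024)                  # [batch_size, longest_sequence_length, d_model]
attention_mask = torch.tensor([[1] * 4 + [0] * 6, [1] * 10])  # first entry only has 4 real tokens

mask = attention_mask.unsqueeze(-1).float()                            # [batch, seq, 1]
mean_pooled = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)  # [batch, d_model]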

A similar issue arises when examining the FlagEmbedding implementation, since their ColBERT linear head also applies the attention mask for batched inference.

def colbert_embedding(self, last_hidden_state, mask):
    colbert_vecs = self.colbert_linear(last_hidden_state[:, 1:])
    colbert_vecs = colbert_vecs * mask[:, 1:][:, :, None].float()
    return colbert_vecs

Another option for TEI would be to return a list of arrays, where each array represents the last_hidden_state of an entry in the batch, meaning that the result would consist of batch_size arrays of potentially different lengths with the "attention mask" already applied on the server side. While this might be easier for users to understand, it also means that performing batched inference on the client side, for example, with a final linear layer, would be much more challenging as those arrays would need to be padded and stacked to process them in batches.

LLukas22 avatar Feb 07 '24 11:02 LLukas22

the result would consist of batch_size arrays of potentially different lengths with the "attention mask" already applied on the server side.

Yes that will be the API. TEI does not use attention masks on GPUs.

those arrays would need to be padded and stacked to process them in batches

No, you just need to concatenate the arrays, run the linear layer, then index into the result with the lengths of the different arrays:

import torch

embed_a = torch.randn([12, 1024])  # shape [12, 1024]
embed_b = torch.randn([128, 1024]) # shape [128, 1024]

varlen = [0, embed_a.shape[0], embed_a.shape[0] + embed_b.shape[0]]

linear = torch.nn.Linear(1024, 512)
out = linear(torch.concat([embed_a, embed_b])) # shape [140, 512]

final_a = out[varlen[0] : varlen[1]] # shape [12, 512]
final_b = out[varlen[1] : varlen[2]] # shape [128, 512]

This is a very basic implementation and can be optimized to index more efficiently, but that's what we use internally in TEI and TGI to remove padding everywhere we can.

Also, on CPU, this will be way more efficient than the padded way.
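
For what it's worth, the final slicing step can also be written with torch.split and the per-sequence lengths (equivalent to the explicit indexing above):

lengths = [embed_a.shape[0], embed_b.shape[0]]  # [12, 128]
final_a, final_b = torch.split(out, lengths)    # shapes [12, 512] and [128, 512]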

OlivierDehaene avatar Feb 07 '24 11:02 OlivierDehaene

@jonmach It seems I had a broken download. After a cleanup the command works. Thanks!

edwardzjl avatar Feb 07 '24 13:02 edwardzjl

No, you just need to concatenate the arrays, run the linear layer, then index into the result with the lengths of the different arrays.

Got it, this actually makes a lot of sense. 👍

LLukas22 avatar Feb 07 '24 15:02 LLukas22

For now, https://github.com/puppetm4st3r/baai_m3_simple_server/tree/main is a simple (very simple) solution for hosting m3 embeddings and scoring with the transformers implementation, wrapped in FastAPI with async, timeout, error, and concurrency handling.

Maybe it's helpful for someone :)

puppetm4st3r avatar Feb 22 '24 19:02 puppetm4st3r

I have tried using TEI to host bge-m3:

text-embeddings-router --model-id /model/bge-m3 --dtype float32 --pooling cls --max-batch-tokens 4194304 -p 40031 --max-client-batch-size 512 --max-batch-requests 512 --max-concurrent-requests 512 --max-input-length 8192 --tokenization-workers 48

However, after running for about 1-2 hours, I cannot reach the API any more, and the log looks just fine:

2024-02-23T08:30:07.337475Z  INFO embed{total_time="47.308347ms" tokenization_time="1.446196ms" queue_time="8.160533ms" inference_time="16.841147ms"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:30:07.519394Z  INFO embed{total_time="123.939788ms" tokenization_time="1.392177ms" queue_time="24.419153ms" inference_time="72.966591ms"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:30:10.308262Z  INFO embed{total_time="2.761004937s" tokenization_time="1.233394ms" queue_time="9.322389ms" inference_time="2.062274461s"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:30:10.367692Z  INFO embed{total_time="41.359432ms" tokenization_time="1.107312ms" queue_time="8.722307ms" inference_time="14.065883ms"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:30:10.472197Z  INFO embed{total_time="69.639064ms" tokenization_time="1.319081ms" queue_time="9.892876ms" inference_time="43.147685ms"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:30:10.547843Z  INFO embed{total_time="47.557083ms" tokenization_time="1.084367ms" queue_time="9.57026ms" inference_time="16.354058ms"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:36:33.323552Z  INFO embed{total_time="1.073678088s" tokenization_time="3.991227ms" queue_time="47.265473ms" inference_time="470.999812ms"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:36:34.489033Z  INFO embed{total_time="1.038392242s" tokenization_time="3.972168ms" queue_time="14.743635ms" inference_time="935.307304ms"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:36:35.163018Z  INFO embed{total_time="541.768504ms" tokenization_time="4.435819ms" queue_time="68.930481ms" inference_time="267.758527ms"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:36:36.158408Z  INFO embed{total_time="881.391933ms" tokenization_time="4.366283ms" queue_time="11.35743ms" inference_time="828.172481ms"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:36:37.551110Z  INFO embed{total_time="1.25827532s" tokenization_time="4.133095ms" queue_time="67.692048ms" inference_time="540.238021ms"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:36:39.549517Z  INFO embed{total_time="1.847077231s" tokenization_time="4.55847ms" queue_time="15.950995ms" inference_time="1.604017682s"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:36:40.187346Z  INFO embed{total_time="513.751543ms" tokenization_time="3.718669ms" queue_time="68.060994ms" inference_time="222.132152ms"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:36:41.115321Z  INFO embed{total_time="818.978842ms" tokenization_time="3.106115ms" queue_time="14.039839ms" inference_time="734.88417ms"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:36:42.297776Z  INFO embed{total_time="1.041857067s" tokenization_time="5.299782ms" queue_time="16.860908ms" inference_time="914.732321ms"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:36:42.834815Z  INFO embed{total_time="417.750633ms" tokenization_time="3.347032ms" queue_time="65.375982ms" inference_time="184.086998ms"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:36:45.594506Z  INFO embed{total_time="2.625969873s" tokenization_time="4.261676ms" queue_time="36.070208ms" inference_time="1.504996124s"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:36:46.408578Z  INFO embed{total_time="686.905122ms" tokenization_time="3.831811ms" queue_time="12.272417ms" inference_time="640.90231ms"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success
2024-02-23T08:36:46.991094Z  INFO embed{total_time="457.306486ms" tokenization_time="2.93387ms" queue_time="19.056778ms" inference_time="372.190997ms"}: text_embeddings_router::http::server: router/src/http/server.rs:554: Success

I have restarted the service several times, and every time it broke down again. Does anyone have the same problem?

SingL3 avatar Feb 23 '24 08:02 SingL3

I checked via watch -n 0.1 nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:A9:00.0 Off |                    0 |
| N/A   58C    P0             324W / 400W |  31097MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

and top:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   3011 root      20   0  178.9g  10.7g 179492 S  99.7   1.1  29:12.35 text-embeddings

@OlivierDehaene @jonmach Any suggestions?

SingL3 avatar Feb 23 '24 09:02 SingL3

The GPU is stuck at 100% util and the model is not answering?

I cannot reach this api any more

Can you GET on /health?

OlivierDehaene avatar Feb 23 '24 09:02 OlivierDehaene

It's possible that such a large batch size breaks something; I honestly never benchmarked TEI with a batch of 4 million tokens. Do you see throughput improvements by running at this max-batch-tokens compared to something more reasonable?

OlivierDehaene avatar Feb 23 '24 09:02 OlivierDehaene

The max batch tokens is set to 4194304, but I am actually running with a batch size of 50 and a max input length of 8192.

SingL3 avatar Feb 23 '24 09:02 SingL3

@OlivierDehaene

The GPU is stuck at 100% util and the model is not answering?

I cannot reach this api any more

Can you GET on /health?

Sorry, I didn't see this. Nothing is returned when calling curl xxx.xxx.xxx.xxx:xxxx/health -X POST -H 'Content-Type: application/json'. When I run it with Python, the response code is 422.

I am now running with a batch size of 64 (max-batch-tokens=524288), and it still breaks down.

SingL3 avatar Feb 26 '24 02:02 SingL3

OK, the max batch size is like 20 for A100.

SingL3 avatar Feb 26 '24 09:02 SingL3

OK, the max batch size is like 20 for A100.

Interesting, I have no problems running BAAI/bge-m3 with a batch size of >80 on a single H100. But I'm getting a Failed to buffer the request body: length limit exceeded error if my request gets bigger than 2MB.

I'm running my container via: docker run --gpus all -p 8080:80 -v $PWD/data:/data --pull always ghcr.io/huggingface/text-embeddings-inference:hopper-1.0 --model-id BAAI/bge-m3 --max-batch-tokens 4096000 --max-client-batch-size 512

Dummy script I'm using to send requests to the server:

import httpx
from tokenizers import Tokenizer
import asyncio
from tqdm import tqdm
import sys
import json

text = "hello how are you?" * 1300

async def main():
    async with httpx.AsyncClient() as client:
        payload = {"inputs": [text]*88} # Setting 90 here will exceed the 2MB limit
        print(f"{sys.getsizeof(json.dumps(payload))/(1024*1024)} MB")       
        for _ in tqdm(range(1_000)):
            response = await client.post(
                "http://51.159.147.238:8080/embed",
                json=payload,
            )
            assert response.status_code == 200  
           
if __name__ == "__main__":
    tokenizer = Tokenizer.from_pretrained("BAAI/bge-m3") 
    print(tokenizer.encode(text))
    asyncio.run(main())

LLukas22 avatar Feb 26 '24 10:02 LLukas22

But I'm getting a Failed to buffer the request body: length limit exceeded error if my request gets bigger than 2MB.

Now you can pass --payload-limit <PAYLOAD_LIMIT> to handle this

xfalcox avatar Apr 05 '24 15:04 xfalcox

The problem is there's no way to use the sparse or colbert features of this model

I too would be interested in the /embed_sparse endpoint working with the bge-m3 model.

xfalcox avatar Apr 05 '24 15:04 xfalcox

It can't be served as a reranker model, but embedding works:

docker run --rm  -p 8080:8080  ghcr.io/huggingface/text-embeddings-inference:cpu-1.2  --model-id BAAI/bge-m3

curl 127.0.0.1:8080/embed \
>     -X POST \
>     -d '{"inputs":"What is Deep Learning?"}' \
>     -H 'Content-Type: application/json'
[[-0.0426506,-0.053446304,-0.0135189695,-0.014464944,-0.013017608,-0.02092999,0.004542577,-0.009760644,0.017525507,-0.015674356,-0.017655382,0.009944128,-0.012726562,0.012365224,0.008321619,-0.06599002,-0.010078649,-0.026094593,0.0048552104,-0.04845821,-0.020976849,-0.04162959,0.00019328059,0.0049855304,0.011595723,0.05492824,0.017553054,-0.03543142,-0.026268922,-0.033098128,0.014587652,-0.020304088,0.029781988,-0.018910617,-0.029656466,-0.03942437,0.024668323,-0.014752873,-0.02827941,-0.015004843,-0.023137731,0.047132764,-0.020062488,-0.045326408,0.034355637,-0.03149631,-0.026085582,-0.027719917,0.015249596,-0.0040677967,-0.030534575,-0.013819123,0.06278714,0.020077288,-0.012152808,0.004369956,-0.0032184036,0.008678024,-0.07005509,-0.023579568,0.0018820278,0.004482847,0.007092127,0.038206495,0.060807858,0.07470985,0.002609027,-0.00024154258,-0.0004231959,-0.020194644,0.010048431,0.02775141,-0.04005056,0.013067819,-0.057827156,0.0552481,0.00011962272,0.02514833,-0.06033531,-0.011081212,0.03796447,0.005629248,-0.043302286,-0.0015303063,-0.011835964,0.021213261,-0.007665253,-0.00781968,0.0075604245,-0.025405265,-0.017505145,-0.021158328,0.011225852,-0.056344707,-0.03696825,-0.05598617,-0.004134916,0.02569101,-0.016667847,0.009679811,0.013292633,0.01636087..

curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
    -H 'Content-Type: application/json'

{"error":"Backend error: model is not a re-ranker model","error_type":"Backend"}


xunfeng1980 avatar Apr 17 '24 03:04 xunfeng1980

@xunfeng1980 for the rerank task you have to use the https://huggingface.co/BAAI/bge-reranker-v2-m3 model.
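
A minimal client-side sketch, assuming the router is restarted with --model-id BAAI/bge-reranker-v2-m3 and reachable on 127.0.0.1:8080 as in the curl examples above:

import httpx

response = httpx.post(
    "http://127.0.0.1:8080/rerank",
    json={
        "query": "What is Deep Learning?",
        "texts": ["Deep Learning is not...", "Deep learning is..."],
    },
)
print(response.status_code, response.json())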

puppetm4st3r avatar Apr 17 '24 04:04 puppetm4st3r

any progress now?

seetimee avatar Jul 02 '24 02:07 seetimee

any progress on sparse and colbert embedding?

xxxfzxxx avatar Jul 02 '24 02:07 xxxfzxxx

Is there any progress on supporting bge-m3 and others for sparse embedding? Which model works well with /embed_sparse? I want to test the /embed_sparse endpoint; it can be any model at the moment. -Ikram

ulhaqi12 avatar Jul 23 '24 11:07 ulhaqi12