
Support for Optimum Inference?

Open jens-totemic opened this issue 5 months ago • 6 comments

Feature request

Are you currently supporting inference of Optimum-converted models, e.g. through the ONNX Runtime? I tried a couple of pre-optimized HF models, e.g. https://huggingface.co/Xenova/bge-large-en-v1.5, but get the errors below (these models use a different directory structure):

docker run -it --rm -p 8081:80 ghcr.io/huggingface/text-embeddings-inference:cpu-0.6 --model-id Xenova/bge-large-en-v1.5
2024-01-11T17:10:45.873376Z  INFO text_embeddings_router: router/src/main.rs:112: Args { model_id: "Xen***/***-*****-**-v1.5", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, hf_api_token: None, hostname: "69a9402a2c94", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), json_output: false, otlp_endpoint: None }
2024-01-11T17:10:45.876608Z  INFO hf_hub: /usr/local/cargo/git/checkouts/hf-hub-1aadb4c6e2cbe1ba/b167f69/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"    
2024-01-11T17:10:46.150969Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:9: Starting download
Error: Could not download model artifacts

Caused by:
    0: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/Xenova/bge-large-en-v1.5/resolve/main/pytorch_model.bin)
    1: HTTP status client error (404 Not Found) for url (https://huggingface.co/Xenova/bge-large-en-v1.5/resolve/main/pytorch_model.bin)
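
For context, here is a minimal sketch of how such an ONNX export is usually run outside TEI with the optimum Python package (this is only an illustration: export=True converts the original checkpoint on the fly, so it does not depend on the Xenova repo layout):

# Sketch only: loads the original checkpoint through Optimum's ONNX Runtime backend,
# converting it to ONNX on the fly, and computes a CLS-pooled embedding.
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch

model_id = "BAAI/bge-large-en-v1.5"  # original (non-ONNX) checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)

batch = tokenizer(["example sentence"], padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)
embeddings = outputs.last_hidden_state[:, 0]  # CLS pooling, as bge models use
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=-1)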

Motivation

Optimum models run much faster and would be great for speeding up embedding serving. Here are some benchmark numbers from the exact same machine, running the model unconverted with text-embeddings-inference, and converted / converted+quantized through Transformers.js with the onnxruntime-node backend:

text-embeddings-inference - intfloat/multilingual-e5-large: 160 ms
Transformers.js - Xenova/multilingual-e5-large (optimized): 46.2 ms
Transformers.js - Xenova/multilingual-e5-large (optimized+quantized): 43.3 ms

jens-totemic commented Jan 11 '24

I'm not sure how easy it would be to integrate this into TEI, but have you looked into the Nvidia Triton Inference Server? It does require some legwork to get models working with it -- it's not as simple as just converting to ONNX. It's blazing fast though. I've seen some embedding models with 1ms latencies using it.
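
For anyone curious, below is a rough sketch of what querying an ONNX embedding model deployed on Triton looks like from Python with the tritonclient package. The model name and tensor names are hypothetical and depend entirely on how you set up the model repository; tokenization has to happen client-side.

# Hypothetical client-side sketch: assumes a Triton server at localhost:8000 serving an
# ONNX embedding model named "bge-large-en-v1.5" with transformer-style
# input_ids / attention_mask inputs and a last_hidden_state output.
import numpy as np
import tritonclient.http as httpclient
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
enc = tokenizer(["example sentence"], padding=True, return_tensors="np")

client = httpclient.InferenceServerClient(url="localhost:8000")
inputs = []
for name in ("input_ids", "attention_mask"):
    arr = enc[name].astype(np.int64)
    inp = httpclient.InferInput(name, list(arr.shape), "INT64")
    inp.set_data_from_numpy(arr)
    inputs.append(inp)

result = client.infer(model_name="bge-large-en-v1.5", inputs=inputs)
hidden = result.as_numpy("last_hidden_state")  # (batch, seq, dim)
embedding = hidden[:, 0]                       # CLS pooling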

dcbark01 commented Jan 19 '24

ONNX on CPU is not supported; it is generally faster than TEI's Candle backend on CPU. We are working in Candle to make CPU inference faster, but it's more complicated than for other devices (like CUDA or Metal).

On GPU, TEI is the fastest solution out there.

OlivierDehaene commented Jan 22 '24

> ONNX on CPU is not supported; it is generally faster than TEI's Candle backend on CPU. We are working in Candle to make CPU inference faster, but it's more complicated than for other devices (like CUDA or Metal).
>
> On GPU, TEI is the fastest solution out there.

Is this statement based on some benchmarking done with the Triton Inference Server?

msminhas93 commented Feb 14 '24

This statement is based on the benchmarks linked at the top of the README and on internal benchmarks done by our partners. If you want to replicate them, use k6 with the benchmarking scripts in the load_tests directory.

OlivierDehaene commented Feb 14 '24

Today I came across a performance optimization by Intel and Hugging Face for speeding up batch inference; dropping a link to the repo here: https://github.com/huggingface/optimum-intel
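
A hedged sketch of what that can look like in practice, assuming the optimum-intel package with its OpenVINO backend (the class name and export flag are my understanding of that repo, not something from this thread):

# Sketch only: loads an embedding model through optimum-intel's OpenVINO backend,
# converting the checkpoint on the fly, and computes a mean-pooled embedding.
from optimum.intel import OVModelForFeatureExtraction
from transformers import AutoTokenizer
import torch

model_id = "intfloat/multilingual-e5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForFeatureExtraction.from_pretrained(model_id, export=True)

batch = tokenizer(["query: example sentence"], padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)
# e5 models use mean pooling over the attention mask
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=-1)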

giyaseddin commented Apr 03 '24

Yes, Optimum Intel (and CPU inference more generally) might be faster with other methods. The main focus of this repo is GPU inference for bulk embeddings.

OlivierDehaene commented Apr 04 '24