text-embeddings-inference
Support for Optimum Inference?
Feature request
Are you currently supporting inference of Optimum-converted models, e.g. through the ONNX Runtime? I tried a couple of pre-optimized HF models, e.g. https://huggingface.co/Xenova/bge-large-en-v1.5, but get these errors (due to the different directory structure of these models):
docker run -it --rm -p 8081:80 ghcr.io/huggingface/text-embeddings-inference:cpu-0.6 --model-id Xenova/bge-large-en-v1.5
2024-01-11T17:10:45.873376Z INFO text_embeddings_router: router/src/main.rs:112: Args { model_id: "Xen***/***-*****-**-v1.5", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, hf_api_token: None, hostname: "69a9402a2c94", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), json_output: false, otlp_endpoint: None }
2024-01-11T17:10:45.876608Z INFO hf_hub: /usr/local/cargo/git/checkouts/hf-hub-1aadb4c6e2cbe1ba/b167f69/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-01-11T17:10:46.150969Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:9: Starting download
Error: Could not download model artifacts
Caused by:
0: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/Xenova/bge-large-en-v1.5/resolve/main/pytorch_model.bin)
1: HTTP status client error (404 Not Found) for url (https://huggingface.co/Xenova/bge-large-en-v1.5/resolve/main/pytorch_model.bin)
Motivation
Optimum models run much faster and would be great for speeding up embeddings serving. Here are some benchmark numbers from the exact same machine, running the model unconverted with text-embeddings-inference, and converted / converted+quantized through Transformers.js using the onnxruntime-node backend:
| Runtime | Model | Latency |
| --- | --- | --- |
| text-embeddings-inference | intfloat/multilingual-e5-large | 160 ms |
| Transformers.js (optimized) | Xenova/multilingual-e5-large | 46.2 ms |
| Transformers.js (optimized + quantized) | Xenova/multilingual-e5-large | 43.3 ms |
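For context, a quick back-of-the-envelope calculation of the speedups implied by the latencies quoted above (this is just arithmetic on the numbers in this thread, not a new benchmark):

```python
# Latencies quoted above, in milliseconds (same machine, same model family)
tei_ms = 160.0    # text-embeddings-inference, unconverted
onnx_ms = 46.2    # Transformers.js, optimized ONNX
onnx_q_ms = 43.3  # Transformers.js, optimized + quantized ONNX

# Relative speedup of each ONNX variant over the unconverted baseline
print(f"optimized:           {tei_ms / onnx_ms:.1f}x")
print(f"optimized+quantized: {tei_ms / onnx_q_ms:.1f}x")
```

So the converted models come out roughly 3.5x faster on this particular CPU setup.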
I'm not sure how easy it would be to integrate this into TEI, but have you looked into the NVIDIA Triton Inference Server? It does require some legwork to get models working with it; it's not as simple as just converting to ONNX. It's blazing fast though. I've seen some embedding models with 1 ms latencies using it.
ONNX on CPU is not supported; it is generally faster than TEI (Candle) on CPU. We are working in Candle to make CPU inference faster, but it's more complicated than for other devices (like CUDA or Metal).
On GPU, TEI is the fastest solution out there.
> ONNX on CPU is not supported; it is generally faster than TEI (Candle) on CPU. We are working in Candle to make CPU inference faster, but it's more complicated than for other devices (like CUDA or Metal).
> On GPU, TEI is the fastest solution out there.
Is this statement based on some benchmarking done with the Triton Inference Server?
This statement is based on the benchmarks linked at the top of the README and on internal benchmarks done by our partners.
If you want to replicate them, use k6 with the benchmarking scripts in `load_tests`.
Today I came across a performance boost by Intel and Hugging Face to speed up batch inference; dropping the link to the repo here: https://github.com/huggingface/optimum-intel
Yes, Optimum Intel (and CPU inference in general) might be faster with other methods. The main focus of this repo is GPU inference for bulk embeddings.