feat: support huggingface/text-embeddings-inference for faster embedding inference
Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. TEI implements many features such as:
- No model graph compilation step
- Metal support for local execution on Macs
- Small docker images and fast boot times. Get ready for true serverless!
- Token based dynamic batching
- Optimized transformers code for inference using Flash Attention, Candle and cuBLASLt
- Safetensors weight loading
- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
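Because TEI serves embeddings over a plain HTTP API, wiring it into ModelCache reduces to a small HTTP client. Below is a minimal sketch, assuming a TEI container is already serving a model at `http://127.0.0.1:8080` (the port mapping used in TEI's own docs); the `/embed` route and `{"inputs": ...}` payload follow TEI's documented API, while the helper name `tei_embed` is purely illustrative:

```python
import requests


def tei_embed(texts, base_url="http://127.0.0.1:8080"):
    """Return one embedding vector per input text from a running TEI server."""
    # TEI's documented /embed route accepts {"inputs": <str or list[str]>}
    # and returns a JSON array of float arrays, one per input.
    resp = requests.post(f"{base_url}/embed", json={"inputs": texts}, timeout=30)
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    vectors = tei_embed(["What is ModelCache?", "TEI speeds up embedding inference."])
    print(len(vectors), "embeddings of dimension", len(vectors[0]))
```

Because the model stays resident in the TEI server, repeated embedding calls avoid Python-side model loading and benefit from TEI's token-based dynamic batching.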
This PR adds TEI support to ModelCache for faster embedding inference; the resulting speedup is shown below:
Thank you for participating in the ModelCache open-source project; we welcome your involvement, and the addition of huggingface/text-embeddings-inference is a good idea. We offer two suggestions regarding your submission:
1. Using `TextEmbeddingsInference` as a class name and `text_embeddings_inference` as a variable name for `LazyImport` is somewhat generic, and users may confuse these with general concepts. We recommend more distinctive names, such as `HuggingfaceTEI` or `Huggingface_TEI`, to improve recognizability.
2. Given the use of URL requests, we recommend adding an example to the `examples/embedding` directory. I have already added that directory, so you can pull the latest main branch to obtain it. A sketch of such an example follows this list.
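A hedged sketch of what such an example might look like, assuming the class is renamed to `HuggingfaceTEI` as suggested above and that it follows a `to_embeddings()` / `dimension` shape like ModelCache's other embedding wrappers; the constructor argument, method names, and file path here are illustrative assumptions, not the merged API:

```python
# Hypothetical script for examples/embedding/, e.g. huggingface_tei_demo.py.
import numpy as np
import requests


class HuggingfaceTEI:
    """Embedding wrapper backed by a huggingface/text-embeddings-inference server."""

    def __init__(self, base_url="http://127.0.0.1:8080"):
        self.base_url = base_url
        # Probe once so the vector dimension is known up front.
        self.__dimension = self.to_embeddings("dimension probe").shape[0]

    def to_embeddings(self, data, **_):
        # One URL request per call, matching the PR's request-based design.
        resp = requests.post(
            f"{self.base_url}/embed", json={"inputs": data}, timeout=30
        )
        resp.raise_for_status()
        return np.array(resp.json()[0], dtype="float32")

    @property
    def dimension(self):
        return self.__dimension


if __name__ == "__main__":
    tei = HuggingfaceTEI()
    vec = tei.to_embeddings("Hello, ModelCache!")
    print("dimension:", tei.dimension, "| norm:", float(np.linalg.norm(vec)))
```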
We have merged your commit into the main branch. Thank you for your contributions to the ModelCache project. Best wishes!