text-embeddings-inference
Support of Docker/Kubernetes CPU limit/reservation
Feature request
Docker (swarm) and Kubernetes have a way to limit CPU usage of a container.
Docker (swarm):
```yaml
version: '3.4'
services:
  text-embeddings:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-0.6
    deploy:
      resources:
        limits:
          cpus: '2'
```
Kubernetes:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: text-embeddings
spec:
  containers:
    - name: text-embeddings
      image: ghcr.io/huggingface/text-embeddings-inference:cpu-0.6
      resources:
        requests:
          cpu: "2000m"
        limits:
          cpu: "2000m"
```
However, for this to work optimally (see the motivation below), the application in the container has to be aware of the limit and size its thread pools accordingly.
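For reference, the quota itself is readable from inside the container through the cgroup filesystem, so an application (or an entrypoint script) could derive a sensible thread count from it. A hedged illustration, assuming cgroup v2 (on cgroup v1 the equivalent files are cpu.cfs_quota_us and cpu.cfs_period_us) and a plain debian image:
```sh
docker run --rm --cpus=2 debian cat /sys/fs/cgroup/cpu.max
200000 100000
```
The first value is the CPU time quota in microseconds per period (the second value), i.e. 2 CPUs' worth of time here, even though the container still sees all host CPUs.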
Motivation
If the thread pools don't match the CPU limit, the container is throttled and performance drops far below expectations (6 times slower in the example below).
For instance, on my Core i3-8300H (4 cores, 8 threads), I'm evaluating performance with the following Apache Bench command (a request containing a single 17 KB text to be processed with the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model):
```
ab -k -n24 -c4 -p req-en-17k-b1-huggingface.json -T application/json localhost:18083/embed
```
| configuration | reqs/sec | avg. CPU usage (top) |
|---|---|---|
| no cpu limit | 16.87 | 465% |
| cpuset=0,1 | 11.48 | 185% |
| cpus=2 | 1.82 | 200% |
| cpus=2 + env vars | 11.03 | 150% |
You can see on line 3 that with cpus=2 (without environment variables) performance is 6 times slower than with cpuset=0,1.
The problem is that neither Kubernetes nor Docker Swarm allows the cpuset option.
You can see on line 4 that adding environment variables controlling the number of threads has a positive impact on performance (almost on par with cpuset=0,1).
no cpu limit configuration:
```yaml
version: '3.4'
services:
  multiminilml12v2:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    environment:
      - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      - NVIDIA_DISABLE_REQUIRE=1
      - RUST_BACKTRACE=full
      - JSON_OUTPUT=true
      - PORT=18083
      - MAX_BATCH_TOKENS=65536
      - MAX_CLIENT_BATCH_SIZE=1024
```
cpuset=0,1 configuration:
```yaml
version: '3.4'
services:
  multiminilml12v2:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    environment:
      - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      - NVIDIA_DISABLE_REQUIRE=1
      - RUST_BACKTRACE=full
      - JSON_OUTPUT=true
      - PORT=18083
      - MAX_BATCH_TOKENS=65536
      - MAX_CLIENT_BATCH_SIZE=1024
    cpuset: "0,1"
```
cpus=2 configuration:
```yaml
version: '3.4'
services:
  multiminilml12v2:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    environment:
      - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      - NVIDIA_DISABLE_REQUIRE=1
      - RUST_BACKTRACE=full
      - JSON_OUTPUT=true
      - PORT=18083
      - MAX_BATCH_TOKENS=65536
      - MAX_CLIENT_BATCH_SIZE=1024
    deploy:
      resources:
        limits:
          cpus: '2'
```
cpus=2 + env vars configuration:
```yaml
version: '3.4'
services:
  multiminilml12v2:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    environment:
      - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      - NVIDIA_DISABLE_REQUIRE=1
      - RUST_BACKTRACE=full
      - JSON_OUTPUT=true
      - PORT=18083
      - MAX_BATCH_TOKENS=65536
      - MAX_CLIENT_BATCH_SIZE=1024
      # interesting variables below
      - TOKIO_WORKER_THREADS=1
      - NUM_RAYON_THREADS=1
      - MKL_NUM_THREADS=1
      - MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1"
      - OMP_NUM_THREADS=1
      - MKL_DYNAMIC="FALSE"
      - OMP_DYNAMIC="FALSE"
    deploy:
      resources:
        limits:
          cpus: '2'
```
Your contribution
I'm afraid I can't do much more than this.
Please note that I don't really know which of my environment variables actually have an impact on performance, since I am totally unaware of the internals of text-embeddings-inference.
The issue is well known though: https://danluu.com/cgroup-throttling/ and https://nemre.medium.com/is-your-go-application-really-using-the-correct-number-of-cpu-cores-20915d2b6ccb
Some ecosystems have begun to take this into account.
For instance, since Python 3.13 you can make Python believe it has fewer CPUs using an environment variable:
```
docker run --rm -it --name py13 -e PYTHON_CPU_COUNT=2 python:3.13.0a4-slim python -c "import os; print(os.cpu_count())"
2
```
Java does this automatically since Java 15:
```
docker run --rm -it --name java23 --entrypoint /bin/bash openjdk:23-slim
root@31e4b2de8fad:/# jshell
jshell> System.out.println(Runtime.getRuntime().availableProcessors());
8
```
```
docker run --rm -it --name java23 --cpus=2 --entrypoint /bin/bash openjdk:23-slim
root@1935b08ebcf7:/# jshell
jshell> System.out.println(Runtime.getRuntime().availableProcessors());
2
```
Super interesting! Can you try:
```yaml
version: '3.4'
services:
  multiminilml12v2:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    environment:
      - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      - NVIDIA_DISABLE_REQUIRE=1
      - RUST_BACKTRACE=full
      - JSON_OUTPUT=true
      - PORT=18083
      - MAX_BATCH_TOKENS=65536
      - MAX_CLIENT_BATCH_SIZE=1024
      # interesting variables below
      - MKL_NUM_THREADS=1
      - MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1"
      - MKL_DYNAMIC="FALSE"
    deploy:
      resources:
        limits:
          cpus: '2'
```
?
TEI uses https://crates.io/crates/num_cpus internally, which correctly gets the number of CPUs. My guess is that only MKL is confused.
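For what it's worth, a quick way to see why thread pools sized from the visible CPU count get oversubscribed under a quota: --cpus only throttles CPU time and does not shrink the CPU set a process sees, whereas --cpuset-cpus does. A hedged illustration on an 8-thread host, assuming nproc is available in the image:
```sh
docker run --rm --cpus=2 debian nproc
8
docker run --rm --cpuset-cpus=0,1 debian nproc
2
```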
@OlivierDehaene thanks for your fast answer!
Your configuration is working fine:
| configuration | reqs/sec | avg. CPU usage (top) |
|---|---|---|
| no cpu limit | 16.87 | 465% |
| cpuset=0,1 | 11.48 | 185% |
| cpus=2 | 1.82 | 200% |
| cpus=2 + env vars | 11.03 | 150% |
| cpus=2 + your conf | 10.92 | 150% |
There is a slight difference but it doesn't mean anything (I kept the 11.03 value for the cpus=2 + env vars configuration, but its result was 10.90 today).
Ok then I'm not sure there is a lot that can be done here besides adding some documentation to explain this issue in the README/docs.
Do you think TEI can set those environment variables at the beginning of its startup phase?
Setting these values correctly would be really hard since they are MKL/runtime specific. Plus, they should be set before execution, so this implies creating a launcher script wrapping the TEI binary.
I think it's better to document the problem and let users find the best values for their specific env. It's more a design decision than anything.
> Setting these values correctly would be really hard since they are MKL/runtime specific. [...] I think it's better to document the problem and let users find the best values for their specific env.
So it would be great if the documentation could state a rule of thumb (that would be a sensible default) like this: for people running the CPU-based Docker image with Docker or Kubernetes CPU limits, set the environment variables below, replacing 1 with the smallest positive integer greater than or equal to the allocated CPU limit (if that's the correct rule of thumb, because I don't know the relationship between MKL_NUM_THREADS and MKL_DOMAIN_NUM_THREADS).
```
MKL_NUM_THREADS=1
MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1"
MKL_DYNAMIC="FALSE"
```
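Applied to the cpus: '2' limit from the configurations above, that rule of thumb would give something like the following sketch (only an illustration; whether MKL_DOMAIN_NUM_THREADS must mirror MKL_NUM_THREADS is exactly the open question above):
```sh
docker run --rm -p 18083:18083 --cpus=2 \
  -e MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 \
  -e PORT=18083 \
  -e MKL_NUM_THREADS=2 \
  -e MKL_DOMAIN_NUM_THREADS="MKL_BLAS=2" \
  -e MKL_DYNAMIC=FALSE \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
```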
> Plus, they should be set before execution, so this implies creating a launcher script wrapping the TEI binary.
TEI running on bare metal definitely doesn't require such a script, but since this repo includes the CPU-based Dockerfile, the CPU-based image would definitely be more user-friendly with a script :).
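Something along these lines could work as a wrapper entrypoint for the CPU image. This is only a sketch: it assumes cgroup v2 (/sys/fs/cgroup/cpu.max), that nproc is available as a fallback, and that the TEI binary is named text-embeddings-router.
```sh
#!/bin/sh
# Hypothetical entrypoint: size the MKL/OpenMP thread pools from the cgroup v2
# CPU quota before starting TEI. Falls back to the affinity mask (nproc) when
# no quota is set.
set -e

THREADS=$(nproc)
CPU_MAX=/sys/fs/cgroup/cpu.max   # assumption: cgroup v2; file format is "<quota> <period>"

if [ -r "$CPU_MAX" ]; then
  read -r quota period < "$CPU_MAX"
  if [ "$quota" != "max" ]; then
    # round the quota up to the next whole CPU
    THREADS=$(( (quota + period - 1) / period ))
  fi
fi

export MKL_NUM_THREADS="$THREADS"
export MKL_DOMAIN_NUM_THREADS="MKL_BLAS=$THREADS"
export MKL_DYNAMIC=FALSE
export OMP_NUM_THREADS="$THREADS"

exec text-embeddings-router "$@"
```
If users should still be able to override these variables explicitly, each export would need a guard such as `MKL_NUM_THREADS="${MKL_NUM_THREADS:-$THREADS}"`.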
This issue causes serious performance problems when running on servers with many CPUs.
Example: there are 256 CPUs in a multi-socket system, and a user wants to dedicate 8 CPUs to each text-embeddings-inference container in the system. What happens now is: every text-embeddings-inference instance creates 2 x 256 threads. When those threads are squeezed onto 8 CPUs, the whole service runs so slowly that it looks broken.
Worker thread pools need to be sized based on the allowed CPUs, not all CPUs in the system.
For logs, see Issue https://github.com/opea-project/GenAIExamples/issues/763