text-embeddings-inference
Support of Docker/Kubernetes CPU limit/reservation
Feature request
Docker (swarm) and Kubernetes have a way to limit CPU usage of a container.
Docker (swarm):
```yaml
version: '3.4'
services:
  text-embeddings:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-0.6
    deploy:
      resources:
        limits:
          cpus: '2'
```
Kubernetes:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: text-embeddings
spec:
  containers:
    - name: text-embeddings
      image: ghcr.io/huggingface/text-embeddings-inference:cpu-0.6
      resources:
        requests:
          cpu: "2000m"
        limits:
          cpu: "2000m"
```
However, for this to work optimally (see the motivation below), the application in the container has to be aware of the limit and size its thread pools accordingly.
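For reference, the quota itself is readable from inside the container through the cgroup filesystem, so an application (or an entrypoint script) could derive a sensible thread count from it. A hedged illustration, assuming cgroup v2 (on cgroup v1 the equivalent files are cpu.cfs_quota_us and cpu.cfs_period_us) and a plain debian image:
```sh
docker run --rm --cpus=2 debian cat /sys/fs/cgroup/cpu.max
200000 100000
```
The first value is the CPU time quota in microseconds per period (the second value), i.e. 2 CPUs' worth of time here, even though the container still sees all host CPUs.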
Motivation
If the thread pools don't match the CPU limit, the container is throttled and performance drops far below expectations (6 times slower in the example below).
For instance, on my Core i3-8300H (4 cores, 8 threads), I'm evaluating performance with the following Apache Bench command (a request containing a single 17 KB text to be processed with the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model):
```
ab -k -n24 -c4 -p req-en-17k-b1-huggingface.json -T application/json localhost:18083/embed
```
| configuration | reqs/sec | avg. CPU usage (top) |
|---|---|---|
| no cpu limit | 16.87 | 465% |
| cpuset=0,1 | 11.48 | 185% |
| cpus=2 | 1.82 | 200% |
| cpus=2 + env vars | 11.03 | 150% |
You can see on line 3 that with cpus=2 (without environment variables) performance is 6 times slower than with cpuset=0,1.
The problem is that neither Kubernetes nor Docker Swarm allows the cpuset option.
You can see on line 4 that adding environment variables controlling the number of threads has a positive impact on performance (almost on par with cpuset=0,1).
no cpu limit configuration:
```yaml
version: '3.4'
services:
  multiminilml12v2:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    environment:
      - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      - NVIDIA_DISABLE_REQUIRE=1
      - RUST_BACKTRACE=full
      - JSON_OUTPUT=true
      - PORT=18083
      - MAX_BATCH_TOKENS=65536
      - MAX_CLIENT_BATCH_SIZE=1024
```
cpuset=0,1 configuration:
```yaml
version: '3.4'
services:
  multiminilml12v2:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    environment:
      - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      - NVIDIA_DISABLE_REQUIRE=1
      - RUST_BACKTRACE=full
      - JSON_OUTPUT=true
      - PORT=18083
      - MAX_BATCH_TOKENS=65536
      - MAX_CLIENT_BATCH_SIZE=1024
    cpuset: "0,1"
```
cpus=2 configuration:
```yaml
version: '3.4'
services:
  multiminilml12v2:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    environment:
      - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      - NVIDIA_DISABLE_REQUIRE=1
      - RUST_BACKTRACE=full
      - JSON_OUTPUT=true
      - PORT=18083
      - MAX_BATCH_TOKENS=65536
      - MAX_CLIENT_BATCH_SIZE=1024
    deploy:
      resources:
        limits:
          cpus: '2'
```
cpus=2 + env vars configuration:
```yaml
version: '3.4'
services:
  multiminilml12v2:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    environment:
      - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      - NVIDIA_DISABLE_REQUIRE=1
      - RUST_BACKTRACE=full
      - JSON_OUTPUT=true
      - PORT=18083
      - MAX_BATCH_TOKENS=65536
      - MAX_CLIENT_BATCH_SIZE=1024
      # interesting variables below
      - TOKIO_WORKER_THREADS=1
      - NUM_RAYON_THREADS=1
      - MKL_NUM_THREADS=1
      - MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1"
      - OMP_NUM_THREADS=1
      - MKL_DYNAMIC="FALSE"
      - OMP_DYNAMIC="FALSE"
    deploy:
      resources:
        limits:
          cpus: '2'
```
Your contribution
I'm afraid I can't do much more than this.
Please note that I don't really know which of my environment variables actually have an impact on performance, since I am totally unaware of the internals of text-embeddings-inference.
The issue is well known though: https://danluu.com/cgroup-throttling/ and https://nemre.medium.com/is-your-go-application-really-using-the-correct-number-of-cpu-cores-20915d2b6ccb
Some ecosystems have begun to take this into account.
For instance, since Python 3.13 you can make Python believe it has fewer CPUs using an environment variable:
```
docker run --rm -it --name py13 -e PYTHON_CPU_COUNT=2 python:3.13.0a4-slim python -c "import os; print(os.cpu_count())"
2
```
Java does this automatically since Java 15:
```
docker run --rm -it --name java23 --entrypoint /bin/bash openjdk:23-slim
root@31e4b2de8fad:/# jshell
jshell> System.out.println(Runtime.getRuntime().availableProcessors());
8
```
```
docker run --rm -it --name java23 --cpus=2 --entrypoint /bin/bash openjdk:23-slim
root@1935b08ebcf7:/# jshell
jshell> System.out.println(Runtime.getRuntime().availableProcessors());
2
```
Super interesting! Can you try:
```yaml
version: '3.4'
services:
  multiminilml12v2:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    environment:
      - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      - NVIDIA_DISABLE_REQUIRE=1
      - RUST_BACKTRACE=full
      - JSON_OUTPUT=true
      - PORT=18083
      - MAX_BATCH_TOKENS=65536
      - MAX_CLIENT_BATCH_SIZE=1024
      # interesting variables below
      - MKL_NUM_THREADS=1
      - MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1"
      - MKL_DYNAMIC="FALSE"
    deploy:
      resources:
        limits:
          cpus: '2'
```
?
TEI uses https://crates.io/crates/num_cpus internally, which correctly gets the number of CPUs. My guess is that only MKL is confused.
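For what it's worth, a quick way to see why thread pools sized from the visible CPU count get oversubscribed under a quota: --cpus only throttles CPU time and does not shrink the CPU set a process sees, whereas --cpuset-cpus does. A hedged illustration on an 8-thread host, assuming nproc is available in the image:
```sh
docker run --rm --cpus=2 debian nproc
8
docker run --rm --cpuset-cpus=0,1 debian nproc
2
```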
@OlivierDehaene thanks for your fast answer!
Your configuration is working fine:
| configuration | reqs/sec | avg. CPU usage (top) |
|---|---|---|
| no cpu limit | 16.87 | 465% |
| cpuset=0,1 | 11.48 | 185% |
| cpus=2 | 1.82 | 200% |
| cpus=2 + env vars | 11.03 | 150% |
| cpus=2 + your conf | 10.92 | 150% |
There is a slight difference but it doesn't mean anything (I kept the 11.03 value for the cpus=2 + env vars configuration, but its result was 10.90 today).
Ok then I'm not sure there is a lot that can be done here besides adding some documentation to explain this issue in the README/docs.
Do you think TEI can set those environment variables at the beginning of its startup phase?
Setting these values correctly would be really hard since they are MKL/runtime specific. Plus, they should be set before execution, so this implies creating a launcher script wrapping the TEI binary.
I think it's better to document the problem and let users find the best values for their specific env. It's more a design decision than anything.
> Setting these values correctly would be really hard since they are MKL/runtime specific. [...] I think it's better to document the problem and let users find the best values for their specific env.
So it would be great if the documentation could state a rule of thumb (that would be a sensible default) like this: for people running the CPU-based Docker image with Docker or Kubernetes CPU limits, set the environment variables below, replacing 1 with the smallest positive integer greater than or equal to the allocated CPU limit (if that's the correct rule of thumb, because I don't know the relationship between MKL_NUM_THREADS and MKL_DOMAIN_NUM_THREADS).
```
MKL_NUM_THREADS=1
MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1"
MKL_DYNAMIC="FALSE"
```
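Applied to the cpus: '2' limit from the configurations above, that rule of thumb would give something like the following sketch (only an illustration; whether MKL_DOMAIN_NUM_THREADS must mirror MKL_NUM_THREADS is exactly the open question above):
```sh
docker run --rm -p 18083:18083 --cpus=2 \
  -e MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 \
  -e PORT=18083 \
  -e MKL_NUM_THREADS=2 \
  -e MKL_DOMAIN_NUM_THREADS="MKL_BLAS=2" \
  -e MKL_DYNAMIC=FALSE \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
```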
> Plus, they should be set before execution, so this implies creating a launcher script wrapping the TEI binary.
TEI running on bare metal definitely doesn't require such a script, but since this repo includes the CPU-based Dockerfile, the CPU-based image would definitely be more user-friendly with a script :).
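Something along these lines could work as a wrapper entrypoint for the CPU image. This is only a sketch: it assumes cgroup v2 (/sys/fs/cgroup/cpu.max), that nproc is available as a fallback, and that the TEI binary is named text-embeddings-router.
```sh
#!/bin/sh
# Hypothetical entrypoint: size the MKL/OpenMP thread pools from the cgroup v2
# CPU quota before starting TEI. Falls back to the affinity mask (nproc) when
# no quota is set.
set -e

THREADS=$(nproc)
CPU_MAX=/sys/fs/cgroup/cpu.max   # assumption: cgroup v2; file format is "<quota> <period>"

if [ -r "$CPU_MAX" ]; then
  read -r quota period < "$CPU_MAX"
  if [ "$quota" != "max" ]; then
    # round the quota up to the next whole CPU
    THREADS=$(( (quota + period - 1) / period ))
  fi
fi

export MKL_NUM_THREADS="$THREADS"
export MKL_DOMAIN_NUM_THREADS="MKL_BLAS=$THREADS"
export MKL_DYNAMIC=FALSE
export OMP_NUM_THREADS="$THREADS"

exec text-embeddings-router "$@"
```
If users should still be able to override these variables explicitly, each export would need a guard such as `MKL_NUM_THREADS="${MKL_NUM_THREADS:-$THREADS}"`.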
This issue causes serious performance problems when running on servers with many CPUs.
Example: there are 256 CPUs in a multi-socket system, and a user wants to dedicate 8 CPUs to each text-embeddings-inference container in the system. What happens now is: every text-embeddings-inference instance creates 2 x 256 threads. When those threads are squeezed onto 8 CPUs, the whole service runs so slowly that it looks broken.
Worker thread pools need to be sized based on the allowed CPUs, not all CPUs in the system.
For logs, see Issue https://github.com/opea-project/GenAIExamples/issues/763