[Bug] chatqna: xeon pipeline fails (serious performance drop) when CPU affinity of tei and teirerank containers is managed
Priority
P2-High
OS type
Ubuntu
Hardware type
Xeon-SPR
Installation method
- [X] Pull docker images from hub.docker.com
- [ ] Build docker images from source
Deploy method
- [ ] Docker compose
- [ ] Docker
- [X] Kubernetes
- [ ] Helm
Running nodes
Single Node
What's the version?
Observed with the latest chatqna.yaml (git 67394b88), where the tei and teirerank containers use image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.5
**ctr -n k8s.io images ls | grep text-embeddings**
ghcr.io/huggingface/text-embeddings-inference:cpu-1.5 application/vnd.oci.image.index.v1+json sha256:0502794a4d86974839e701dadd6d06e693ec78a0f6e87f68c391e88c52154f3f 48.2 MiB linux/amd64 io.cri-containerd.image=managed
ghcr.io/huggingface/text-embeddings-inference@sha256:0502794a4d86974839e701dadd6d06e693ec78a0f6e87f68c391e88c52154f3f application/vnd.oci.image.index.v1+json sha256:0502794a4d86974839e701dadd6d06e693ec78a0f6e87f68c391e88c52154f3f 48.2 MiB linux/amd64 io.cri-containerd.image=managed
Description
When CPU affinity is managed on a node (with NRI resource policies or the Kubernetes CPU manager) and ChatQnA/kubernetes/manifests/xeon/chatqna.yaml is created, the tei and teirerank containers do not handle their internal threading and thread-CPU affinities properly.
They seem to create a thread for every CPU in the system, whereas they should create a thread only for every CPU allowed for the container.
In the logs it looks like this:
**kubectl logs -n benchmark chatqna-teirerank-674b878d9c-sdkg9**
...
2024-09-06T07:10:06.082735Z INFO text_embeddings_router: router/src/lib.rs:241: Starting model backend
2024-09-06T07:10:06.095067Z WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for
thread: 80, index: 0, mask: {1, 65, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2024-09-06T07:10:06.095106Z WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for
thread: 81, index: 1, mask: {2, 66, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2024-09-06T07:10:06.095128Z WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for
thread: 82, index: 2, mask: {3, 67, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
...
2024-09-06T07:10:06.260526Z WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for
thread: 88, index: 8, mask: {9, 73, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2024-09-06T07:10:08.576066Z WARN text_embeddings_router: router/src/lib.rs:267: Backend does not support a batch size > 8
2024-09-06T07:10:08.576082Z WARN text_embeddings_router: router/src/lib.rs:268: forcing `max_batch_requests=8`
2024-09-06T07:10:08.576195Z WARN text_embeddings_router: router/src/lib.rs:319: Invalid hostname, defaulting to 0.0.0.0
2024-09-06T07:10:08.579399Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1778: Starting HTTP server: 0.0.0.0:2082
2024-09-06T07:10:08.579418Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1779: Ready
And at the process/thread CPU affinity level in the system it looks like this:
**grep Cpus_allowed_list /proc/2370247/task/2370*/status**
...
/proc/2370247/task/2370368/status:Cpus_allowed_list: 40-47
/proc/2370247/task/2370369/status:Cpus_allowed_list: 40-47
/proc/2370247/task/2370370/status:Cpus_allowed_list: 40-47
/proc/2370247/task/2370371/status:Cpus_allowed_list: 40-47
/proc/2370247/task/2370372/status:Cpus_allowed_list: 40-47
/proc/2370247/task/2370373/status:Cpus_allowed_list: 40
/proc/2370247/task/2370374/status:Cpus_allowed_list: 41
/proc/2370247/task/2370375/status:Cpus_allowed_list: 42
/proc/2370247/task/2370376/status:Cpus_allowed_list: 43
/proc/2370247/task/2370377/status:Cpus_allowed_list: 44
/proc/2370247/task/2370378/status:Cpus_allowed_list: 45
/proc/2370247/task/2370379/status:Cpus_allowed_list: 46
/proc/2370247/task/2370380/status:Cpus_allowed_list: 47
/proc/2370247/task/2370381/status:Cpus_allowed_list: 40-47
/proc/2370247/task/2370382/status:Cpus_allowed_list: 40-47
/proc/2370247/task/2370383/status:Cpus_allowed_list: 40-47
/proc/2370247/task/2370384/status:Cpus_allowed_list: 40-47
/proc/2370247/task/2370385/status:Cpus_allowed_list: 40-47
/proc/2370247/task/2370386/status:Cpus_allowed_list: 40-47
...
That is, only a few threads got correct CPU pinning; the rest (far too many of them) run on all CPUs allowed for the container. As a result, this destroys the performance of tei and teirerank on CPU.
From the log it looks like the ort library tries to create a thread and set affinity for every CPU in the system, while it should not use any CPUs other than those allowed (limited by cgroups cpuset.cpus). I cannot say whether the root cause is in the ort library or in how it is used here.
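For reference, one quick way to see the mismatch is to compare the CPU count of the whole system with the cpuset the container is actually allowed to use. A minimal sketch, assuming cgroup v2 and the deployment name implied by the pod name above:
nproc --all   # CPUs in the whole system, which ort appears to size its thread pool from
kubectl exec -n benchmark deploy/chatqna-teirerank -- cat /sys/fs/cgroup/cpuset.cpus.effective   # CPUs actually allowed for the container (on cgroup v1 the path is /sys/fs/cgroup/cpuset/cpuset.cpus)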
Reproduce steps
- Install the balloons NRI policy to manage CPUs.
helm repo add nri-plugins https://containers.github.io/nri-plugins
helm install balloons nri-plugins/nri-resource-policy-balloons --set patchRuntimeConfig=true
- Replace the default balloons configuration with one that runs tei/teirerank on dedicated CPUs.
cat > chatqna-balloons.yaml << EOF
apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: default
  namespace: kube-system
spec:
  allocatorTopologyBalancing: true
  balloonTypes:
  - name: tgi
    allocatorPriority: high
    minCPUs: 32
    minBalloons: 1
    preferNewBalloons: true
    hideHyperthreads: true
    matchExpressions:
    - key: name
      operator: Equals
      values: ["tgi"]
  - name: embedding
    allocatorPriority: high
    minCPUs: 16
    minBalloons: 2
    preferNewBalloons: true
    hideHyperthreads: true
    matchExpressions:
    - key: name
      operator: In
      values:
      - tei
      - teirerank
  - allocatorPriority: normal
    minCPUs: 14
    hideHyperthreads: false
    name: default
    namespaces:
    - "*"
  log:
    debug: ["policy"]
  pinCPU: true
  pinMemory: false
  reservedPoolNamespaces:
  - kube-system
  reservedResources:
    cpu: "2"
EOF
kubectl delete -n kube-system balloonspolicy default
kubectl create -n kube-system -f chatqna-balloons.yaml
- Deploy the chatqna yaml
kubectl create -f ChatQnA/kubernetes/manifests/xeon/chatqna.yaml
- Follow logs from chatqna-tei and chatqna-teirerank.
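For the last step, a minimal way to follow both logs (deployment names as created by chatqna.yaml; substitute the namespace used in your deployment):
kubectl logs -f deployment/chatqna-tei -n <namespace>
kubectl logs -f deployment/chatqna-teirerank -n <namespace>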
Raw log
No response
Some potentially relevant options from the TEI readme (https://github.com/huggingface/text-embeddings-inference/blob/main/README.md) that could be used when TEI containers run on CPUs:
--tokenization-workers <TOKENIZATION_WORKERS>
Optionally control the number of tokenizer workers used for payload tokenization, validation and truncation.
Default to the number of CPU cores on the machine
[env: TOKENIZATION_WORKERS=]
...
--max-concurrent-requests <MAX_CONCURRENT_REQUESTS>
The maximum amount of concurrent requests for this particular deployment.
Having a low limit will refuse clients requests instead of having them wait for too long and is usually good
to handle backpressure correctly
[env: MAX_CONCURRENT_REQUESTS=]
[default: 512]
Currently only the --auto-truncate option is used:
- https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/teirerank/templates/deployment.yaml#L52
- https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/tei/templates/deployment.yaml#L52
- https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/tei/templates/configmap.yaml
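For illustration, the environment-variable forms of these options could be tried on the running deployments without changing the charts; the deployment names, the benchmark namespace and the value 8 below are assumptions for a container limited to 8 CPUs:
kubectl set env -n benchmark deployment/chatqna-tei TOKENIZATION_WORKERS=8
kubectl set env -n benchmark deployment/chatqna-teirerank TOKENIZATION_WORKERS=8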
This bug blocks a proper ChatQnA platform optimization demo on Xeon.
@eero-t, thanks for the pointers.
Adjusting --tokenization-workers 8 dropped the thread count from 139 to 82 on my test system (128 vCPUs), but it did not affect pinning. --max-concurrent-requests had no effect whatsoever.
**kubectl logs -n akervine chatqna-tei-...**
When run with a limited number of tokenization workers, the log looks like this:
...
2024-09-09T06:21:46.741775Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:28: Starting 8 tokenization workers
2024-09-09T06:21:46.773151Z INFO text_embeddings_router: router/src/lib.rs:241: Starting model backend
2024-09-09T06:21:46.786104Z WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for thread: 26, index: 2, mask: {3, 67, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2024-09-09T06:21:46.786114Z WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for thread: 25, index: 1, mask: {2, 66, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2024-09-09T06:21:46.786115Z WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for thread: 24, index: 0, mask: {1, 65, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
...
while the corresponding lines without the tokenization-workers limit are:
...
2024-09-09T06:32:24.471354Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:28: Starting 64 tokenization workers
2024-09-09T06:32:24.708141Z INFO text_embeddings_router: router/src/lib.rs:241: Starting model backend
2024-09-09T06:32:24.720606Z WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for thread: 79, index: 0, mask: {1, 65, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2024-09-09T06:32:24.720644Z WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for thread: 80, index: 1, mask: {2, 66, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
...
It looks like there are two pools of threads: tokenization workers and the model backend. By default, both contain as many threads as there are physical CPU cores in the whole system. Each model backend thread tries to set its CPU affinity to both hyperthreads of one physical core (in the output above, "mask: {1, 65}" are hyperthreads of the same core). Obviously only a few succeed: those that happen to request affinity to CPUs included in the container's allowed set. The threads in the tokenization worker pool get no CPU affinity at all.
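To make the two pools visible on the host, the router's threads can be grouped by their allowed CPUs; a sketch, substituting the actual text-embeddings-router PID for the placeholder:
grep -h Cpus_allowed_list /proc/<tei-pid>/task/*/status | sort | uniq -c | sort -rn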
@yinghu5, @yongfengdu, I think this issue is not limited to Kubernetes. The same problem is expected when using docker with --cpuset-cpus.
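An untested sketch for reproducing this with plain docker, restricting the container to a CPU subset; the model id is only an example:
docker run --rm --cpuset-cpus 40-47 ghcr.io/huggingface/text-embeddings-inference:cpu-1.5 --model-id BAAI/bge-base-en-v1.5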
Opened two more precisely targeted bug reports against text-embeddings-inference, because the above issue has been written up as a feature request. For comparison, this bug does not exist in text-generation-inference.
Links to the issues:
- Tokenizer threads: https://github.com/huggingface/text-embeddings-inference/issues/404
- Model backend threads: https://github.com/huggingface/text-embeddings-inference/issues/405
Did you find any ENV/parameter settings that can work around this? If there is a workaround, we can implement it in the Helm chart before the upstream fixes land. This is what was mentioned in that issue, but I am not sure whether it works:
- MKL_NUM_THREADS=1
- MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1"
- MKL_DYNAMIC="FALSE"
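If those help, something like the following could inject them into the running deployment for a quick test (deployment name assumed):
kubectl set env deployment/chatqna-tei MKL_NUM_THREADS=1 MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1" MKL_DYNAMIC=FALSE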
Yes, I tried those, but did not see any effect. There were still too many threads and CPU affinity errors.
(Even if there were a workaround that dropped the thread count down to 1, it would also drop inference performance to a fraction of what it could be when using all allowed CPUs...)
Upstream fixed the related issues with https://github.com/huggingface/text-embeddings-inference/pull/410 (exactly 2 weeks ago). I assume this can be closed once there is a new TEI release and OPEA has been updated to it.
Closing since upstream fixed the related issues.
From the upstream releases page: https://github.com/huggingface/text-embeddings-inference/releases
One can see that the fixes are in 1.5.1 and later releases, the latest being 1.6.0.
However, the OPEA GenAIExamples repo still seems to be stuck at 1.5:
$ git grep text-embeddings-inference: | grep -F -e 1.5 | wc -l
68
$ git grep text-embeddings-inference: | grep -F -e 1.5. -e 1.6 | wc -l
0
Or, if you're saying that 1.5 is a floating tag that gets updated to the latest point release, then it needs an image pull policy that checks for new versions even when an image is already present locally.
The Docker Compose spec includes a pull_policy attribute, but that is not used by GenAIExamples:
$ git grep pull_policy | wc -l
0
And the spec does not say what the default is: https://github.com/compose-spec/compose-spec/blob/main/spec.md#pull_policy
Then there are also the GenAIInfra Helm charts, which use the Kubernetes default policy: https://github.com/opea-project/GenAIInfra/pull/587
That default is to always pull the latest tag, and other tags only if the image is not already present. I.e. this would break randomly, depending on whether some node had pulled the earlier 1.5 tag (mapping to the 1.5.0 image).
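For illustration, a Compose-level mitigation could be either pinning an explicit point release or forcing a registry check on every deploy; the service name and the tag below are assumptions, not confirmed against the upstream registry:
services:
  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.6   # assumed tag containing the upstream fix
    pull_policy: always   # or keep the floating tag and force compose to re-check the registry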