
node-feature-discovery of gpu-operator sends excessive LIST requests to the API server

Open jslouisyou opened this issue 1 year ago • 3 comments

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04.4 LTS
  • Kernel Version: 5.4.0-113-generic
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd://1.5.8
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): k8s v1.21.6
  • GPU Operator Version: v23.3.2

2. Issue or feature description

Hello, NVIDIA gpu-operator team.

I'm not sure if it's appropriate to post this issue here, since node-feature-discovery is maintained by kubernetes-sigs. If you think it's not suitable to post it here, please let me know.

Recently I got several alerts from my K8s cluster indicating that the API server takes a very long time to serve LIST requests from gpu-operator. Here are the alert and the rule that I'm using:

  • Alert:
Long API server 99%-tile Latency
LIST: 29.90 seconds while nfd.k8s-sigs.io/v1alpha1/nodefeatures request.
  • Rule: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{subresource!~"(log|exec|portforward|proxy)",verb!~"^(?:CONNECT|WATCHLIST|WATCH)$"} [10m])) WITHOUT (instance)) > 10
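To narrow down which verbs dominate the traffic against the nodefeatures resource, I also put together a small helper recording rule of my own. This is not part of the alert above; it assumes the standard apiserver_request_total metric with its group/resource/verb labels, and the rule name is just something I made up:

# My own helper rule, separate from the alert above. It records the request
# rate against the nodefeatures resource broken down by verb, so GET and LIST
# traffic from the NFD workers can be compared directly.
groups:
  - name: nfd-apiserver-breakdown
    rules:
      - record: nfd:apiserver_request_rate_by_verb
        expr: |
          sum(rate(apiserver_request_total{group="nfd.k8s-sigs.io",resource="nodefeatures"}[5m])) by (verb)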

I also found that all gpu-operator-node-feature-discovery-worker pods send GET requests to the API server to query the nodefeatures resource (I assume the pods need this to get information about node labels). Here is part of the audit log:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Request","auditID":"df926f36-8c1f-488e-ac88-11690e24660a","stage":"ResponseComplete","requestURI":"/apis/nfd.k8s-sigs.io/v1alpha1/namespaces/gpu-operator/nodefeatures/sra100-033","verb":"get","user":{"username":"system:serviceaccount:gpu-operator:node-feature-discovery","uid":"da2306ea-536f-455d-bf18-817299dd5489","groups":["system:serviceaccounts","system:serviceaccounts:gpu-operator","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["gpu-operator-node-feature-discovery-worker-49qq6"],"authentication.kubernetes.io/pod-uid":["65dfb997-221e-4a5c-92df-7ff111ea6137"]}},"sourceIPs":["75.17.103.53"],"userAgent":"nfd-worker/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"nodefeatures","namespace":"gpu-operator","name":"sra100-033","apiGroup":"nfd.k8s-sigs.io","apiVersion":"v1alpha1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2024-08-07T01:35:20.355504Z","stageTimestamp":"2024-08-07T01:35:20.676700Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"gpu-operator-node-feature-discovery\" of ClusterRole \"gpu-operator-node-feature-discovery\" to ServiceAccount \"node-feature-discovery/gpu-operator\""}}

It seems strange that LIST requests take this long when my k8s cluster only has 300 GPU nodes, and I also don't understand why the node-feature-discovery-worker pods send a GET request every minute.

Do you have any information about this problem? If there are any parameters that could be tuned, or if you could share any ideas, I would be very grateful.
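For what it's worth, the NFD worker's discovery interval looks configurable. If I understand the charts correctly, a gpu-operator Helm values override along the following lines should raise the worker's sleepInterval; the node-feature-discovery subchart name and the worker.config passthrough are my assumption from the docs, not something I have verified yet:

# Assumed gpu-operator Helm values override (not verified on my cluster):
# pass an nfd-worker config through the node-feature-discovery subchart.
node-feature-discovery:
  worker:
    config:
      core:
        # Default is 60s; raising it reduces how often each worker re-runs
        # feature discovery and touches its NodeFeature object.
        sleepInterval: 300s

If that works as I expect, the per-node request rate against the API server should drop roughly in proportion to the longer interval.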

Thanks!

3. Steps to reproduce the issue

4. Information to attach (optional if deemed irrelevant)

  • Logs from the gpu-operator-node-feature-discovery-worker pods, showing a discovery cycle every minute (see the note after the excerpt)
I0727 13:17:09.124600       1 nfd-worker.go:459] starting feature discovery...
I0727 13:17:09.125214       1 nfd-worker.go:471] feature discovery completed
I0727 13:18:09.255614       1 local.go:115] starting hooks...
I0727 13:18:09.446418       1 nfd-worker.go:459] starting feature discovery...
I0727 13:18:09.446946       1 nfd-worker.go:471] feature discovery completed
I0727 13:19:09.575466       1 local.go:115] starting hooks...
I0727 13:19:09.858354       1 nfd-worker.go:459] starting feature discovery...
I0727 13:19:09.858914       1 nfd-worker.go:471] feature discovery completed
I0727 13:20:10.025155       1 local.go:115] starting hooks...
.... and so on
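The one-minute cadence above matches NFD's documented default worker sleepInterval of 60s. As a rough sketch of where that value lives (the ConfigMap name is my guess; I haven't pulled it from the cluster), the deployed worker config would look something like:

apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-operator-node-feature-discovery-worker-conf   # name assumed, not verified
  namespace: gpu-operator
data:
  nfd-worker.conf: |
    core:
      sleepInterval: 60s   # default; drives the per-minute discovery cycles in the log above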

jslouisyou (Aug 07 '24 02:08)

@ArangoGutierrez

cdesiniotis (Aug 07 '24 14:08)

Hello, sorry to bother you, but are there any updates on this issue?

jslouisyou (Sep 30 '24 00:09)

Please raise an issue in https://github.com/kubernetes-sigs/node-feature-discovery

tariq1890 (Sep 30 '24 07:09)

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] (Nov 04 '25 22:11)

Closing this one, as it should be filed against https://github.com/kubernetes-sigs/node-feature-discovery and is not related to gpu-operator itself.

rajathagasthya (Nov 13 '25 00:11)