
Node Feature Discovery showing thousands of extra node features

Open • pbundac opened this issue 1 year ago • 1 comment

1. Quick Debug Information

  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): EKS
  • GPU Operator Version: v23.6.1

2. Issue or feature description

When we run kubectl get nodefeature in our GPU Operator namespace, it returns more than 4,000 entries. However, we have only about 70 nodes in the cluster, and only 9 of them are GPU nodes. We're not sure where these 4,000 entries come from or whether it's safe to delete them.

3. Steps to reproduce the issue

N/A - We're not sure where these entries are coming from.

4. Information to attach (optional if deemed irrelevant)

❯ kubectl get nodefeature -n gpu-operator | grep -vi NAME | wc -l
    4176
❯ kubectl get nodes | grep -vi NAME | wc -l
      73

pbundac, Jan 10 '24 21:01

Hi @pbundac, NFD introduced garbage collection of stale nodefeature objects starting with NFD v0.14.0. You are using GPU Operator v23.6.1, which uses NFD v0.13.1: https://github.com/NVIDIA/gpu-operator/blob/dd2fe9c9be315f61ab7012041d44b7a571a2bd4c/deployments/gpu-operator/Chart.yaml#L22

We moved to NFD v0.14.2 starting with GPU Operator v23.9.0. Upgrading the operator to >=v23.9.0 should remove these stale nodefeatures.

cdesiniotis, Jan 25 '24 19:01
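
For anyone who wants to see which nodefeature objects are stale before upgrading, the comparison described above (nodefeature objects versus live nodes) could be sketched roughly as follows. This is only an illustrative sketch, not something from the thread: it assumes each nodefeature object carries the nfd.node.kubernetes.io/node-name label and falls back to the object's own name when the label is missing. The helper names find_stale, kubectl_items and list_stale are hypothetical. It only lists candidates and deletes nothing.

```python
import json
import subprocess

# Assumption: NFD labels each NodeFeature object with the node it describes.
NODE_LABEL = "nfd.node.kubernetes.io/node-name"


def find_stale(nodefeature_to_node, live_nodes):
    """Return sorted names of NodeFeature objects whose node is not live."""
    live = set(live_nodes)
    return sorted(
        name for name, node in nodefeature_to_node.items() if node not in live
    )


def kubectl_items(*args):
    """Run `kubectl <args> -o json` and return the parsed `items` list."""
    result = subprocess.run(
        ["kubectl", *args, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)["items"]


def list_stale(namespace="gpu-operator"):
    """Print NodeFeature objects in `namespace` that match no current node."""
    nodes = [n["metadata"]["name"] for n in kubectl_items("get", "nodes")]
    mapping = {}
    for nf in kubectl_items("get", "nodefeature", "-n", namespace):
        meta = nf["metadata"]
        labels = meta.get("labels") or {}
        # Fall back to the object name if the label is absent (assumption).
        mapping[meta["name"]] = labels.get(NODE_LABEL, meta["name"])
    for name in find_stale(mapping, nodes):
        print(name)
```

Calling list_stale() against the cluster prints one candidate per line. The output is meant for inspection only; upgrading to an NFD version with garbage collection, as suggested above, remains the route the maintainers recommend for cleanup.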