
Topology updater is failing to collect NUMA information

Status: Open. dittops opened this issue 7 months ago • 12 comments

What happened:

I installed NFD v0.17.3 using Helm. I want to get the NUMA node topology, so I enabled the topology updater during installation, but no NUMA details were added to the labels, even though lscpu shows multiple NUMA nodes. Here is the log:

sdp@fl4u42:~$ kubectl logs -f nfd-node-feature-discovery-topology-updater-g8wl7
I0430 12:06:36.208275       1 nfd-topology-updater.go:163] "Node Feature Discovery Topology Updater" version="v0.17.3" nodeName="fl4u42"
I0430 12:06:36.208337       1 component.go:34] [core]original dial target is: "/host-var/lib/kubelet-podresources/kubelet.sock"
I0430 12:06:36.208357       1 component.go:34] [core][Channel #1]Channel created
I0430 12:06:36.208371       1 component.go:34] [core][Channel #1]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"passthrough", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"//host-var/lib/kubelet-podresources/kubelet.sock", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}
I0430 12:06:36.208375       1 component.go:34] [core][Channel #1]Channel authority set to "%2Fhost-var%2Flib%2Fkubelet-podresources%2Fkubelet.sock"
I0430 12:06:36.208511       1 component.go:34] [core][Channel #1]Resolver state updated: {
  "Addresses": [
    {
      "Addr": "/host-var/lib/kubelet-podresources/kubelet.sock",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "/host-var/lib/kubelet-podresources/kubelet.sock",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned new addresses)
I0430 12:06:36.208535       1 component.go:34] [core][Channel #1]Channel switches to new LB policy "pick_first"
I0430 12:06:36.208562       1 component.go:34] [core][Channel #1 SubChannel #2]Subchannel created
I0430 12:06:36.208569       1 component.go:34] [core][Channel #1]Channel Connectivity change to CONNECTING
I0430 12:06:36.208577       1 component.go:34] [core][Channel #1]Channel exiting idle mode
2025/04/30 12:06:36 Connected to '"/host-var/lib/kubelet-podresources/kubelet.sock"'!
I0430 12:06:36.208679       1 component.go:34] [core][Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
I0430 12:06:36.208720       1 component.go:34] [core][Channel #1 SubChannel #2]Subchannel picks a new address "/host-var/lib/kubelet-podresources/kubelet.sock" to connect
I0430 12:06:36.208987       1 component.go:34] [core][Channel #1 SubChannel #2]Subchannel Connectivity change to READY
I0430 12:06:36.209010       1 component.go:34] [core][Channel #1]Channel Connectivity change to READY
I0430 12:06:36.209018       1 nfd-topology-updater.go:375] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-topology-updater.conf" config={"ExcludeList":null}
I0430 12:06:36.209061       1 podresourcesscanner.go:53] "watching all namespaces"
WARNING: failed to read int from file: open /host-sys/devices/system/node/node0/cpu0/online: no such file or directory
I0430 12:06:36.209247       1 metrics.go:44] "metrics server starting" port=":8081"
I0430 12:06:36.267613       1 component.go:34] [core][Server #4]Server created
I0430 12:06:36.267645       1 nfd-topology-updater.go:145] "gRPC health server serving" port=8082
I0430 12:06:36.267690       1 component.go:34] [core][Server #4 ListenSocket #5]ListenSocket created
I0430 12:07:36.217041       1 podresourcesscanner.go:137] "podFingerprint calculated" status=<
        > processing node ""
        > processing 15 pods
        + aibrix-system/aibrix-kuberay-operator-55f5ddcbf4-vqrwb
        + default/nfd-node-feature-discovery-worker-w5cvn
        + aibrix-system/aibrix-redis-master-7bff9b56f5-hs5k4
        + envoy-gateway-system/envoy-gateway-5bfc954ffc-k4tf7
        + kube-system/metrics-server-5985cbc9d7-vh9pb
        + aibrix-system/aibrix-controller-manager-6489d5b587-hj2bt
        + aibrix-system/aibrix-gateway-plugins-58bdc89d9c-q67pp
        + envoy-gateway-system/envoy-aibrix-system-aibrix-eg-903790dc-54766c9758-l68wh
        + kube-system/helm-install-traefik-crd-kz6kg
        + default/nfd-node-feature-discovery-topology-updater-g8wl7
        + kube-system/svclb-envoy-aibrix-system-aibrix-eg-903790dc-1f213b6c-fdvw4
        + aibrix-system/aibrix-gpu-optimizer-75df97858d-5zb5s
        + kube-system/helm-install-traefik-j89k5
        + aibrix-system/aibrix-metadata-service-66f45c85bc-k8pzx
        + kube-system/local-path-provisioner-5cf85fd84d-hgf67
        = pfp0v0011be09f6ff65dbfe0
 >
I0430 12:07:36.217093       1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-kuberay-operator-55f5ddcbf4-vqrwb"
I0430 12:07:36.217115       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-kuberay-operator-55f5ddcbf4-vqrwb"
I0430 12:07:36.223315       1 podresourcesscanner.go:148] "scanning pod" podName="nfd-node-feature-discovery-worker-w5cvn"
I0430 12:07:36.223325       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="nfd-node-feature-discovery-worker-w5cvn"
I0430 12:07:36.225915       1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-redis-master-7bff9b56f5-hs5k4"
I0430 12:07:36.225935       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-redis-master-7bff9b56f5-hs5k4"
I0430 12:07:36.228169       1 podresourcesscanner.go:148] "scanning pod" podName="envoy-gateway-5bfc954ffc-k4tf7"
I0430 12:07:36.228195       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="envoy-gateway-5bfc954ffc-k4tf7"
I0430 12:07:36.231774       1 podresourcesscanner.go:148] "scanning pod" podName="metrics-server-5985cbc9d7-vh9pb"
I0430 12:07:36.231788       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="metrics-server-5985cbc9d7-vh9pb"
I0430 12:07:36.233367       1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-controller-manager-6489d5b587-hj2bt"
I0430 12:07:36.233374       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-controller-manager-6489d5b587-hj2bt"
I0430 12:07:36.234769       1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-gateway-plugins-58bdc89d9c-q67pp"
I0430 12:07:36.234779       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-gateway-plugins-58bdc89d9c-q67pp"
I0430 12:07:36.236354       1 podresourcesscanner.go:148] "scanning pod" podName="envoy-aibrix-system-aibrix-eg-903790dc-54766c9758-l68wh"
I0430 12:07:36.236361       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="envoy-aibrix-system-aibrix-eg-903790dc-54766c9758-l68wh"
I0430 12:07:36.238011       1 podresourcesscanner.go:148] "scanning pod" podName="helm-install-traefik-crd-kz6kg"
I0430 12:07:36.238017       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="helm-install-traefik-crd-kz6kg"
I0430 12:07:36.239514       1 podresourcesscanner.go:148] "scanning pod" podName="nfd-node-feature-discovery-topology-updater-g8wl7"
I0430 12:07:36.239521       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="nfd-node-feature-discovery-topology-updater-g8wl7"
I0430 12:07:36.241754       1 podresourcesscanner.go:148] "scanning pod" podName="svclb-envoy-aibrix-system-aibrix-eg-903790dc-1f213b6c-fdvw4"
I0430 12:07:36.241760       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="svclb-envoy-aibrix-system-aibrix-eg-903790dc-1f213b6c-fdvw4"
I0430 12:07:36.422134       1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-gpu-optimizer-75df97858d-5zb5s"
I0430 12:07:36.422165       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-gpu-optimizer-75df97858d-5zb5s"
I0430 12:07:36.621889       1 podresourcesscanner.go:148] "scanning pod" podName="helm-install-traefik-j89k5"
I0430 12:07:36.621923       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="helm-install-traefik-j89k5"
I0430 12:07:36.821266       1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-metadata-service-66f45c85bc-k8pzx"
I0430 12:07:36.821294       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-metadata-service-66f45c85bc-k8pzx"
I0430 12:07:37.022025       1 podresourcesscanner.go:148] "scanning pod" podName="local-path-provisioner-5cf85fd84d-hgf67"
I0430 12:07:37.022057       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="local-path-provisioner-5cf85fd84d-hgf67"
I0430 12:07:37.432143       1 metrics.go:51] "stopping metrics server" port=":8081"
I0430 12:07:37.432207       1 metrics.go:45] "metrics server stopped" exitCode="http: Server closed"
E0430 12:07:37.432223       1 main.go:66] "error while running" err="failed to create NodeResourceTopology: the server could not find the requested resource (post noderesourcetopologies.topology.node.k8s.io)"

lscpu snippet

NUMA:                    
  NUMA node(s):          8
  NUMA node0 CPU(s):     0-13,112-125
  NUMA node1 CPU(s):     14-27,126-139
  NUMA node2 CPU(s):     28-41,140-153
  NUMA node3 CPU(s):     42-55,154-167
  NUMA node4 CPU(s):     56-69,168-181
  NUMA node5 CPU(s):     70-83,182-195
  NUMA node6 CPU(s):     84-97,196-209
  NUMA node7 CPU(s):     98-111,210-223

Environment:

  • Kubernetes version (use kubectl version): v1.31.3+k3s1
  • Cloud provider or hardware configuration: On-prem hardware, Intel(R) Xeon(R) Platinum 8480+, 512GB
  • OS (e.g: cat /etc/os-release): Ubuntu 23.04
  • Kernel (e.g. uname -a): 6.2.0-39-generic
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

dittops commented Apr 30 '25 12:04

/cc @PiotrProkop @ffromani

marquiz commented May 02 '25 08:05

@dittops what about the NRT resource? Have you checked whether the NUMA architecture is exposed there? I just did a quick test:

apiVersion: topology.node.k8s.io/v1alpha2
attributes:
- name: topologyManagerPolicy
  value: none
- name: topologyManagerScope
  value: container
- name: nodeTopologyPodsFingerprint
  value: pfp0v0014f8589198c824330
kind: NodeResourceTopology
metadata:
  creationTimestamp: "2025-07-10T10:13:05Z"
  generation: 7
  name: eseldb12u01
  ownerReferences:
  - apiVersion: v1
    kind: Namespace
    name: node-feature-discovery
    uid: 78a9cfdc-8033-4743-bcfd-c9b6be5ef65d
  resourceVersion: "453128"
  uid: a983b8da-9cc8-4284-8edb-36ab7ae17d6b
topologyPolicies:
- None
zones:
- costs:
  - name: node-0
    value: 10
  - name: node-1
    value: 21
  name: node-0
  resources:
  - allocatable: "0"
    available: "0"
    capacity: "64"
    name: cpu
  type: Node
- costs:
  - name: node-0
    value: 21
  - name: node-1
    value: 10
  name: node-1
  resources:
  - allocatable: "0"
    available: "0"
    capacity: "64"
    name: cpu
  type: Node
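
For anyone wanting to check this on their own cluster, the object can be listed and fetched directly (the resource name below is the plural form from the API group shown above; <node-name> is just a placeholder):

# <node-name> is a placeholder for your node's name
kubectl get noderesourcetopologies.topology.node.k8s.io
kubectl get noderesourcetopologies.topology.node.k8s.io <node-name> -o yaml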

fmuyassarov commented Jul 10 '25 10:07

and from the labels I can see NUMA set to true:

kubectl get nodes -oyaml | grep numa
      feature.node.kubernetes.io/memory-numa: "true"

which is set in https://github.com/kubernetes-sigs/node-feature-discovery/blob/bcf95d93885fbb8934a813b71d5172bc6f4ec371/source/memory/memory.go#L76 if NUMA is detected. If I'm not wrong, that's the only NUMA-related label that gets exposed; for more detail, such as how many NUMA nodes there are and which cores belong to each, you could look into the NRT resource as shown above.

fmuyassarov commented Jul 10 '25 10:07

This:

E0430 12:07:37.432223       1 main.go:66] "error while running" err="failed to create NodeResourceTopology: the server could not find the requested resource (post noderesourcetopologies.topology.node.k8s.io)"

seems to suggest the apiserver does not have the NodeResourceTopology CRD installed. Is that the case? If so, it could be a bug in the Helm chart (note that I'm highly speculating here).
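
A quick way to confirm would be to check whether the CRD exists at all, for example:

kubectl get crd noderesourcetopologies.topology.node.k8s.io

If that returns NotFound, the updater has nothing to post its NodeResourceTopology objects to, which would match the error above.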

ffromani commented Jul 10 '25 10:07

Oops, I missed that line. I tried to reproduce it by excluding the CRD from the installation and got a somewhat similar result.

I0710 10:52:25.401335       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="nfd-topology-updater-xqhww"
I0710 10:52:25.601461       1 podresourcesscanner.go:148] "scanning pod" podName="kube-apiserver-eseldb12u01"
I0710 10:52:25.601507       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="kube-apiserver-eseldb12u01"
I0710 10:52:25.801273       1 podresourcesscanner.go:148] "scanning pod" podName="calico-apiserver-c847cf5c7-8wvcq"
I0710 10:52:25.801307       1 podresourcesscanner.go:231] "pod doesn't have devices" podName="calico-apiserver-c847cf5c7-8wvcq"
I0710 10:52:26.291680       1 nfd-topology-updater.go:201] "http server stopped" exitCode="http: Server closed"
E0710 10:52:26.291813       1 main.go:62] "error while running" err="failed to create NodeResourceTopology: the server could not find the requested resource (post noderesourcetopologies.topology.node.k8s.io)"
[event: pod node-feature-discovery/nfd-topology-updater-xqhww] Container image "ttl.sh/gcr.io_k8s-staging-nfd_node-feature-discovery:tilt-c2820ad1181ffe6b" already present on machine
Detected container restart. Pod: nfd-topology-updater-xqhww. Container: nfd-topology-updater.

fmuyassarov commented Jul 10 '25 10:07

@dittops you could probably see from the pod listing that the topology updater is in an error state, right?

node-feature-discovery nfd-topology-updater-xqhww 0/1 Error 4 (50s ago)

If you used Helm, I believe the NRT CRD should get installed via https://github.com/kubernetes-sigs/node-feature-discovery/blob/master/deployment/helm/node-feature-discovery/templates/topologyupdater-crds.yaml. I wonder if you have done something that could have resulted in the CRD being deleted?

fmuyassarov commented Jul 10 '25 10:07

Also, have you set the createCRDs Helm flag to true?
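
If not, something along these lines should deploy both the updater and the CRD (only a sketch; topologyUpdater.enable and topologyUpdater.createCRDs are the value paths I'd expect here, so double-check them against the chart's values.yaml for your version):

# value paths under topologyUpdater.* assumed from the chart docs; verify against values.yaml
helm upgrade --install nfd node-feature-discovery \
  --repo https://kubernetes-sigs.github.io/node-feature-discovery/charts \
  --namespace node-feature-discovery --create-namespace \
  --set topologyUpdater.enable=true \
  --set topologyUpdater.createCRDs=true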

fmuyassarov commented Jul 10 '25 10:07

the source of the error is https://github.com/kubernetes-sigs/node-feature-discovery/blob/v0.17.3/pkg/nfd-topology-updater/nfd-topology-updater.go#L310

ffromani commented Jul 10 '25 11:07

Looks like we should address this one in the documentation and the logs(?)

marquiz commented Jul 24 '25 16:07

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented Oct 22 '25 17:10