Topology updater is failing to collect NUMA information
What happened:
I installed NFD v0.17.3 using Helm. I want the NUMA node topology, so I enabled the topology updater during installation, but no NUMA details were added to the node labels, even though lscpu shows multiple NUMA nodes on this machine.
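For reference, roughly the install I used; this is a minimal sketch, assuming the documented chart repository and an example release name (exact names and values may differ on your side):

helm repo add nfd https://kubernetes-sigs.github.io/node-feature-discovery/charts
helm install nfd nfd/node-feature-discovery \
  --namespace node-feature-discovery --create-namespace \
  --set topologyUpdater.enable=true

Here is the log from the topology-updater pod: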
sdp@fl4u42:~$ kubectl logs -f nfd-node-feature-discovery-topology-updater-g8wl7
I0430 12:06:36.208275 1 nfd-topology-updater.go:163] "Node Feature Discovery Topology Updater" version="v0.17.3" nodeName="fl4u42"
I0430 12:06:36.208337 1 component.go:34] [core]original dial target is: "/host-var/lib/kubelet-podresources/kubelet.sock"
I0430 12:06:36.208357 1 component.go:34] [core][Channel #1]Channel created
I0430 12:06:36.208371 1 component.go:34] [core][Channel #1]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"passthrough", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"//host-var/lib/kubelet-podresources/kubelet.sock", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}
I0430 12:06:36.208375 1 component.go:34] [core][Channel #1]Channel authority set to "%2Fhost-var%2Flib%2Fkubelet-podresources%2Fkubelet.sock"
I0430 12:06:36.208511 1 component.go:34] [core][Channel #1]Resolver state updated: {
"Addresses": [
{
"Addr": "/host-var/lib/kubelet-podresources/kubelet.sock",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "/host-var/lib/kubelet-podresources/kubelet.sock",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} (resolver returned new addresses)
I0430 12:06:36.208535 1 component.go:34] [core][Channel #1]Channel switches to new LB policy "pick_first"
I0430 12:06:36.208562 1 component.go:34] [core][Channel #1 SubChannel #2]Subchannel created
I0430 12:06:36.208569 1 component.go:34] [core][Channel #1]Channel Connectivity change to CONNECTING
I0430 12:06:36.208577 1 component.go:34] [core][Channel #1]Channel exiting idle mode
2025/04/30 12:06:36 Connected to '"/host-var/lib/kubelet-podresources/kubelet.sock"'!
I0430 12:06:36.208679 1 component.go:34] [core][Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
I0430 12:06:36.208720 1 component.go:34] [core][Channel #1 SubChannel #2]Subchannel picks a new address "/host-var/lib/kubelet-podresources/kubelet.sock" to connect
I0430 12:06:36.208987 1 component.go:34] [core][Channel #1 SubChannel #2]Subchannel Connectivity change to READY
I0430 12:06:36.209010 1 component.go:34] [core][Channel #1]Channel Connectivity change to READY
I0430 12:06:36.209018 1 nfd-topology-updater.go:375] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-topology-updater.conf" config={"ExcludeList":null}
I0430 12:06:36.209061 1 podresourcesscanner.go:53] "watching all namespaces"
WARNING: failed to read int from file: open /host-sys/devices/system/node/node0/cpu0/online: no such file or directory
I0430 12:06:36.209247 1 metrics.go:44] "metrics server starting" port=":8081"
I0430 12:06:36.267613 1 component.go:34] [core][Server #4]Server created
I0430 12:06:36.267645 1 nfd-topology-updater.go:145] "gRPC health server serving" port=8082
I0430 12:06:36.267690 1 component.go:34] [core][Server #4 ListenSocket #5]ListenSocket created
I0430 12:07:36.217041 1 podresourcesscanner.go:137] "podFingerprint calculated" status=<
> processing node ""
> processing 15 pods
+ aibrix-system/aibrix-kuberay-operator-55f5ddcbf4-vqrwb
+ default/nfd-node-feature-discovery-worker-w5cvn
+ aibrix-system/aibrix-redis-master-7bff9b56f5-hs5k4
+ envoy-gateway-system/envoy-gateway-5bfc954ffc-k4tf7
+ kube-system/metrics-server-5985cbc9d7-vh9pb
+ aibrix-system/aibrix-controller-manager-6489d5b587-hj2bt
+ aibrix-system/aibrix-gateway-plugins-58bdc89d9c-q67pp
+ envoy-gateway-system/envoy-aibrix-system-aibrix-eg-903790dc-54766c9758-l68wh
+ kube-system/helm-install-traefik-crd-kz6kg
+ default/nfd-node-feature-discovery-topology-updater-g8wl7
+ kube-system/svclb-envoy-aibrix-system-aibrix-eg-903790dc-1f213b6c-fdvw4
+ aibrix-system/aibrix-gpu-optimizer-75df97858d-5zb5s
+ kube-system/helm-install-traefik-j89k5
+ aibrix-system/aibrix-metadata-service-66f45c85bc-k8pzx
+ kube-system/local-path-provisioner-5cf85fd84d-hgf67
= pfp0v0011be09f6ff65dbfe0
>
I0430 12:07:36.217093 1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-kuberay-operator-55f5ddcbf4-vqrwb"
I0430 12:07:36.217115 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-kuberay-operator-55f5ddcbf4-vqrwb"
I0430 12:07:36.223315 1 podresourcesscanner.go:148] "scanning pod" podName="nfd-node-feature-discovery-worker-w5cvn"
I0430 12:07:36.223325 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="nfd-node-feature-discovery-worker-w5cvn"
I0430 12:07:36.225915 1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-redis-master-7bff9b56f5-hs5k4"
I0430 12:07:36.225935 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-redis-master-7bff9b56f5-hs5k4"
I0430 12:07:36.228169 1 podresourcesscanner.go:148] "scanning pod" podName="envoy-gateway-5bfc954ffc-k4tf7"
I0430 12:07:36.228195 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="envoy-gateway-5bfc954ffc-k4tf7"
I0430 12:07:36.231774 1 podresourcesscanner.go:148] "scanning pod" podName="metrics-server-5985cbc9d7-vh9pb"
I0430 12:07:36.231788 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="metrics-server-5985cbc9d7-vh9pb"
I0430 12:07:36.233367 1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-controller-manager-6489d5b587-hj2bt"
I0430 12:07:36.233374 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-controller-manager-6489d5b587-hj2bt"
I0430 12:07:36.234769 1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-gateway-plugins-58bdc89d9c-q67pp"
I0430 12:07:36.234779 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-gateway-plugins-58bdc89d9c-q67pp"
I0430 12:07:36.236354 1 podresourcesscanner.go:148] "scanning pod" podName="envoy-aibrix-system-aibrix-eg-903790dc-54766c9758-l68wh"
I0430 12:07:36.236361 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="envoy-aibrix-system-aibrix-eg-903790dc-54766c9758-l68wh"
I0430 12:07:36.238011 1 podresourcesscanner.go:148] "scanning pod" podName="helm-install-traefik-crd-kz6kg"
I0430 12:07:36.238017 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="helm-install-traefik-crd-kz6kg"
I0430 12:07:36.239514 1 podresourcesscanner.go:148] "scanning pod" podName="nfd-node-feature-discovery-topology-updater-g8wl7"
I0430 12:07:36.239521 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="nfd-node-feature-discovery-topology-updater-g8wl7"
I0430 12:07:36.241754 1 podresourcesscanner.go:148] "scanning pod" podName="svclb-envoy-aibrix-system-aibrix-eg-903790dc-1f213b6c-fdvw4"
I0430 12:07:36.241760 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="svclb-envoy-aibrix-system-aibrix-eg-903790dc-1f213b6c-fdvw4"
I0430 12:07:36.422134 1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-gpu-optimizer-75df97858d-5zb5s"
I0430 12:07:36.422165 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-gpu-optimizer-75df97858d-5zb5s"
I0430 12:07:36.621889 1 podresourcesscanner.go:148] "scanning pod" podName="helm-install-traefik-j89k5"
I0430 12:07:36.621923 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="helm-install-traefik-j89k5"
I0430 12:07:36.821266 1 podresourcesscanner.go:148] "scanning pod" podName="aibrix-metadata-service-66f45c85bc-k8pzx"
I0430 12:07:36.821294 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="aibrix-metadata-service-66f45c85bc-k8pzx"
I0430 12:07:37.022025 1 podresourcesscanner.go:148] "scanning pod" podName="local-path-provisioner-5cf85fd84d-hgf67"
I0430 12:07:37.022057 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="local-path-provisioner-5cf85fd84d-hgf67"
I0430 12:07:37.432143 1 metrics.go:51] "stopping metrics server" port=":8081"
I0430 12:07:37.432207 1 metrics.go:45] "metrics server stopped" exitCode="http: Server closed"
E0430 12:07:37.432223 1 main.go:66] "error while running" err="failed to create NodeResourceTopology: the server could not find the requested resource (post noderesourcetopologies.topology.node.k8s.io)"
lscpu snippet:
NUMA:
NUMA node(s): 8
NUMA node0 CPU(s): 0-13,112-125
NUMA node1 CPU(s): 14-27,126-139
NUMA node2 CPU(s): 28-41,140-153
NUMA node3 CPU(s): 42-55,154-167
NUMA node4 CPU(s): 56-69,168-181
NUMA node5 CPU(s): 70-83,182-195
NUMA node6 CPU(s): 84-97,196-209
NUMA node7 CPU(s): 98-111,210-223
Environment:
- Kubernetes version (use kubectl version): v1.31.3+k3s1
- Cloud provider or hardware configuration: On-prem hardware, Intel(R) Xeon(R) Platinum 8480+, 512GB
- OS (e.g. cat /etc/os-release): Ubuntu 23.04
- Kernel (e.g. uname -a): 6.2.0-39-generic
- Install tools:
- Network plugin and version (if this is a network-related bug):
- Others:
/cc @PiotrProkop @ffromani
@dittops what about the NRT resource? Have you checked if the NUMA architecture was exposed there? I just did a quick test:
apiVersion: topology.node.k8s.io/v1alpha2
attributes:
- name: topologyManagerPolicy
  value: none
- name: topologyManagerScope
  value: container
- name: nodeTopologyPodsFingerprint
  value: pfp0v0014f8589198c824330
kind: NodeResourceTopology
metadata:
  creationTimestamp: "2025-07-10T10:13:05Z"
  generation: 7
  name: eseldb12u01
  ownerReferences:
  - apiVersion: v1
    kind: Namespace
    name: node-feature-discovery
    uid: 78a9cfdc-8033-4743-bcfd-c9b6be5ef65d
  resourceVersion: "453128"
  uid: a983b8da-9cc8-4284-8edb-36ab7ae17d6b
topologyPolicies:
- None
zones:
- costs:
  - name: node-0
    value: 10
  - name: node-1
    value: 21
  name: node-0
  resources:
  - allocatable: "0"
    available: "0"
    capacity: "64"
    name: cpu
  type: Node
- costs:
  - name: node-0
    value: 21
  - name: node-1
    value: 10
  name: node-1
  resources:
  - allocatable: "0"
    available: "0"
    capacity: "64"
    name: cpu
  type: Node
and from the labels I can see NUMA set to true:
kubectl get nodes -oyaml | grep numa
feature.node.kubernetes.io/memory-numa: "true"
which is set in https://github.com/kubernetes-sigs/node-feature-discovery/blob/bcf95d93885fbb8934a813b71d5172bc6f4ec371/source/memory/memory.go#L76 if NUMA is detected. If I'm not wrong, that's the only NUMA-related information exposed as a label; to get more detail, like how many NUMA nodes there are and which cores belong to each one, you could probably look into the NRT resource as shown above.
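For example, assuming the NRT CRD is installed, the per-NUMA-zone details (CPU capacity, costs, and so on) for a node can be read directly from the NRT object; <node-name> here is a placeholder:

kubectl get noderesourcetopologies.topology.node.k8s.io <node-name> -o yaml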
This:
E0430 12:07:37.432223 1 main.go:66] "error while running" err="failed to create NodeResourceTopology: the server could not find the requested resource (post noderesourcetopologies.topology.node.k8s.io)"
seems to suggest the apiserver does not have the NodeResourceTopology CRD installed. Is that the case? It could be a bug in the Helm chart if so (note: I'm highly speculating here).
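A quick way to verify; the CRD name here is taken from the error message above:

kubectl get crd noderesourcetopologies.topology.node.k8s.io

If that returns NotFound, the updater has nothing to POST its NodeResourceTopology objects to, which would explain the error.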
Oops, I missed that line. I tried to reproduce it by excluding the CRD from the installation and got a somewhat similar result:
I0710 10:52:25.401335 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="nfd-topology-updater-xqhww"
I0710 10:52:25.601461 1 podresourcesscanner.go:148] "scanning pod" podName="kube-apiserver-eseldb12u01"
I0710 10:52:25.601507 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="kube-apiserver-eseldb12u01"
I0710 10:52:25.801273 1 podresourcesscanner.go:148] "scanning pod" podName="calico-apiserver-c847cf5c7-8wvcq"
I0710 10:52:25.801307 1 podresourcesscanner.go:231] "pod doesn't have devices" podName="calico-apiserver-c847cf5c7-8wvcq"
I0710 10:52:26.291680 1 nfd-topology-updater.go:201] "http server stopped" exitCode="http: Server closed"
E0710 10:52:26.291813 1 main.go:62] "error while running" err="failed to create NodeResourceTopology: the server could not find the requested resource (post noderesourcetopologies.topology.node.k8s.io)"
[event: pod node-feature-discovery/nfd-topology-updater-xqhww] Container image "ttl.sh/gcr.io_k8s-staging-nfd_node-feature-discovery:tilt-c2820ad1181ffe6b" already present on machine
Detected container restart. Pod: nfd-topology-updater-xqhww. Container: nfd-topology-updater.
@dittops you could probably see from the pod listing (e.g. kubectl get pods -A) that the topology updater is in an error state, right?
node-feature-discovery nfd-topology-updater-xqhww 0/1 Error 4 (50s ago)
If you used Helm, I believe the NRT CRD should get installed via https://github.com/kubernetes-sigs/node-feature-discovery/blob/master/deployment/helm/node-feature-discovery/templates/topologyupdater-crds.yaml. I wonder if you have done something that could have resulted in the CRD being deleted?
Also, have you set the createCRDs Helm flag to true?
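If the CRD did get deleted, re-running the Helm upgrade with that flag set should recreate it; a sketch, assuming the chart value key is topologyUpdater.createCRDs (check the chart's values for the exact key) and reusing an example release name:

helm upgrade nfd nfd/node-feature-discovery \
  --namespace node-feature-discovery \
  --set topologyUpdater.enable=true \
  --set topologyUpdater.createCRDs=true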
The source of the error is https://github.com/kubernetes-sigs/node-feature-discovery/blob/v0.17.3/pkg/nfd-topology-updater/nfd-topology-updater.go#L310
Looks like we should address this one in the documentation and the logs(?)
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale