node-feature-discovery
nfd-worker pods running on non-control plane node repeatedly crash in error loop
Hi there, I'm trying to get NFD working in my k8s cluster. I have 3 nodes in my cluster: two workers and a control plane. Immediately after installing, the nfd-worker pods running on the two worker nodes go into a crash, restart, repeat loop. Logs from one of the workers:
I0926 15:34:35.861546 1 nfd-worker.go:155] Node Feature Discovery Worker v0.11.2
I0926 15:34:35.861683 1 nfd-worker.go:156] NodeName: 'redacted'
I0926 15:34:35.862397 1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0926 15:34:35.862666 1 nfd-worker.go:461] worker (re-)configuration successfully completed
I0926 15:34:35.862819 1 base.go:127] connecting to nfd-master at nfd-master:8080 ...
I0926 15:34:35.862970 1 component.go:36] [core]parsed scheme: ""
I0926 15:34:35.862993 1 component.go:36] [core]scheme "" not registered, fallback to default scheme
I0926 15:34:35.863103 1 component.go:36] [core]ccResolverWrapper: sending update to cc: {[{nfd-master:8080 <nil> 0 <nil>}] <nil> <nil>}
I0926 15:34:35.863131 1 component.go:36] [core]ClientConn switching balancer to "pick_first"
I0926 15:34:35.863141 1 component.go:36] [core]Channel switches to new LB policy "pick_first"
I0926 15:34:35.863317 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I0926 15:34:35.863484 1 component.go:36] [core]Subchannel picks a new address "nfd-master:8080" to connect
I0926 15:34:35.863792 1 component.go:36] [core]Channel Connectivity change to CONNECTING
W0926 15:34:55.874327 1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {nfd-master:8080 nfd-master:8080 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: i/o timeout". Reconnecting...
I0926 15:34:55.874363 1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
I0926 15:34:55.874391 1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE
I0926 15:34:56.874543 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I0926 15:34:56.874584 1 component.go:36] [core]Subchannel picks a new address "nfd-master:8080" to connect
I0926 15:34:56.874939 1 component.go:36] [core]Channel Connectivity change to CONNECTING
W0926 15:35:16.874992 1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {nfd-master:8080 nfd-master:8080 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: i/o timeout". Reconnecting...
I0926 15:35:16.875306 1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
I0926 15:35:16.875543 1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE
I0926 15:35:18.398673 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I0926 15:35:18.398720 1 component.go:36] [core]Subchannel picks a new address "nfd-master:8080" to connect
I0926 15:35:18.399260 1 component.go:36] [core]Channel Connectivity change to CONNECTING
I0926 15:35:35.863059 1 component.go:36] [core]Channel Connectivity change to SHUTDOWN
I0926 15:35:35.863307 1 component.go:36] [core]Subchannel Connectivity change to SHUTDOWN
F0926 15:35:35.863476 1 main.go:64] failed to connect: context deadline exceeded
Additionally, if I try to get logs from the nfd-master pod or the nfd-worker pod that runs on the control plane, I get this error:
Error from server (BadRequest): previous terminated container "nfd-master" in pod "nfd-master-659d87f8d9-xzddh" not found
Any idea what's going on here? I'm fairly new to k8s
Hi @mantramantra12. Your NFD worker is not able to connect to the master, or to be exact, the gRPC-Go client can't connect to the server and the channel connectivity never reaches the READY state. How are you installing NFD?
Seems like nfd-master is not running at all. What do you get with kubectl describe <nfd-master-pod>?
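For example (assuming the default deployment namespace and labels), something like this should show whether nfd-master is actually Running and which node it landed on:
kubectl -n node-feature-discovery get pods -o wide
kubectl -n node-feature-discovery describe pod -l app=nfd-master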
Installing NFD with kubectl apply -k https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.11.2 as per the quick start installation guide.
kubectl describe <nfd-master-pod> gives Error from server (NotFound): pods "<nfd-master-pod>" not found, despite the pod being shown as 'Ready' with kubectl get pods -n node-feature-discovery
If I do kubectl get no -o json | jq .items[].metadata.labels, only the control plane node has NFD labels
kubectl describe
I meant to use the name of the actual pod, but this should work too
kubectl -n node-feature-discovery describe po nfd-master
Name: nfd-master-659d87f8d9-xzddh
Namespace: node-feature-discovery
Priority: 0
Node: nodename/nodeip
Start Time: Mon, 26 Sep 2022 15:32:13 +0000
Labels: app=nfd-master
pod-template-hash=659d87f8d9
Annotations: cni.projectcalico.org/containerID: <containerID>
cni.projectcalico.org/podIP: 192.168.27.39/32
cni.projectcalico.org/podIPs: 192.168.27.39/32
Status: Running
IP: 192.168.27.39
IPs:
IP: 192.168.27.39
Controlled By: ReplicaSet/nfd-master-659d87f8d9
Containers:
nfd-master:
Container ID: containerd://ade0e53a44c536c100bc91f2ba7ac7f55d9e7e137b21787e7f048cc80edd1d00
Image: k8s.gcr.io/nfd/node-feature-discovery:v0.11.2
Image ID: k8s.gcr.io/nfd/node-feature-discovery@sha256:99112589d9bc5521fb465fd9cba96066a03eb7c904ce2c39c08f6ad709cece56
Port: <none>
Host Port: <none>
Command:
nfd-master
State: Running
Started: Mon, 26 Sep 2022 15:32:17 +0000
Ready: True
Restart Count: 0
Liveness: exec [/usr/bin/grpc_health_probe -addr=:8080] delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: exec [/usr/bin/grpc_health_probe -addr=:8080] delay=5s timeout=1s period=10s #success=1 #failure=10
Environment:
NODE_NAME: (v1:spec.nodeName)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hhv5s (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-hhv5s:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/control-plane:NoSchedule
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
I guess your pod network is not working
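One quick thing to check (assuming the default manifests, where the master is exposed through a Service called nfd-master) is whether that Service actually has an endpoint behind it:
kubectl -n node-feature-discovery get svc,endpoints nfd-master
If the endpoints list is empty, or the workers can't reach the service IP, that points at the pod network / CNI rather than at NFD.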
If I do
kubectl get no -o json | jq .items[].metadata.labels, only the control plane node has NFD labels
That's because your control plane node doesn't allow scheduling pods by default. As such, the nfd-worker pod can't be scheduled on that node, so no labeling happens there. To enable it, you need to untaint your control plane node, which will allow NFD to start a worker pod on it.
kubectl taint node "node-name" node-role.kubernetes.io/control-plane-
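If you want to check first which taints are actually on the node (the key is node-role.kubernetes.io/control-plane on newer clusters, node-role.kubernetes.io/master on older ones), something like this works:
kubectl describe node "node-name" | grep -i taint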
I guess your pod network is not working
I'm not sure whether it's related, but we've seen an issue where cluster nodes (VM nodes running on a bare-metal server that acts as the cluster master node) run out of memory and things fail. The issue is that after the OOM condition goes away (e.g. by rebooting all the OOMing nodes), other pods [1] recover, but the NFD workers do not. They complain about failing nfd-master communication.
Deleting the whole NFD deployment and reapplying it can help. What persistent state do the NFD workers (or the k8s features they rely on) have that can break their nfd-master communication like that, and that requires re-creating the NFD deployment to recover?
[1] The cluster is not running much besides the control plane, device plugins and a few simple (http+json) test services, so that's not necessarily telling much.
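(To be clear, by deleting and reapplying I just mean removing and re-creating the kustomize deployment, roughly:
kubectl delete -k https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.11.2
kubectl apply -k https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.11.2
with the ref adjusted to whatever version is actually deployed.)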
I am also having the same output as the OP. I tried following this:
kubectl taint node "node-name" node-role.kubernetes.io/control-plane-
I got:
error: at least one taint update is required
I am guessing I am having a separate issue. I think I just am missing another service to help the master and workers talk but I have no idea where to start.
I am guessing I am having a separate issue. I think I just am missing another service to help the master and workers talk but I have no idea where to start.
NFD uses the standard Kubernetes Service mechanism for communication between nfd-worker and nfd-master. There have been N+1 failure reports like this, and it's virtually always been a failure in pod networking (CNI).
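A rough way to verify that from inside the cluster is to spin up a throwaway pod and try to reach the nfd-master Service (names here match the default deployment; busybox's nc is assumed to be good enough for a TCP check):
kubectl -n node-feature-discovery run nettest --rm -it --restart=Never --image=busybox:1.36 -- nc -zv -w 5 nfd-master 8080
If that times out when the test pod lands on a worker node, the problem is in the pod network / Service routing rather than in NFD itself.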
Thank you. I was able to make progress migrating to calico.
So I found that the cause of my problem was that my cluster is configured with linkerd, which auto-injects proxies into pods, so I had to add a policy to allow communication between the worker pods and the master pod.
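In case it helps anyone else running linkerd, this is roughly the shape of the policy I mean. Treat it as a sketch only: the resource names are made up, unauthenticated access is used just to keep the example short, and the exact API versions and client rules depend on your linkerd version.
kubectl apply -f - <<EOF
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: nfd-master-grpc
  namespace: node-feature-discovery
spec:
  podSelector:
    matchLabels:
      app: nfd-master
  port: 8080
  proxyProtocol: gRPC
---
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: allow-nfd-workers
  namespace: node-feature-discovery
spec:
  server:
    name: nfd-master-grpc
  client:
    unauthenticated: true
EOF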