
nfd-worker pods running on non-control-plane nodes repeatedly crash in an error loop

Open mantramantra12 opened this issue 3 years ago • 8 comments

Hi there, I'm trying to get NFD working in my k8s cluster. I have 3 nodes in my cluster: two workers and a control plane. Immediately after installing, the nfd-worker pods running on the two worker nodes go into a crash-restart loop. Logs from one of the workers:

I0926 15:34:35.861546       1 nfd-worker.go:155] Node Feature Discovery Worker v0.11.2
I0926 15:34:35.861683       1 nfd-worker.go:156] NodeName: 'redacted'
I0926 15:34:35.862397       1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0926 15:34:35.862666       1 nfd-worker.go:461] worker (re-)configuration successfully completed
I0926 15:34:35.862819       1 base.go:127] connecting to nfd-master at nfd-master:8080 ...
I0926 15:34:35.862970       1 component.go:36] [core]parsed scheme: ""
I0926 15:34:35.862993       1 component.go:36] [core]scheme "" not registered, fallback to default scheme
I0926 15:34:35.863103       1 component.go:36] [core]ccResolverWrapper: sending update to cc: {[{nfd-master:8080  <nil> 0 <nil>}] <nil> <nil>}
I0926 15:34:35.863131       1 component.go:36] [core]ClientConn switching balancer to "pick_first"
I0926 15:34:35.863141       1 component.go:36] [core]Channel switches to new LB policy "pick_first"
I0926 15:34:35.863317       1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I0926 15:34:35.863484       1 component.go:36] [core]Subchannel picks a new address "nfd-master:8080" to connect
I0926 15:34:35.863792       1 component.go:36] [core]Channel Connectivity change to CONNECTING
W0926 15:34:55.874327       1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {nfd-master:8080 nfd-master:8080 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: i/o timeout". Reconnecting...
I0926 15:34:55.874363       1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
I0926 15:34:55.874391       1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE
I0926 15:34:56.874543       1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I0926 15:34:56.874584       1 component.go:36] [core]Subchannel picks a new address "nfd-master:8080" to connect
I0926 15:34:56.874939       1 component.go:36] [core]Channel Connectivity change to CONNECTING
W0926 15:35:16.874992       1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {nfd-master:8080 nfd-master:8080 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: i/o timeout". Reconnecting...
I0926 15:35:16.875306       1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
I0926 15:35:16.875543       1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE
I0926 15:35:18.398673       1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I0926 15:35:18.398720       1 component.go:36] [core]Subchannel picks a new address "nfd-master:8080" to connect
I0926 15:35:18.399260       1 component.go:36] [core]Channel Connectivity change to CONNECTING
I0926 15:35:35.863059       1 component.go:36] [core]Channel Connectivity change to SHUTDOWN
I0926 15:35:35.863307       1 component.go:36] [core]Subchannel Connectivity change to SHUTDOWN
F0926 15:35:35.863476       1 main.go:64] failed to connect: context deadline exceeded

Additionally, if I try to get logs from the nfd-master pod or the nfd-worker pod that runs on the control plane, I get this error:

Error from server (BadRequest): previous terminated container "nfd-master" in pod "nfd-master-659d87f8d9-xzddh" not found

Any idea what's going on here? I'm fairly new to k8s

mantramantra12 avatar Sep 26 '22 15:09 mantramantra12

Hi @mantramantra12. Your NFD worker is not able to connect to the master, or to be exact, the gRPC-Go client can't connect to the server and the channel connectivity never reaches the READY state. How are you installing NFD?
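
A quick first thing to check (a sketch, assuming the default deployment in the node-feature-discovery namespace) is whether the nfd-master Service that the worker dials at nfd-master:8080 exists and has an endpoint behind it:

 kubectl -n node-feature-discovery get svc,endpoints nfd-master

An empty ENDPOINTS column would mean no running nfd-master pod is backing the Service.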

fmuyassarov avatar Sep 26 '22 23:09 fmuyassarov

Seems like nfd-master is not running at all. What do you get with kubectl describe <nfd-master-pod>?

marquiz avatar Sep 27 '22 07:09 marquiz

I'm installing NFD with kubectl apply -k https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.11.2 as per the quick start installation guide.

kubectl describe <nfd-master-pod> gives Error from server (NotFound): pods "<nfd-master-pod>" not found, despite the pod being shown as 'Ready' with kubectl get pods -n node-feature-discovery

mantramantra12 avatar Sep 27 '22 08:09 mantramantra12

If I do kubectl get no -o json | jq .items[].metadata.labels, only the control plane node has NFD labels

mantramantra12 avatar Sep 27 '22 08:09 mantramantra12

kubectl describe

I meant the name of the actual pod, but this should work too:

kubectl -n node-feature-discovery describe po nfd-master

marquiz avatar Sep 27 '22 11:09 marquiz

Name:         nfd-master-659d87f8d9-xzddh
Namespace:    node-feature-discovery
Priority:     0
Node:         nodename/nodeip
Start Time:   Mon, 26 Sep 2022 15:32:13 +0000
Labels:       app=nfd-master
              pod-template-hash=659d87f8d9
Annotations:  cni.projectcalico.org/containerID: <containerID>
              cni.projectcalico.org/podIP: 192.168.27.39/32
              cni.projectcalico.org/podIPs: 192.168.27.39/32
Status:       Running
IP:           192.168.27.39
IPs:
  IP:           192.168.27.39
Controlled By:  ReplicaSet/nfd-master-659d87f8d9
Containers:
  nfd-master:
    Container ID:  containerd://ade0e53a44c536c100bc91f2ba7ac7f55d9e7e137b21787e7f048cc80edd1d00
    Image:         k8s.gcr.io/nfd/node-feature-discovery:v0.11.2
    Image ID:      k8s.gcr.io/nfd/node-feature-discovery@sha256:99112589d9bc5521fb465fd9cba96066a03eb7c904ce2c39c08f6ad709cece56
    Port:          <none>
    Host Port:     <none>
    Command:
      nfd-master
    State:          Running
      Started:      Mon, 26 Sep 2022 15:32:17 +0000
    Ready:          True
    Restart Count:  0
    Liveness:       exec [/usr/bin/grpc_health_probe -addr=:8080] delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      exec [/usr/bin/grpc_health_probe -addr=:8080] delay=5s timeout=1s period=10s #success=1 #failure=10
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hhv5s (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-hhv5s:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

mantramantra12 avatar Sep 27 '22 11:09 mantramantra12

I guess your pod network is not working
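
One quick check (a sketch, assuming the Calico CNI that shows up in the pod annotations above) is to confirm the CNI pods are healthy on every node and to see where the NFD pods landed:

 kubectl -n kube-system get pods -o wide | grep calico
 kubectl -n node-feature-discovery get pods -o wide

A calico-node pod stuck in a non-Running state on a worker node would explain cross-node connection timeouts like the ones in the worker log.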

marquiz avatar Sep 27 '22 13:09 marquiz

If I do kubectl get no -o json | jq .items[].metadata.labels, only the control plane node has NFD labels

That's because your control plane node doesn't allow scheduling pods by default. As such, an nfd-worker pod can't be scheduled on that node, so no labeling happens there. To enable this, you need to untaint your control plane node, which will allow NFD to start a worker pod on it.

 kubectl taint node "node-name" node-role.kubernetes.io/control-plane-
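
On older clusters the control plane taint may still use the legacy master key; in that case (only an assumption about your cluster version, check the Taints field in kubectl describe node) the equivalent command is:

 kubectl taint node "node-name" node-role.kubernetes.io/master-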

fmuyassarov avatar Sep 28 '22 13:09 fmuyassarov

I guess your pod network is not working

I'm not sure whether it's related, but we've seen an issue where cluster nodes (VM nodes running on a bare-metal server that acts as the cluster master node) run out of memory and things fail. The issue is that after the OOM condition goes away (e.g. by rebooting all the OOMing nodes), other pods [1] recover, but NFD workers do not. They complain about failing nfd-master communication.

Deleting the whole NFD deployment and reapplying it can help. What persistent state do NFD workers (or the k8s features they use) rely on that can break their nfd-master communication like that, such that recovery requires re-creating the NFD deployment?

[1] The cluster is not running much besides the control plane, device plugins and a few simple (HTTP+JSON) test services, so that's not necessarily telling much.
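
For reference, "deleting and reapplying" here means re-creating the deployment from the same kustomize overlay as in the quick start (a sketch, assuming the v0.11.2 default overlay mentioned earlier in this thread):

 kubectl delete -k https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.11.2
 kubectl apply -k https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.11.2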

eero-t avatar Oct 27 '22 09:10 eero-t

I am also having the same output as the OP. I tried following this:

 kubectl taint node "node-name" node-role.kubernetes.io/control-plane-

I got:

error: at least one taint update is required

I am guessing I am having a separate issue. I think I'm just missing another service to help the master and workers talk, but I have no idea where to start.

Z02X avatar Nov 08 '22 06:11 Z02X

I am guessing I am having a separate issue. I think I'm just missing another service to help the master and workers talk, but I have no idea where to start.

NFD uses the standard Kubernetes Service mechanism for communication between nfd-worker and nfd-master. There have been N+1 failure reports like this and it's virtually always been a failure in pod networking (CNI).
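
One way to test that path (a sketch; the throwaway pod name and busybox image are only for illustration, and cluster.local is assumed as the cluster domain) is to check that the nfd-master Service name resolves over the pod network from inside the cluster:

 kubectl -n node-feature-discovery run nfd-dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup nfd-master.node-feature-discovery.svc.cluster.local

A timeout or NXDOMAIN here points at DNS or the CNI rather than at NFD itself.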

marquiz avatar Nov 08 '22 08:11 marquiz

I am guessing I am having a separate issue. I think I'm just missing another service to help the master and workers talk, but I have no idea where to start.

NFD uses the standard Kubernetes Service mechanism for communication between nfd-worker and nfd-master. There have been N+1 failure reports like this and it's virtually always been a failure in pod networking (CNI).

Thank you. I was able to make progress migrating to calico.

Z02X avatar Nov 09 '22 03:11 Z02X

So I found that the cause of my problem was that my cluster is configured with Linkerd, which auto-injects proxies into pods, so I had to add a policy to allow communication between the worker pods and the master pod.
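
For anyone hitting the same thing, a minimal sketch of such a policy using Linkerd's Server/ServerAuthorization CRDs (the resource names below are made up; the app=nfd-master label and port 8080 are taken from the default deployment shown above, so adjust for your setup):

kubectl apply -f - <<EOF
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: nfd-master-grpc
  namespace: node-feature-discovery
spec:
  podSelector:
    matchLabels:
      app: nfd-master
  port: 8080
  proxyProtocol: gRPC
---
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: nfd-master-grpc
  namespace: node-feature-discovery
spec:
  server:
    name: nfd-master-grpc
  client:
    meshTLS:
      identities:
        - "*"
EOF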

mantramantra12 avatar Nov 11 '22 11:11 mantramantra12