node-feature-discovery: NFD worker fails to communicate with NFD master after the worker node rejoins the cluster

What happened:

After a node (running an NFD worker) was deleted with the "kubectl" command and then rejoined the cluster, the NFD worker failed to communicate with the NFD master. The nfd-worker pod logged the following errors:

I0526 15:13:24.461238 1 component.go:36] [core]parsed scheme: ""
I0526 15:13:24.461249 1 component.go:36] [core]scheme "" not registered, fallback to default scheme
I0526 15:13:24.461364 1 component.go:36] [core]ccResolverWrapper: sending update to cc: {[{nfd-node-feature-discovery-master:8080 0 }] }
I0526 15:13:24.461377 1 component.go:36] [core]ClientConn switching balancer to "pick_first"
I0526 15:13:24.461472 1 component.go:36] [core]Channel switches to new LB policy "pick_first"
I0526 15:13:24.461504 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I0526 15:13:24.461672 1 component.go:36] [core]Subchannel picks a new address "nfd-node-feature-discovery-master:8080" to connect
I0526 15:13:24.465091 1 component.go:36] [core]Channel Connectivity change to CONNECTING
W0526 15:13:44.461771 1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {nfd-node-feature-discovery-master:8080 nfd-node-feature-discovery-master:8080 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp: i/o timeout". Reconnecting...
I0526 15:13:44.461914 1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
I0526 15:13:44.462098 1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE
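
For anyone reading the log above: only one entry is a warning (the dial timeout, logged 20 seconds after the connection attempt starts); the rest is gRPC info output. A minimal sketch for separating the two, assuming the log uses the standard klog header (severity letter, mmdd, time, thread id, file:line). The helper names `split_klog` and `warnings_and_errors` are made up for this illustration and are not part of NFD.

```python
import re

# klog header: severity letter, mmdd, hh:mm:ss.uuuuuu, thread id, file:line]
KLOG_HEADER = re.compile(
    r"(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^\s:]+:\d+)\] "
)


def split_klog(text):
    """Split klog output into (severity, time, source, message) tuples.

    Works whether entries are on separate lines or run together on one line,
    because it splits on the header pattern rather than on newlines.
    """
    matches = list(KLOG_HEADER.finditer(text))
    entries = []
    for i, m in enumerate(matches):
        # The message runs until the next header (or the end of the text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        message = text[m.end():end].strip()
        entries.append((m.group("sev"), m.group("time"), m.group("src"), message))
    return entries


def warnings_and_errors(text):
    """Keep only entries with severity W, E or F."""
    return [entry for entry in split_klog(text) if entry[0] in "WEF"]


# Three entries copied from the log in this report, joined on one line.
SAMPLE = (
    'I0526 15:13:24.465091 1 component.go:36] [core]Channel Connectivity change to CONNECTING '
    'W0526 15:13:44.461771 1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to '
    '{nfd-node-feature-discovery-master:8080 nfd-node-feature-discovery-master:8080 0 }. '
    'Err: connection error: desc = "transport: Error while dialing dial tcp: i/o timeout". Reconnecting... '
    'I0526 15:13:44.461914 1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE'
)

if __name__ == "__main__":
    for sev, time, src, msg in warnings_and_errors(SAMPLE):
        print(sev, time, src, msg)
```

The "i/o timeout" while dialing means the TCP connection to nfd-node-feature-discovery-master:8080 was never established, which is a different failure from a refused connection or a gRPC-level error.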

What you expected to happen:

The NFD worker should communicate with the NFD master normally after the worker node rejoins the cluster.

How to reproduce it (as minimally and precisely as possible):

Deploy NFD with Helm and wait for the deployment to succeed. Delete one worker node and, after its NFD worker pod has been deleted, rejoin the node to the cluster.

Anything else we need to know?:

The NFD worker's gateway address is left behind after the node is deleted, so after the node rejoins there are two gateway addresses.

Environment:

Kubernetes version (use kubectl version): client: v1.22.0; server: v1.17.8+vmware.1
Cloud provider or hardware configuration: Tanzu
OS (e.g: cat /etc/os-release): Debian GNU/Linux 10 (buster)
Kernel (e.g. uname -a): 5.4.115
Install tools: helm

May 26 '22 15:05 Tianhao-intel

It looks unlikely that this is anything NFD-related. I suspect your pod network is not working correctly. Check DNS and CNI on the node.
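
One way to tell a DNS problem from a pod-network (CNI) problem is to resolve the master's service name and then attempt a plain TCP connection, which is roughly what the gRPC client does before it reports the dial timeout. A minimal sketch, assuming it is run from a pod scheduled on the affected node that has Python available; `check_endpoint` is a hypothetical helper written for this illustration, and the service name and port are the ones shown in the worker log.

```python
import socket


def check_endpoint(host, port, timeout=5.0):
    """Report whether host:port fails at DNS resolution or at TCP connect."""
    try:
        # Step 1: name resolution (cluster DNS when run inside a pod).
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return f"dns-failure: {exc}"

    last_error = None
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            # Step 2: TCP connect to each resolved address, with a timeout.
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return f"ok: connected to {sockaddr[0]}"
        except OSError as exc:
            last_error = exc
    return f"connect-failure: {last_error}"


# Example call, using the address from the nfd-worker log:
# print(check_endpoint("nfd-node-feature-discovery-master", 8080))
```

A "dns-failure" result points at cluster DNS, while a "connect-failure" with a timeout after successful resolution points at the pod network or service routing on that node.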

Jun 01 '22 13:06 marquiz

I get this from time to time when running Minikube. My workaround is:

kubectl -n kube-system rollout restart deployment coredns

It always fixes the network issue, and the NFD workers communicate with the master again :)
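
If the workaround needs to be scripted, a minimal sketch could wrap the restart and then wait for the rollout to finish before checking the workers again. This assumes kubectl is on the PATH and that CoreDNS runs as the "coredns" deployment in "kube-system", as in the command above; the function names and the 120s timeout are arbitrary choices for this illustration.

```python
import subprocess


def coredns_restart_commands(namespace="kube-system", deployment="coredns", timeout="120s"):
    """Build the kubectl invocations: restart the deployment, then wait for it."""
    target = f"deployment/{deployment}"
    return [
        ["kubectl", "-n", namespace, "rollout", "restart", target],
        ["kubectl", "-n", namespace, "rollout", "status", target, f"--timeout={timeout}"],
    ]


def restart_coredns(**kwargs):
    """Run the commands in order; raises CalledProcessError if kubectl fails."""
    for argv in coredns_restart_commands(**kwargs):
        subprocess.run(argv, check=True)


# Example call (requires access to the cluster):
# restart_coredns()
```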

Jun 08 '22 11:06 ArangoGutierrez

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
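
The rules above can be worked through for this issue. A small sketch, assuming the last human comment (Jun 08 '22) counts as the last activity; `lifecycle_dates` is a hypothetical helper, not part of the triage bot.

```python
from datetime import date, timedelta


def lifecycle_dates(last_activity):
    """Apply the 90d / 30d / 30d inactivity rules to a last-activity date."""
    stale = last_activity + timedelta(days=90)
    rotten = stale + timedelta(days=30)
    closed = rotten + timedelta(days=30)
    return {"stale": stale, "rotten": rotten, "closed": closed}


timeline = lifecycle_dates(date(2022, 6, 8))
for label, day in timeline.items():
    print(label, day.isoformat())
# stale 2022-09-06
# rotten 2022-10-06
# closed 2022-11-05
```

These dates match the timestamps of the bot comments in this thread.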

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Sep 06 '22 11:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Oct 06 '22 12:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Nov 05 '22 12:11 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Nov 05 '22 12:11 k8s-ci-robot
