node-feature-discovery

NFD worker fails to communicate with NFD master after the worker node rejoins the cluster

Open Tianhao-intel opened this issue 3 years ago • 4 comments

What happened:

After a node (running an NFD worker) was deleted with kubectl and then rejoined the cluster, the NFD worker failed to communicate with the NFD master. The nfd-worker pod logs show the following errors:

I0526 15:13:24.461238 1 component.go:36] [core]parsed scheme: ""
I0526 15:13:24.461249 1 component.go:36] [core]scheme "" not registered, fallback to default scheme
I0526 15:13:24.461364 1 component.go:36] [core]ccResolverWrapper: sending update to cc: {[{nfd-node-feature-discovery-master:8080 0 }] }
I0526 15:13:24.461377 1 component.go:36] [core]ClientConn switching balancer to "pick_first"
I0526 15:13:24.461472 1 component.go:36] [core]Channel switches to new LB policy "pick_first"
I0526 15:13:24.461504 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I0526 15:13:24.461672 1 component.go:36] [core]Subchannel picks a new address "nfd-node-feature-discovery-master:8080" to connect
I0526 15:13:24.465091 1 component.go:36] [core]Channel Connectivity change to CONNECTING
W0526 15:13:44.461771 1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {nfd-node-feature-discovery-master:8080 nfd-node-feature-discovery-master:8080 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp: i/o timeout". Reconnecting...
I0526 15:13:44.461914 1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
I0526 15:13:44.462098 1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE

What you expected to happen:

The NFD worker should communicate with the NFD master normally after the worker node rejoins the cluster.

How to reproduce it (as minimally and precisely as possible):

Deploy NFD successfully with Helm. Delete one worker node and, once its NFD worker pod has been deleted, rejoin the node to the cluster.

Anything else we need to know?:

The NFD worker's gateway address is left behind after the node is deleted, so after the node rejoins there are two gateway addresses.

Environment:

  • Kubernetes version (use kubectl version): client: v1.22.0; server: v1.17.8+vmware.1
  • Cloud provider or hardware configuration: Tanzu
  • OS (e.g: cat /etc/os-release): Debian GNU/Linux 10 (buster)
  • Kernel (e.g. uname -a): 5.4.115
  • Install tools: helm

Tianhao-intel avatar May 26 '22 15:05 Tianhao-intel

This looks unlikely to be NFD-related. I suspect your pod network is not working correctly. Check DNS and CNI on the node.
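
As a rough way to tell a DNS problem from a connectivity problem, the sketch below resolves a name and then attempts a plain TCP connection, reporting which step failed. It is only an illustration: the function name check_endpoint is made up here, it assumes Python 3 is available in a pod scheduled on the affected node, and it does not speak gRPC, so a successful connection only shows that the address is reachable.

```python
import socket


def check_endpoint(host: str, port: int, timeout: float = 5.0) -> dict:
    """Resolve host, then try a TCP connection; report which step failed.

    Hypothetical helper for illustration only, not part of NFD or kubectl.
    """
    result = {"resolved": [], "connected": False, "error": None}

    # Step 1: name resolution (inside a pod this goes through the cluster DNS).
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = f"DNS lookup failed: {exc}"
        return result
    result["resolved"] = sorted({info[4][0] for info in infos})

    # Step 2: TCP connection to each resolved address until one succeeds.
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            result["connected"] = True
            result["error"] = None
            return result
        except OSError as exc:
            result["error"] = f"TCP connect to {sockaddr[0]} failed: {exc}"

    return result
```

For the setup in this issue, the call would be check_endpoint("nfd-node-feature-discovery-master", 8080), using the service name and port from the worker log. An error starting with "DNS lookup failed" points at cluster DNS, while a resolved address with a failed connect points at the pod network.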

marquiz avatar Jun 01 '22 13:06 marquiz

I get this from time to time when running Minikube. My workaround is:

kubectl -n kube-system rollout restart deployment coredns

It always fixes the network issue, and the NFD workers communicate with the master again :)

ArangoGutierrez avatar Jun 08 '22 11:06 ArangoGutierrez

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 06 '22 11:09 k8s-triage-robot

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Oct 06 '22 12:10 k8s-triage-robot

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Nov 05 '22 12:11 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".


k8s-ci-robot avatar Nov 05 '22 12:11 k8s-ci-robot