
Update Kubernetes Node condition `NetworkUnavailable` in case of problems

Open danielfoehrKn opened this issue 5 years ago • 9 comments

Hi, we are sometimes encountering problems with calico-node on Kubernetes.

Issue Description

The condition type NetworkUnavailable on a Kubernetes node seems to be initialised to False only on startup. However, we sometimes experience issues such as the following (from the calico-node logs):

bird: Mesh_10_250_10_60: Socket error: bind: Address in use
bird: Mesh_10_250_19_13: Socket error: bind: Address in use
bird: Mesh_10_250_21_169: Socket error: bind: Address in use

Even so, the node condition remains set to False with reason CalicoIsUp, indicating that everything is working fine.

Expected Behavior

I would expect that calico's state is reflected in the Kubernetes node condition NetworkUnavailable (also in case of abnormalities during runtime).

Current Behavior

The condition type NetworkUnavailable on a Kubernetes node is only initialised to False on startup, so problems during runtime are not reflected in the node condition.

Another possibility is that calico sets NetworkUnavailable too early during startup, with the actual error happening after the initialisation.

Context

We create Kubernetes nodes via the Machine Controller Manager, with calico-node running as a DaemonSet. We monitor the node conditions, including NetworkUnavailable, so that we can replace a node when calico has problems.
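As an illustration of the kind of monitoring described above, here is a minimal sketch that inspects a node object (as returned by `kubectl get node <name> -o json`) and reports whether its NetworkUnavailable condition is set. The helper name and the sample node dict are mine, not part of any Calico or Machine Controller Manager API:

```python
# Sketch: check a node's NetworkUnavailable condition from the JSON form of a
# Kubernetes Node object. The helper is hypothetical; only the condition
# fields (type/status/reason/message) come from the Kubernetes API.

def network_unavailable(node: dict) -> bool:
    """Return True if the NetworkUnavailable condition has status "True"."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "NetworkUnavailable":
            return cond.get("status") == "True"
    # Condition absent: treat the network state as unknown, not broken.
    return False

# Example node, mirroring the healthy-looking state reported in this issue.
node = {
    "status": {
        "conditions": [
            {"type": "NetworkUnavailable", "status": "False",
             "reason": "CalicoIsUp",
             "message": "Calico is running on this node"},
            {"type": "Ready", "status": "True", "reason": "KubeletReady"},
        ]
    }
}

print(network_unavailable(node))
```

The point of the issue is precisely that this check stays False even when BIRD is broken, so on its own it is not a sufficient replacement signal.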

Your Environment

  • calico/node version: <TODO>
  • Operating System and version: Ubuntu 18.04.4 LTS (though observed independently)
  • Link to your project (optional):

danielfoehrKn avatar Jun 16 '20 16:06 danielfoehrKn

To add to what @danielfoehrKn wrote, we are using calico v3.13.4.

zanetworker avatar Jun 16 '20 19:06 zanetworker

Cross referencing https://github.com/projectcalico/calico/issues/435, which discusses the same.

I will close that one, since it's a bit old and was originally scoped just to setting the value on launch.

caseydavenport avatar Jun 17 '20 16:06 caseydavenport

bird: Mesh_10_250_10_60: Socket error: bind: Address in use

Separate from this issue, do you guys know why you're encountering this? Sounds like it could be a bug. Might be worth opening another issue just to discuss this aspect of it.

caseydavenport avatar Jun 17 '20 16:06 caseydavenport

Separate from this issue, do you guys know why you're encountering this? Sounds like it could be a bug. Might be worth opening another issue just to discuss this aspect of it.

My guess is that bird is trying to re-bind to the node, though I'm not sure why.

Also created another issue projectcalico/node#522 for the bind error.

zanetworker avatar Jun 18 '20 16:06 zanetworker

I observed another incident when calico-node pod was not ready but the node condition NetworkUnavailable was healthy.

This time I do not think it is related to the Bird issue mentioned above, but rather to a misconfiguration of the Kubernetes API server's mutating webhook configurations (Calico uses Kubernetes as its datastore). I am not 100% sure here (I do not know much about calico's internals), but I will paste the calico-node logs below.

bird: Mesh_10_250_0_4: State changed to down
bird: Reconfigured
2020-07-10 00:09:16.935 [INFO][49] resource.go 260: Target config /etc/calico/confd/config/bird.cfg has been updated
bird: Mesh_10_250_0_6: State changed to stop
bird: Mesh_10_250_0_6: State changed to down
bird: Mesh_10_250_0_6: Starting
bird: Mesh_10_250_0_6: State changed to start
bird: Mesh_10_250_0_3: State changed to start
2020-07-10 02:06:44.481 [INFO][49] util.go 66: /etc/calico/confd/config/bird.cfg has md5sum 15f7c1e79ac11f6e060b69f78724a3cc should be c28f3597463c952853af74688ea72f8c
2020-07-10 02:06:44.481 [INFO][49] resource.go 220: Target config /etc/calico/confd/config/bird.cfg out of sync
bird: Reconfiguration requested by SIGHUP
bird: Reconfiguration requested by SIGHUP
bird: Reconfiguring
bird: device1: Reconfigured
bird: direct1: Reconfigured
bird: Removing protocol Mesh_10_250_0_3
bird: Mesh_10_250_0_3: Shutting down
bird: Mesh_10_250_0_3: State changed to stop
bird: Mesh_10_250_0_6: Reconfigured
bird: Reconfiguring
bird: device1: Reconfigured
bird: direct1: Reconfigured
bird: Reconfigured
bird: Mesh_10_250_0_3: State changed to down
bird: Reconfigured
2020-07-10 02:06:44.490 [INFO][49] resource.go 260: Target config /etc/calico/confd/config/bird.cfg has been updated

In any case, the topic of this issue is again that the node showed up as Ready and pods continued to be scheduled on it, even though the network was broken. If the node had been marked as unhealthy, our automation could have replaced it automatically.

Thank you and let me know if you need more information!

danielfoehrKn avatar Jul 10 '20 14:07 danielfoehrKn

In our case, when we provision new nodes and their ToR's BGP peer isn't yet configured, bird can't connect, but the nodes still incorrectly show up as Ready in Kubernetes:


# kubectl get node
NAME                  STATUS   ROLES    AGE    VERSION
prod-01-node-t09-1    Ready    <none>   14m    v1.16.8


root@prod-01-node-t09-1:~# calicoctl.sh node status
Calico process is running.

IPv4 BGP status
+---------------+---------------+-------+----------+--------------------------------+
| PEER ADDRESS  |   PEER TYPE   | STATE |  SINCE   |              INFO              |
+---------------+---------------+-------+----------+--------------------------------+
| 10.100.66.254 | node specific | start | 17:54:03 | Connect Received: Connection   |
|               |               |       |          | rejected                       |
+---------------+---------------+-------+----------+--------------------------------+
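A workaround while the condition itself can't be trusted is to check the BGP session state directly. The sketch below parses the bordered table printed by `calicoctl node status` and flags peers whose STATE column is not "up" (a healthy session shows STATE "up" with "Established" in INFO). The parser is approximate and assumes the table layout shown above; the function name is mine:

```python
# Sketch: flag BGP peers that are not up, by parsing the table emitted by
# `calicoctl node status`. Assumes the +---+ bordered layout; wrapped INFO
# continuation lines (empty first cell) and the header row are skipped.

def unhealthy_peers(status_output: str) -> list:
    peers = []
    for line in status_output.splitlines():
        if not line.startswith("|"):
            continue  # border line or surrounding prose
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) < 4 or cells[0] in ("PEER ADDRESS", ""):
            continue  # header row or wrapped continuation line
        addr, state = cells[0], cells[2]
        if state != "up":
            peers.append(addr)
    return peers

# The table from the comment above, reduced to its essential rows.
sample = """\
+---------------+---------------+-------+----------+------------------------------+
| PEER ADDRESS  |   PEER TYPE   | STATE |  SINCE   |             INFO             |
+---------------+---------------+-------+----------+------------------------------+
| 10.100.66.254 | node specific | start | 17:54:03 | Connect Received: Connection |
|               |               |       |          | rejected                     |
+---------------+---------------+-------+----------+------------------------------+"""

print(unhealthy_peers(sample))
```

A node-replacement controller could run this per node and act on any non-empty result, independently of the NetworkUnavailable condition.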

grandich avatar Mar 04 '21 18:03 grandich

Where does this issue stand? I currently have a node where calico-node is not ready, but the node is still shown as Ready.

NAME                                                   READY   STATUS    RESTARTS   AGE   IP             NODE                       NOMINATED NODE   READINESS GATES
pod/calico-node-twhv7                                  0/1     Running   3          28h   10.1.32.33     5-21-354-1178-1-c47f7fa5   <none>           <none>
Name:               5-21-354-1178-1-c47f7fa5
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    f5role=worker
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=5-21-354-1178-1-c47f7fa5
                    kubernetes.io/os=linux
                    location=dev
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.1.32.33/32
                    projectcalico.org/IPv4IPIPTunnelAddr: 10.101.13.64
                    projectcalico.org/labels: {"edge":"true"}
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 01 Feb 2022 15:33:02 +0100
Taints:             node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  5-21-354-1178-1-c47f7fa5
  AcquireTime:     <unset>
  RenewTime:       Wed, 02 Feb 2022 19:41:22 +0100
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Wed, 02 Feb 2022 11:22:41 +0100   Wed, 02 Feb 2022 11:22:41 +0100   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Wed, 02 Feb 2022 19:39:03 +0100   Wed, 02 Feb 2022 11:22:25 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Wed, 02 Feb 2022 19:39:03 +0100   Wed, 02 Feb 2022 11:22:25 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Wed, 02 Feb 2022 19:39:03 +0100   Wed, 02 Feb 2022 11:22:25 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Wed, 02 Feb 2022 19:39:03 +0100   Wed, 02 Feb 2022 11:22:25 +0100   KubeletReady                 kubelet is posting ready status

rgarcia89 avatar Feb 02 '22 18:02 rgarcia89

Currently, calico/node will mark a node as not ready when it is gracefully terminated (e.g., during rolling update or drain of the pod from a node) and will then mark the node as ready when it starts up again.

However, there is still no logic during steady-state to set NetworkUnavailable=true. We need to be very careful about introducing such logic, as it has the potential to break a cluster if not properly tuned.
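For context on what such steady-state logic would have to produce, here is a sketch of the status-patch body a controller could send to `PATCH /api/v1/nodes/<name>/status` to flip the condition. The reason string `CalicoDataPlaneDown` and the message for the broken case are hypothetical; Calico today only writes reasons like CalicoIsUp:

```python
# Sketch: build the JSON body for patching a node's NetworkUnavailable
# condition. Field names match the Kubernetes NodeCondition schema; the
# broken-case reason/message strings are invented for illustration.

from datetime import datetime, timezone

def network_unavailable_patch(broken: bool) -> dict:
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "status": {
            "conditions": [{
                "type": "NetworkUnavailable",
                "status": "True" if broken else "False",
                "reason": "CalicoDataPlaneDown" if broken else "CalicoIsUp",
                "message": ("Calico reported a data plane problem"
                            if broken else "Calico is running on this node"),
                "lastHeartbeatTime": now,
                "lastTransitionTime": now,
            }]
        }
    }
```

The hard part, as the comment above says, is not writing the patch but deciding reliably when to send it: a false positive here makes the scheduler avoid (or automation replace) a healthy node.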

caseydavenport avatar Feb 04 '22 17:02 caseydavenport

Are there any updates on this issue? I checked, and calico v3.28.0 still shows this behaviour. As @caseydavenport said, NetworkUnavailable does not go to the true state during node reboot/poweroff, even with the node being marked as not ready. Is there any plan to address this, since https://github.com/kubernetes/kubernetes/issues/120486 was closed?

ferdinando-terada avatar Aug 20 '24 18:08 ferdinando-terada

This issue is stale because it is kind/enhancement or kind/bug and has been open for 180 days with no activity.

github-actions[bot] avatar Aug 11 '25 00:08 github-actions[bot]

This issue is stale because it is kind/enhancement or kind/bug and has been open for 180 days with no activity.

Boo!

lodotek avatar Aug 21 '25 15:08 lodotek

Are there any updates on this issue? I checked, and calico v3.28.0 still shows this behaviour. As @caseydavenport said, NetworkUnavailable does not go to the true state during node reboot/poweroff, even with the node being marked as not ready. Is there any plan to address this, since kubernetes/kubernetes#120486 was closed?

Did you find a good solution?

lodotek avatar Aug 21 '25 15:08 lodotek

I think we might be able to improve this, at least by moving the code that configures NetworkUnavailable on startup so that it doesn't happen immediately, but only after some basic checks pass (e.g., BIRD and Felix are both ready, perhaps?)
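The ordering change suggested above can be sketched as a small gate: the condition is only set to False once every startup check passes. The check names are placeholders for whatever calico/node can actually observe (e.g. its bird-ready and felix-ready health reports); this is not Calico's code:

```python
# Sketch: decide the NetworkUnavailable condition from a set of startup
# checks, instead of clearing it unconditionally at launch. Check names and
# the failure reason string are hypothetical.

def startup_condition(checks: dict) -> tuple:
    """Return (status, reason) for the NetworkUnavailable condition."""
    failed = sorted(name for name, ok in checks.items() if not ok)
    if failed:
        return "True", "CalicoStartupChecksFailed: " + ", ".join(failed)
    return "False", "CalicoIsUp"

# Until BIRD is ready, the node would stay NetworkUnavailable.
print(startup_condition({"felix_ready": True, "bird_ready": False}))
```

This addresses the "set too early during startup" half of the report; runtime failures would still need a separate steady-state check.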

caseydavenport avatar Aug 21 '25 15:08 caseydavenport

I took a first stab at what this might look like: https://github.com/projectcalico/calico/pull/10866

caseydavenport avatar Aug 21 '25 17:08 caseydavenport

https://github.com/projectcalico/calico/pull/10866 will ship in v3.31. It improves the behavior here, but it still relies on calico/node running in order to mark the node, so I don't think it fully closes this request.

caseydavenport avatar Sep 04 '25 15:09 caseydavenport