Update Kubernetes Node condition `NetworkUnavailable` in case of problems
Hi,
we are sometimes encountering problems with calico-node on Kubernetes.
Issue Description
The condition type NetworkUnavailable on a Kubernetes node seems to be initialised to False only on startup.
However, sometimes we are experiencing issues such as (from calico-node logs):
bird: Mesh_10_250_10_60: Socket error: bind: Address in use
bird: Mesh_10_250_19_13: Socket error: bind: Address in use
bird: Mesh_10_250_21_169: Socket error: bind: Address in use
The aforementioned node condition remains False with reason CalicoIsUp, indicating that everything is working fine.
Expected Behavior
I would expect Calico's state to be reflected in the Kubernetes node condition NetworkUnavailable, including when abnormalities occur at runtime.
Current Behavior
The condition type NetworkUnavailable on a Kubernetes node is only initialised to False on startup, so problems during runtime are not reflected in the node condition.
Another possibility is that Calico sets NetworkUnavailable too early during startup, with the actual error happening after initialisation.
Context
We are creating Kubernetes nodes via the Machine Controller Manager, with calico-node deployed as a DaemonSet.
We are monitoring the node conditions, including NetworkUnavailable, so that we can replace a node if Calico has problems.
Your Environment
- calico/node version: <TODO>
- Operating System and version: Ubuntu 18.04.4 LTS (though observed independently)
- Link to your project (optional):
To add to what @danielfoehrKn wrote, we are using calico v3.13.4.
Cross referencing https://github.com/projectcalico/calico/issues/435, which discusses the same problem.
I will close that one, since it's a bit old and originally was scoped just at setting the value on launch.
bird: Mesh_10_250_10_60: Socket error: bind: Address in use
Separate from this issue, do you guys know why you're encountering this? Sounds like it could be a bug. Might be worth opening another issue just to discuss this aspect of it.
My guess is that bird is trying to re-bind to the node, not sure why though.
Also created another issue projectcalico/node#522 for the bind error.
I observed another incident where the calico-node pod was not ready but the node condition NetworkUnavailable still looked healthy.
This time I do not think it has anything to do with the bird issue mentioned above, but rather with a misconfiguration of the Kubernetes API server with mutating webhook configurations (Calico using Kubernetes as its datastore). I am not 100% sure here (I do not know much about the internals of Calico), but will paste the calico-node logs below.
bird: Mesh_10_250_0_4: State changed to down
bird: Reconfigured
2020-07-10 00:09:16.935 [INFO][49] resource.go 260: Target config /etc/calico/confd/config/bird.cfg has been updated
bird: Mesh_10_250_0_6: State changed to stop
bird: Mesh_10_250_0_6: State changed to down
bird: Mesh_10_250_0_6: Starting
bird: Mesh_10_250_0_6: State changed to start
bird: Mesh_10_250_0_3: State changed to start
2020-07-10 02:06:44.481 [INFO][49] util.go 66: /etc/calico/confd/config/bird.cfg has md5sum 15f7c1e79ac11f6e060b69f78724a3cc should be c28f3597463c952853af74688ea72f8c
2020-07-10 02:06:44.481 [INFO][49] resource.go 220: Target config /etc/calico/confd/config/bird.cfg out of sync
bird: Reconfiguration requested by SIGHUP
bird: Reconfiguration requested by SIGHUP
bird: Reconfiguring
bird: device1: Reconfigured
bird: direct1: Reconfigured
bird: Removing protocol Mesh_10_250_0_3
bird: Mesh_10_250_0_3: Shutting down
bird: Mesh_10_250_0_3: State changed to stop
bird: Mesh_10_250_0_6: Reconfigured
bird: Reconfiguring
bird: device1: Reconfigured
bird: direct1: Reconfigured
bird: Reconfigured
bird: Mesh_10_250_0_3: State changed to down
bird: Reconfigured
2020-07-10 02:06:44.490 [INFO][49] resource.go 260: Target config /etc/calico/confd/config/bird.cfg has been updated
In any case, the topic of this issue is again that the node showed up as Ready and pods continued to be scheduled on it, even though its network was broken.
If the node had been marked as unhealthy, our automation could have replaced it automatically.
Thank you and let me know if you need more information!
In our case, when we provision new nodes and their ToR's BGP peer is not yet configured, bird cannot connect, but the nodes incorrectly show up as Ready in Kubernetes.
# kubectl get node
NAME STATUS ROLES AGE VERSION
prod-01-node-t09-1 Ready <none> 14m v1.16.8
root@prod-01-node-t09-1:~# calicoctl.sh node status
Calico process is running.
IPv4 BGP status
+---------------+---------------+-------+----------+--------------------------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+---------------+---------------+-------+----------+--------------------------------+
| 10.100.66.254 | node specific | start | 17:54:03 | Connect Received: Connection |
| | | | | rejected |
+---------------+---------------+-------+----------+--------------------------------+
Where are we with this issue? I currently have a node where calico-node is not ready but the node is still shown as Ready.
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/calico-node-twhv7 0/1 Running 3 28h 10.1.32.33 5-21-354-1178-1-c47f7fa5 <none> <none>
Name: 5-21-354-1178-1-c47f7fa5
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
f5role=worker
kubernetes.io/arch=amd64
kubernetes.io/hostname=5-21-354-1178-1-c47f7fa5
kubernetes.io/os=linux
location=dev
kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 10.1.32.33/32
projectcalico.org/IPv4IPIPTunnelAddr: 10.101.13.64
projectcalico.org/labels: {"edge":"true"}
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 01 Feb 2022 15:33:02 +0100
Taints: node.kubernetes.io/unschedulable:NoSchedule
Unschedulable: true
Lease:
HolderIdentity: 5-21-354-1178-1-c47f7fa5
AcquireTime: <unset>
RenewTime: Wed, 02 Feb 2022 19:41:22 +0100
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Wed, 02 Feb 2022 11:22:41 +0100 Wed, 02 Feb 2022 11:22:41 +0100 CalicoIsUp Calico is running on this node
MemoryPressure False Wed, 02 Feb 2022 19:39:03 +0100 Wed, 02 Feb 2022 11:22:25 +0100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 02 Feb 2022 19:39:03 +0100 Wed, 02 Feb 2022 11:22:25 +0100 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 02 Feb 2022 19:39:03 +0100 Wed, 02 Feb 2022 11:22:25 +0100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 02 Feb 2022 19:39:03 +0100 Wed, 02 Feb 2022 11:22:25 +0100 KubeletReady kubelet is posting ready status
Currently, calico/node will mark a node as not ready when it is gracefully terminated (e.g., during a rolling update or when the pod is drained from a node) and will mark the node as ready again when it starts back up.
However, there is still no logic during steady-state to set NetworkUnavailable=true. We need to be very careful about introducing such logic, as it has the potential to break a cluster if not properly tuned.
Are there any updates on this issue? I checked that Calico v3.28.0 still shows this behaviour. As @caseydavenport said, NetworkUnavailable does not go to True during node reboot/poweroff, even with the node marked as NotReady. Is there a plan to address this, since https://github.com/kubernetes/kubernetes/issues/120486 was closed?
This issue is stale because it is kind/enhancement or kind/bug and has been open for 180 days with no activity.
Boo!
> Are there any updates on this issue? I checked that Calico v3.28.0 still shows this behaviour. As @caseydavenport said, NetworkUnavailable does not go to True during node reboot/poweroff, even with the node marked as NotReady. Is there a plan to address this, since kubernetes/kubernetes#120486 was closed?
Did you find a good solution?
I think we might be able to improve this at least by moving the code that configures NetworkUnavailable on startup so that it doesn't happen immediately, but only after some basic checks have passed (e.g., BIRD and Felix are both ready, perhaps?)
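To illustrate the idea (this is not Calico's actual code, just a sketch): the gating could require every probe to pass several consecutive polls before NetworkUnavailable is cleared, so a single lucky check right after startup is not enough. The Probe type and the probe implementations below are assumptions standing in for calico/node's real internal readiness checks:

```go
package main

import "fmt"

// Probe reports whether one subsystem (e.g. Felix, BIRD) currently looks
// healthy. Both probes used below are stand-ins for real checks.
type Probe func() bool

// gate returns a closure that reports true (meaning: it is now safe to
// clear NetworkUnavailable) only once every probe has passed `required`
// consecutive polls. Any single failure resets the streak to zero.
func gate(probes []Probe, required int) func() bool {
	streak := 0
	return func() bool {
		for _, p := range probes {
			if !p() {
				streak = 0
				return false
			}
		}
		streak++
		return streak >= required
	}
}

func main() {
	birdUp := false
	felix := Probe(func() bool { return true })
	bird := Probe(func() bool { return birdUp })
	ready := gate([]Probe{felix, bird}, 2)

	fmt.Println(ready()) // false: bird not up yet
	birdUp = true
	fmt.Println(ready()) // false: only the first consecutive pass
	fmt.Println(ready()) // true: second consecutive pass, gate opens
}
```

A debounce like this trades a slightly slower "node is fine" signal for not advertising a healthy network before BGP sessions have actually settled.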
I took a first stab at what this might look like: https://github.com/projectcalico/calico/pull/10866
https://github.com/projectcalico/calico/pull/10866 will show up in v3.31 - it improves the behavior here, but still relies on calico/node running in order to mark the node so I don't think it fully closes this request.