
Change operator defaults so that Typha no longer ignores node cordoning


The typha deployment is created by the operator with very broad tolerations, which cause it to ignore node cordoning.

Expected Behavior

If one or more nodes are cordoned, typha pods should be rescheduled onto non-cordoned nodes.

Current Behavior

When nodes are drained, and therefore very lightly loaded, the scheduler prefers to place typha pods on these cordoned nodes (the default tolerations allow it). When these nodes are eventually taken offline, the typha deployment ends up with fewer ready replicas than requested, because the replacement pods created to satisfy the ReplicaSet are again scheduled onto the down nodes.
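As I understand the mechanism (illustrative, not a claim about the operator's exact defaults): cordoning a node adds the node.kubernetes.io/unschedulable:NoSchedule taint, and a key-less toleration of the shape below matches every NoSchedule taint, including that one, so the scheduler remains free to place typha on cordoned nodes.

# Illustrative only: a key-less NoSchedule toleration matches all NoSchedule
# taints, including node.kubernetes.io/unschedulable (applied by cordoning).
- effect: NoSchedule
  operator: Exists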

Possible Solution

We're running the operator with these tolerations, which seems to result in much better behaviour:

- effect: NoSchedule
  key: node.kubernetes.io/not-ready
  operator: Exists
- effect: NoSchedule
  key: node.kubernetes.io/network-unavailable
  operator: Exists
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300
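For reference, a sketch of how such tolerations might be applied through the operator's Installation resource; this assumes the typhaDeployment override is available in your operator version, so check the docs for the exact field names:

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  typhaDeployment:
    spec:
      template:
        spec:
          tolerations:
            - effect: NoSchedule
              key: node.kubernetes.io/not-ready
              operator: Exists
            - effect: NoSchedule
              key: node.kubernetes.io/network-unavailable
              operator: Exists
            - effect: NoExecute
              key: node.kubernetes.io/not-ready
              operator: Exists
              tolerationSeconds: 300
            - effect: NoExecute
              key: node.kubernetes.io/unreachable
              operator: Exists
              tolerationSeconds: 300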

Steps to Reproduce (for bugs)

Cordon and drain the nodes on which typha pods are running.
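Roughly, assuming an operator-managed install where typha runs in calico-system with the k8s-app=calico-typha label (adjust names for your setup):

# See which nodes currently run typha
kubectl -n calico-system get pods -l k8s-app=calico-typha -o wide

# Cordon and drain one of those nodes
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# The replacement typha pod can land right back on the cordoned node,
# because the default tolerations match the unschedulable taint
kubectl -n calico-system get pods -l k8s-app=calico-typha -o wide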

Context

Your Environment

  • Calico version: v3.27.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubeadm

SammyA avatar Dec 22 '23 10:12 SammyA

Thanks for raising - this looks very much like this closed issue: https://github.com/projectcalico/calico/issues/6136

The TL;DR I think right now is:

  • Our default tolerations are very permissive and don't work well with cordoning (you can inspect the current ones with the command sketched below).
  • We do support custom tolerations if you want to adjust the behavior to play nicer when cordoning nodes.
  • We haven't yet changed our defaults, as doing so may impact other users in unexpected ways.
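Something like this should show the tolerations currently applied to the typha deployment (names assume a default operator-managed install):

kubectl -n calico-system get deployment calico-typha \
  -o jsonpath='{.spec.template.spec.tolerations}'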

I still think we probably should change the defaults, but we haven't found the right way to do it yet. The first step is probably to ensure we write the existing defaults back to the Installation API for some number of releases, so that users won't experience a change in behavior on upgrade. Once that is done, we can safely adjust what our code sets as the default without fear of impacting existing users who may be relying on the current defaults.

caseydavenport avatar Dec 27 '23 17:12 caseydavenport

I would second this, and also recommend increasing the priority of this change. The issue is not just with cordoning; it can also cause outages. With the default typha tolerations, if the node that typha is running on happens to die, the typha pod never gets rescheduled so the service stays down. This breaks networking across the whole cluster on just a single node failure.

The recommended settings mentioned above also remediate that case, specifically using tolerationSeconds: 300 on any NoExecute toleration.
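As I understand it, an unreachable node gets tainted with node.kubernetes.io/unreachable under both the NoSchedule and NoExecute effects, which is what the tolerationSeconds bound applies to. You can confirm with something like:

# Inspect the taints on the failed node
kubectl get node <node-name> -o jsonpath='{.spec.taints}'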

cmulk avatar Apr 18 '24 00:04 cmulk

if the node that typha is running on happens to die, the typha pod never gets rescheduled so the service stays down

Could you explain this failure mode a little bit more? I would have thought that (1) Typha would get rescheduled to an available node that satisfies the (very broad) tolerations in use, and (2) that we run multiple copies of Typha to avoid a single node causing a problem.

caseydavenport avatar Apr 18 '24 21:04 caseydavenport

(1) The set of available nodes includes the one that just went down. Since typha was previously scheduled on that node, there is a high probability that it will not be rescheduled on a different node. In fact, since most other workloads will have been rescheduled onto the remaining nodes, the down node will look very "lightly loaded", and the scheduler will prefer to place typha there.

(2) When sufficient typha replicas remain, the service should not go down, but it does run in a degraded state.

SammyA avatar Apr 19 '24 08:04 SammyA

Sure thing @caseydavenport, this is what I'm seeing: I am running a k3s cluster with 1 control-plane node and 2 worker nodes. For (2), this cluster only has one instance of typha; I assume that is because it is smaller than the threshold at which the operator scales up to multiple instances.

For (1), when a node dies, it gets tainted with node.kubernetes.io/unreachable:NoExecute. After 5 minutes all the other pods on that node are evicted and rescheduled to a working node, except for typha. Since typha has the unlimited NoExecute toleration, it just stays in a false Ready state and Kubernetes never even tries to reschedule it.
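For comparison, ordinary pods that don't declare their own NoExecute tolerations get evicted after ~5 minutes because, as I understand the upstream defaults, the DefaultTolerationSeconds admission plugin adds tolerations like these to them:

# Added by Kubernetes' DefaultTolerationSeconds admission plugin when a pod
# doesn't already tolerate these taints; typha already has a blanket
# NoExecute toleration, so these defaults are not added and it is never evicted.
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300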

The Possible Solution mentioned above does resolve this case. After the tolerationSeconds expire, typha gets rescheduled on a working node and the cluster continues to function.

cmulk avatar Apr 19 '24 12:04 cmulk