
Operator does not cleanly allow node drains to complete

Open braunsonm opened this issue 1 year ago • 14 comments

Describe the bug On AWS EKS, nodes are set to SchedulingDisabled and pods are evicted in batches (not cordoned). With Knative Serving deployed using the operator, some workloads never drain when HA is set to 3.

Expected behavior The Knative Operator should allow these components to drain without user interaction.

To Reproduce

  1. Deploy Knative Serving on EKS with HA set to 3 using the Knative Operator
  2. Upgrade the version of Kubernetes by updating the AMI template. This will trigger AWS to do a rolling upgrade of the nodes
  3. Notice that knative-serving components cause the operation to hang indefinitely until a human forcibly kills the pod in question.

Knative release version 1.13.0

Additional context I have enough nodes that the PDB shouldn't be violated.

braunsonm avatar Feb 12 '24 20:02 braunsonm

I found the problem. When HA is set to 3, the operator creates a PDB where minAvailable is set to 80%. This will never allow any of those pods to be evicted: with 3 replicas, even 1 unavailable pod leaves only about 66% available, which is below the 80% minimum.
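
To make the arithmetic concrete, here is a minimal sketch of how a percentage minAvailable translates into allowed evictions. It assumes Kubernetes rounds the percentage up to a whole number of pods and that all replicas are currently healthy; the function name `allowed_disruptions` is hypothetical and is not part of any Knative or Kubernetes API.

```python
def allowed_disruptions(replicas: int, min_available_percent: int) -> int:
    """Estimate how many pods a PDB lets you evict at once.

    Assumption: the percentage is rounded UP to a whole pod count,
    and every replica is currently healthy.
    """
    # Integer ceiling of replicas * percent / 100 (avoids float rounding).
    required = (replicas * min_available_percent + 99) // 100
    return max(0, replicas - required)


# HA = 3 with minAvailable 80%: all 3 pods are required, so nothing can be evicted.
print(allowed_disruptions(3, 80))
# HA = 3 with minAvailable 66%: 2 pods are required, so 1 can be evicted.
print(allowed_disruptions(3, 66))
```

Under these assumptions, 80% of 3 replicas rounds up to 3 required pods, which is why a node drain can never evict one of them.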

braunsonm avatar Feb 13 '24 14:02 braunsonm

@braunsonm Thanks for reporting the issue. Do you have any suggestions for how the operator could change or improve to avoid this issue?

houshengbo avatar Feb 27 '24 20:02 houshengbo

@houshengbo I think the operator should set maxUnavailable to 1 as a sensible default (roll each of these critical components one at a time), while continuing to allow the user to override that.

braunsonm avatar Feb 27 '24 20:02 braunsonm

Is maxUnavailable for Knative Serving or Knative Eventing? I would say this configuration belongs to them, right? The operator does not configure them by default; instead, it reads their manifests and uses the default values from Serving or Eventing. You can use the operator CRs to configure the PodDisruptionBudget.

houshengbo avatar Mar 08 '24 16:03 houshengbo

Yes and no. Max unavailable would be set for serving I think. But if you're configuring HA it would make sense that the operator creates a PDB so that HA is actually guaranteed. Otherwise you could still have an outage if the pods are evicted at the same time.

I agree with allowing overrides, though, as you currently do.

braunsonm avatar Mar 08 '24 16:03 braunsonm

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Jun 07 '24 01:06 github-actions[bot]

/remove-lifecycle stale

9numbernine9 avatar Jun 07 '24 01:06 9numbernine9

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Sep 06 '24 01:09 github-actions[bot]

Still a problem

braunsonm avatar Sep 06 '24 01:09 braunsonm

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Dec 06 '24 01:12 github-actions[bot]

/remove-lifecycle stale

braunsonm avatar Dec 06 '24 01:12 braunsonm

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Mar 07 '25 01:03 github-actions[bot]

The following works for me in a setup with 2 HA replicas. Following the documentation here, I've added the following to the spec of my knative-serving object:

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
...
spec:
  podDisruptionBudgets:
  - name: activator-pdb
    minAvailable: 40%
  - name: 3scale-kourier-gateway-pdb
    minAvailable: 40%
  - name: webhook-pdb
    minAvailable: 40%
...
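
As a rough way to pick a percentage for other replica counts, the sketch below searches for the largest whole-number minAvailable that still leaves room for one eviction. It assumes Kubernetes rounds the percentage up to a whole pod count and that all replicas are healthy; `max_min_available_percent` is a hypothetical helper, not something the operator provides.

```python
def max_min_available_percent(replicas: int) -> int:
    """Largest whole-number minAvailable percentage that still permits one eviction.

    Assumption: the percentage is rounded UP to a whole pod count,
    and every replica is currently healthy.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    for percent in range(100, -1, -1):
        # Integer ceiling of replicas * percent / 100.
        required = (replicas * percent + 99) // 100
        if replicas - required >= 1:
            return percent
    return 0  # Not reached for replicas >= 1, since 0% always allows an eviction.


for n in (2, 3, 5):
    print(n, max_min_available_percent(n))
```

By this estimate, 40% is comfortably within the limit for 2 replicas, while 80% only starts to allow an eviction at 5 replicas.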

MeirP-3 avatar Mar 27 '25 19:03 MeirP-3

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Jun 26 '25 01:06 github-actions[bot]