
[Bug]: POD eviction error at the time of Node Maintenance

Open vaiddas opened this issue 8 months ago • 2 comments

Is there an existing issue already for this bug?

  • [x] I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

  • [x] I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

  • [x] I have read the troubleshooting guide and I think this is a new bug.

Contact Details

[email protected]

Version

1.24 (latest patch)

What version of Kubernetes are you using?

1.30

What is your Kubernetes environment?

Cloud: Azure AKS

How did you install the operator?

Helm

What happened?

We use Azure AKS and have set up a PostgreSQL cluster with 2 instances. Our AKS maintenance window runs every Saturday, when Azure performs the underlying node upgrades. We are noticing that the upgrade fails with the following error:

(UpgradeFailed) Drain node failed when evicting pod failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See http://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: blocked by pdb (MinAvailable: 1) (CurrentHealthy: 1) (DesiredHealthy: 1) (ExpectedPods: 1) with 0 un Code: UpgradeFailed

We are running this PostgreSQL cluster with 2 instances at all times, so we are not sure why we see this issue.

The PDB is shown as:

kubectl get pdb -n <namespace>
NAME                      MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
<name>                   1               N/A               0                     5d1h
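For context, the `ALLOWED DISRUPTIONS: 0` above follows directly from the numbers in the AKS error message. A minimal sketch of the eviction check the API server performs for a PDB with an integer `minAvailable` (simplified; variable names mirror the fields in the error, this is not the actual Kubernetes source):

```python
# Simplified model of the PDB eviction check for an integer minAvailable.
# disruptionsAllowed = currentHealthy - desiredHealthy, floored at zero.
def allowed_disruptions(current_healthy: int, min_available: int) -> int:
    desired_healthy = min_available  # for an integer minAvailable
    return max(0, current_healthy - desired_healthy)

# Values from the AKS error: CurrentHealthy: 1, DesiredHealthy: 1
print(allowed_disruptions(current_healthy=1, min_available=1))  # 0 -> eviction blocked
print(allowed_disruptions(current_healthy=2, min_available=1))  # 1 -> eviction allowed
```

With only one healthy pod covered by the PDB and `minAvailable: 1`, zero evictions are allowed, which is exactly what blocks the node drain.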

When we disable the PDB, the error goes away and the node upgrade completes without any issue.

This is leaving our node pool and cluster status in the Failed state after each upgrade activity.

As a workaround we have disabled PDBs via `enablePDB`, but we wanted to check how we can fix this while keeping PDBs enabled.
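For reference, the workaround described above maps to a single field in the CloudNativePG `Cluster` spec (a minimal sketch; the cluster name and storage size are illustrative, not taken from the reporter's setup):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: sampledb-cluster   # illustrative name
spec:
  instances: 2
  enablePDB: false         # workaround: operator stops creating PDBs for this cluster
  storage:
    size: 1Gi              # illustrative value
```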

Cluster resource


Relevant log output


Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

vaiddas avatar Apr 14 '25 00:04 vaiddas

Hello @vaiddas! I'm here to help you tackle bugs, answer questions, and assist you in becoming a contributor. While we wait for a human maintainer, feel free to share more details or ask anything you need assistance with!

The issue you're experiencing with Pod Disruption Budgets (PDBs) during Azure AKS maintenance is likely due to the configuration of the PDB, which is preventing the eviction of pods during node upgrades. In a two-instance PostgreSQL setup, the PDB is configured to ensure that at least one instance remains available, which can block node draining if both instances are needed to maintain the minimum availability.

Here are some steps you can take to address this issue while keeping PDBs enabled:

  1. Adjust PDB Configuration: Consider modifying the PDB to allow for more flexibility during maintenance. You can set maxUnavailable to 1 instead of relying solely on minAvailable. This configuration allows one pod to be evicted during maintenance, which can help in scenarios where node upgrades are necessary.

  2. Increase Instance Count: If possible, increase the number of PostgreSQL instances to three. This setup allows for more flexibility during node maintenance, as one instance can be evicted while still maintaining the required availability.

  3. Schedule Maintenance Windows: Align your maintenance windows with periods of low database activity to minimize the impact of potential disruptions.

  4. Use Shared Storage: If your setup allows, consider using shared storage that can be accessed by pods on different nodes. This setup can help in quickly recovering from node drains as the storage can be reused by pods on other nodes.

  5. Review AKS Maintenance Configuration: Ensure that your AKS maintenance configuration is optimized for your workload, and consider staggering node upgrades to reduce the impact on your PostgreSQL cluster.

By implementing these strategies, you should be able to maintain the benefits of PDBs while minimizing disruptions during Azure AKS maintenance operations.
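Suggestion 2 above corresponds to changing one field in the `Cluster` resource. A hedged sketch (names and sizes are illustrative): with three instances, the replica PDB covers two replicas with `minAvailable: 1`, so one replica can be evicted during a drain.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: sampledb-cluster   # illustrative name
spec:
  instances: 3             # two replicas -> one eviction allowed under minAvailable: 1
  storage:
    size: 1Gi              # illustrative value
```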


dosubot[bot] avatar Apr 14 '25 00:04 dosubot[bot]

Same thing happens in GKE

A Pod Disruption Budget exists that allows for 0 Pod evictions. A Pod Disruption Budget must allow for at least 1 Pod eviction so that GKE can perform maintenance. Learn more


I wonder what the official posture to fix this is. enablePDB: false is very tempting.

p4block avatar Jun 12 '25 11:06 p4block

Team, we have increased the instance count to 3, but we now see an eviction error for the primary instance. During routine maintenance, evicting the primary instance fails.

NAME                 READY   STATUS    RESTARTS   AGE
sampledb-cluster-1   1/1     Running   0          23h
sampledb-cluster-2   1/1     Running   0          23h
sampledb-cluster-3   1/1     Running   0          23h

NAME                       MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
sampledb-cluster           1               N/A               1                     70d
sampledb-cluster-primary   1               N/A               0                     70d

Can you please suggest how to fix this?
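One commonly suggested mitigation (an assumption on my part, not an official fix from the project) is to perform a switchover before the maintenance window, so the node being drained holds only a replica rather than the primary: the `-primary` PDB above covers a single pod with `minAvailable: 1`, which always yields 0 allowed disruptions. A sketch using the `cnpg` kubectl plugin (plugin assumed installed; names and namespace are placeholders):

```
# Inspect the cluster and identify the current primary
kubectl cnpg status sampledb-cluster -n <namespace>

# Promote a replica, turning the old primary into an evictable replica
kubectl cnpg promote sampledb-cluster sampledb-cluster-2 -n <namespace>
```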

vaiddas avatar Jun 23 '25 22:06 vaiddas

I've been running into this issue as well in Google Cloud. Automatic node upgrades no longer work.

sarahhenkens avatar Jul 01 '25 04:07 sarahhenkens

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] avatar Aug 31 '25 02:08 github-actions[bot]

@gbartolini, we would love to hear your thoughts on this ticket. There are several reports of this default behavior breaking automatic maintenance windows of cloud providers. #7455, #6858.

It would be great if we could have the option to simply not emit that PDB.

sarahhenkens avatar Sep 13 '25 04:09 sarahhenkens

Team, Any update on fixing this?

vaiddas avatar Nov 06 '25 00:11 vaiddas