[Bug]: Pod eviction error during node maintenance
Is there an existing issue already for this bug?
- [x] I have searched for an existing issue, and could not find anything. I believe this is a new bug.
I have read the troubleshooting guide
- [x] I have read the troubleshooting guide and I think this is a new bug.
I am running a supported version of CloudNativePG
- [x] I am running a supported version of CloudNativePG.
Contact Details
Version
1.24 (latest patch)
What version of Kubernetes are you using?
1.30
What is your Kubernetes environment?
Cloud: Azure AKS
How did you install the operator?
Helm
What happened?
We use Azure AKS and we have set up PostgreSQL with 2 instances. Our AKS maintenance window runs every Saturday, when Azure performs the underlying node upgrade. We are noticing that the upgrade fails with this error:
(UpgradeFailed) Drain node
failed when evicting pod failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See http://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: blocked by pdb (MinAvailable: 1) (CurrentHealthy: 1) (DesiredHealthy: 1) (ExpectedPods: 1) with 0 un Code: UpgradeFailed
We run this PostgreSQL cluster with 2 instances at all times, so we are not sure why we see this issue.
The PDB is shown as:

```
kubectl get pdb -n <namespace>
NAME     MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
<name>   1               N/A               0                     5d1h
```
When we disable the PDB, the error goes away and the node upgrade completes without any issue.
This is causing our node pool and cluster status to end up in the Failed state after each upgrade.
As a workaround we have set `enablePDB: false`, but we wanted to check how we can fix this while keeping the PDB enabled.
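For reference, the workaround described above corresponds to disabling the operator-managed PodDisruptionBudgets in the `Cluster` spec via the `enablePDB` field. A minimal sketch (the cluster name and storage size are placeholders, not taken from this report):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: sampledb-cluster   # placeholder name
spec:
  instances: 2
  enablePDB: false         # stop the operator from creating PodDisruptionBudgets
  storage:
    size: 10Gi             # placeholder size
```

Note that this trades eviction safety for smoother node maintenance: with no PDB, a drain can evict both instances at once.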
Cluster resource
Relevant log output
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
Hello @vaiddas! I'm here to help you tackle bugs, answer questions, and assist you in becoming a contributor. While we wait for a human maintainer, feel free to share more details or ask anything you need assistance with!
The issue you're experiencing with Pod Disruption Budgets (PDBs) during Azure AKS maintenance is likely due to the configuration of the PDB, which is preventing the eviction of pods during node upgrades. In a two-instance PostgreSQL setup, the PDB is configured to ensure that at least one instance remains available, which can block node draining if both instances are needed to maintain the minimum availability.
Here are some steps you can take to address this issue while keeping PDBs enabled:
1. **Adjust the PDB configuration**: Consider modifying the PDB to allow more flexibility during maintenance. You can set `maxUnavailable` to `1` instead of relying solely on `minAvailable`. This allows one pod to be evicted during maintenance, which helps in scenarios where node upgrades are necessary.
2. **Increase the instance count**: If possible, increase the number of PostgreSQL instances to three. This allows one instance to be evicted during node maintenance while still maintaining the required availability.
3. **Schedule maintenance windows**: Align your maintenance windows with periods of low database activity to minimize the impact of potential disruptions.
4. **Use shared storage**: If your setup allows, consider shared storage that can be accessed by pods on different nodes, so that a pod rescheduled after a node drain can reuse the same storage and recover quickly.
5. **Review the AKS maintenance configuration**: Ensure that your AKS maintenance configuration is optimized for your workload, and consider staggering node upgrades to reduce the impact on your PostgreSQL cluster.
By implementing these strategies, you should be able to maintain the benefits of PDBs while minimizing disruptions during Azure AKS maintenance operations.
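The second suggestion above can be sketched as a `Cluster` fragment. This is illustrative only (the name and storage size are placeholders); note that CloudNativePG generates the PDBs itself, so the instance count is the knob you actually control here:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: sampledb-cluster   # placeholder name
spec:
  instances: 3   # with two replicas, the replica PDB can tolerate one disruption
  storage:
    size: 10Gi   # placeholder size
```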
Same thing happens in GKE
> A Pod Disruption Budget exists that allows for 0 Pod evictions. A Pod Disruption Budget must allow for at least 1 Pod eviction so that GKE can perform maintenance.
I wonder what the official stance on fixing this is. `enablePDB: false` is very tempting.
Team, we increased the instance count to 3, but we now see an eviction error for the primary instance. During routine maintenance, the primary instance throws the error on eviction.
```
NAME                 READY   STATUS    RESTARTS   AGE
sampledb-cluster-1   1/1     Running   0          23h
sampledb-cluster-2   1/1     Running   0          23h
sampledb-cluster-3   1/1     Running   0          23h
```

```
NAME                       MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
sampledb-cluster           1               N/A               1                     70d
sampledb-cluster-primary   1               N/A               0                     70d
```
Can you please suggest how to fix this?
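The output above shows a dedicated `sampledb-cluster-primary` PDB with 0 allowed disruptions, so a drain of the primary's node will always be blocked. One possible mitigation (a sketch, not official project guidance) is to move the primary off the node before it is drained, using the `cnpg` kubectl plugin; the cluster, pod, and namespace names below are placeholders:

```
# Show cluster topology and identify the current primary
# (assumes the cnpg kubectl plugin is installed)
kubectl cnpg status sampledb-cluster -n <namespace>

# Promote a replica, so the former primary becomes a replica
# covered by the more permissive replica PDB and can be evicted
kubectl cnpg promote sampledb-cluster sampledb-cluster-2 -n <namespace>
```

This only helps for planned maintenance you can run commands before; it does not unblock fully automated node upgrades.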
I've been running into this issue as well in Google Cloud. Automatic node upgrades no longer work.
This issue is stale because it has been open for 60 days with no activity.
@gbartolini, we would love to hear your thoughts on this ticket. There are several reports of this default behavior breaking automatic maintenance windows of cloud providers. #7455, #6858.
It would be great if we could have the option to simply not emit that PDB.
Team, any update on fixing this?