
[Bug]: Karpenter node disruption causes CloudNative-PG cluster downtime despite PDB and multi-instance configuration

Open sats268842 opened this issue 3 months ago • 4 comments

Is there an existing issue already for this bug?

  • [x] I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

  • [x] I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

  • [x] I am running a currently supported version of CloudNativePG.

Contact Details

[email protected]

Version

1.27 (latest patch)

What version of Kubernetes are you using?

1.32

What is your Kubernetes environment?

Cloud: Amazon EKS

How did you install the operator?

Helm

What happened?

The CloudNative-PG cluster experiences a few minutes of database downtime when Karpenter node disruption budgets trigger node replacements. Despite having Pod Disruption Budgets (PDBs) enabled and a 3-instance cluster configuration, applications lose database connectivity during node transitions.

The issue occurs when:

  1. Karpenter disruption budget activates (scheduled consolidation for underutilized nodes; a NodePool sketch follows this list)
  2. Nodes hosting PostgreSQL pods get marked for termination/replacement
  3. Database becomes unavailable for several minutes during node transitions
  4. Applications experience connection timeouts and errors
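
For context, a Karpenter NodePool with a schedule-based ("business-hours aware") disruption budget of the kind described in step 1 might look roughly like this; the pool name, schedule, taint key, and node class are assumptions, not taken from the actual environment:

```yaml
# Hypothetical NodePool sketch; names, schedule, and taints are assumptions.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: critical-apps
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a"]      # single-zone constraint from the report
      taints:
        - key: dedicated              # assumed taint for the critical node pool
          value: critical-apps
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1h
    budgets:
      - nodes: "0"                    # block voluntary disruptions...
        schedule: "0 9 * * 1-5"       # ...starting 09:00 Mon-Fri
        duration: 8h                  # ...for the 8-hour business day
      - nodes: "1"                    # otherwise, disrupt at most one node at a time
```

With budgets like these, consolidation of the PostgreSQL nodes happens outside business hours, which matches the reported pattern of scheduled node replacements.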

What did you expect to happen?

With a 3-instance PostgreSQL cluster and PDB enabled, I expected:

  • Database to maintain availability during node disruptions
  • Pod eviction to respect PDB limits ensuring at least one instance remains available
  • Graceful failover without service interruption
  • Leader election to complete seamlessly during node transition

Cluster resource

## Environment Details

- **Kubernetes**: AWS EKS
- **Node Autoscaler**: Karpenter with business-hours aware disruption budgets
- **CloudNative-PG**: 1.27 (latest patch), deployed via Helm chart
- **Storage**: AWS EBS CSI Driver with `ebs-sc` storage class
- **Node Pool**: Critical applications node pool with dedicated taints
- **Zone Configuration**: Single zone constraint (`us-east-1a`)
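
Since the actual Cluster resource was not pasted, here is a minimal sketch of a Cluster matching the environment above (3 instances, ebs-sc storage, single-zone scheduling, tolerations for the dedicated node pool); the cluster name, storage size, and taint key are assumptions:

```yaml
# Sketch only; cluster name, storage size, and taint key are assumptions.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-cluster
spec:
  instances: 3
  enablePDB: true                     # default; shown only to make the setting explicit
  storage:
    storageClass: ebs-sc
    size: 50Gi
  affinity:
    nodeSelector:
      topology.kubernetes.io/zone: us-east-1a
    tolerations:
      - key: dedicated                # assumed taint on the critical-apps node pool
        value: critical-apps
        effect: NoSchedule
```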

Relevant log output


Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

sats268842 avatar Sep 06 '25 11:09 sats268842

Hi @sats268842! I'm Dosu and I’m helping the cloudnative-pg team.

There is an open issue tracking this exact problem: CloudNativePG currently does not recognize Karpenter's disruption taint (karpenter.sh/disruption:NoSchedule), so when Karpenter replaces nodes, the operator does not automatically trigger a switchover or handle pod eviction as it would with standard Kubernetes cordon/drain signals. This leads to downtime and blocked node cycling, even with PDBs and a multi-instance setup. Disabling PDBs is a possible workaround, but it's not recommended for production due to the risk of unplanned downtime [source].
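
For reference, the PDB workaround mentioned above is a single field on the Cluster spec; a minimal sketch, with other fields omitted and the cluster name assumed (again, disabling PDBs is not recommended for production):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-cluster     # assumed name
spec:
  instances: 3
  # Stops the operator from creating PodDisruptionBudgets, so Karpenter can
  # evict instance pods freely; trades disruption protection for node cycling.
  enablePDB: false
  # ... storage and other settings unchanged
```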

Your single-AZ, EBS-backed setup can also introduce delays during pod rescheduling, which can further increase downtime during node transitions, even if the operator logic is working as intended [source].

A feature to honor Karpenter taints was introduced in v1.26, but users still report issues with automatic failover and node cycling in Karpenter-managed clusters. Monitoring the progress of issue #5299 is the best way to track improvements in this area.

For now, you can mitigate downtime by:

  • monitoring node events and manually triggering a failover when Karpenter marks nodes for disruption
  • considering a multi-AZ deployment to reduce EBS-related delays
  • reviewing cluster logs and events during disruptions to pinpoint where delays occur

If you have logs or events from the downtime window, they may help clarify whether the delay is due to pod scheduling, storage attachment, or failover logic.
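
If a multi-AZ deployment is an option, the Cluster's affinity settings can spread instances across zones instead of pinning them to us-east-1a; a hedged sketch (cluster name and storage size are assumptions):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-cluster                    # assumed name
spec:
  instances: 3
  affinity:
    enablePodAntiAffinity: true
    topologyKey: topology.kubernetes.io/zone
    podAntiAffinityType: required     # schedule each instance in a different zone
  storage:
    storageClass: ebs-sc              # EBS volumes are zone-bound, so each instance
    size: 50Gi                        # keeps its own volume in its own zone
```

With required anti-affinity across zones, a failover target already exists outside the disrupted zone, so a single node replacement cannot take down the primary and both replicas together.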

To reply, just mention @dosu.



dosubot[bot] avatar Sep 06 '25 11:09 dosubot[bot]

Could this be caused by the configuration below?

smartShutdownTimeout: 180
startDelay: 3600
stopDelay: 1800

sats268842 avatar Sep 18 '25 18:09 sats268842

Yes, those config values can directly increase downtime during node disruptions. With smartShutdownTimeout: 180, stopDelay: 1800, and startDelay: 3600, CloudNativePG will allow up to 3 minutes for the smart shutdown phase, up to 30 minutes for a full shutdown, and up to 1 hour for a pod to become ready after startup. These are the defaults and are designed for resilience, but in environments where fast failover is needed (such as Karpenter-managed clusters or single-AZ EBS setups), they can keep pods unavailable for several minutes during node transitions, especially if storage reattachment is slow or multiple pods are disrupted at once.

Tuning startDelay and stopDelay down to the minimum your workload can tolerate will help reduce downtime during node disruptions. For example, some users set startDelay and stopDelay to 60–300 seconds for faster failover, but you should test what works for your workload and data-safety requirements. The operator and pod lifecycle logic fully respect these values, so they are a key lever for minimizing disruption time [docs].
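
As a concrete illustration of the tuning described above, the values can be lowered directly on the Cluster spec along these lines (the exact numbers are examples to test against your workload, not recommendations):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-cluster          # assumed name
spec:
  instances: 3
  startDelay: 120           # max seconds for an instance to become ready (default 3600)
  stopDelay: 120            # max seconds for a full shutdown (default 1800)
  smartShutdownTimeout: 60  # seconds reserved for the smart shutdown phase (default 180)
  # ... storage and other settings unchanged
```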

Keep in mind that there are also known issues with Karpenter taint handling and node cycling, but these timing parameters are a significant factor in how long the database is unavailable during disruptions [example].

To reply, just mention @dosu.



dosubot[bot] avatar Sep 18 '25 18:09 dosubot[bot]

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] avatar Nov 28 '25 02:11 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Dec 12 '25 02:12 github-actions[bot]