Emit event when pod eviction is blocked due to pod disruption budget
Description
Have zero downtime for applications with a single replica during consolidation/drift of the underlying node
Very important (Blocker for adopting karpenter)
Hi,
We have a cluster with development environments where each pod runs as a single replica. During consolidation, Karpenter deletes this single pod, so it moves to Terminating while the new pod is still in Init. This of course causes downtime, since new connections cannot be routed to a Terminating pod.
We'd like some option to control how pods are rescheduled during consolidation. Ideally, after the node is cordoned, we would do a rollout restart of the pods instead of draining the node.
I saw one pull request about this, which was closed at the end of last year.
So can we do something about this? Maybe you could just emit an event from the NodeClaim when it is about to be disrupted; we could catch that event from our own custom controller and do a rollout restart of the existing Deployments and StatefulSets, which would reschedule the workloads onto other nodes, and Karpenter would then automatically taint and delete the existing node. We tried to follow this approach, but what we saw is that the DisruptionBlocked event is emitted continuously whenever an app with one replica and a PDB exist simultaneously, irrespective of whether the node is actually disruptable or not. So we really can't run any logic based on the DisruptionBlocked event; it's a false alarm for us.
In a nutshell, we need an event when the node actually could not be disrupted because of the presence of a PDB (not a validation check like the current DisruptionBlocked event). Maybe toggling the sequence of DisruptionBlocked and Unconsolidatable would help. (A rough sketch of the kind of controller we have in mind follows the event list below.)
Below is the sequence of existing events
Normal DisruptionBlocked 116s karpenter Cannot disrupt NodeClaim: pdb "cwe/nginx" prevents pod evictions
Normal Unconsolidatable 35s karpenter NodePool "default" has non-empty consolidation disabled
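For illustration, here is a minimal sketch of the custom controller idea, assuming client-go with in-cluster config. The reason string matched here (DisruptionTerminating, discussed further down) is only an example of "the node is actually being disrupted", and restartWorkloads is a hypothetical placeholder, not anything Karpenter provides:

```go
// Minimal watcher sketch: react when Karpenter emits a disruption event for a NodeClaim.
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// restartWorkloads is a hypothetical placeholder for the rollout-restart logic
// discussed later in this thread.
func restartWorkloads(client kubernetes.Interface, nodeClaim string) {
	log.Printf("TODO: rollout-restart single-replica workloads before %s drains", nodeClaim)
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch core Events whose involved object is a Karpenter NodeClaim.
	w, err := client.CoreV1().Events("").Watch(context.Background(), metav1.ListOptions{
		FieldSelector: "involvedObject.kind=NodeClaim",
	})
	if err != nil {
		log.Fatal(err)
	}
	for ev := range w.ResultChan() {
		e, ok := ev.Object.(*corev1.Event)
		if !ok {
			continue
		}
		// Ignore the pre-flight DisruptionBlocked validation events; react only
		// when Karpenter reports the NodeClaim is actually being disrupted.
		if e.Reason == "DisruptionTerminating" {
			log.Printf("NodeClaim %s is being disrupted: %s", e.InvolvedObject.Name, e.Message)
			restartWorkloads(client, e.InvolvedObject.Name)
		}
	}
}
```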
This is also a big loss in the availability of my production application 😢
Can we disable this validation check conditionally, based on some external NodePool configuration? We are hoping that if we can disable it somehow, we can later catch a message here in the form of events when nodes are in a hung state, rollout-restart the deployments to unblock them, and let the node get deleted.
What if we proceed with a rollout restart policy whenever the current situation doesn't align with the PDB configuration, for all workloads (not only single-replica ones)?
The potential downside is that the pending pods from the restart might trigger the creation of new nodes, which could result in a never-stable environment.
Out of curiosity, I forked this repo, commented out this validation check, and deployed the customised image in my cluster. I found that a node (running an app with a single replica plus a PDB) gets the set of events below on its corresponding NodeClaim. We can clearly see that node deletion is blocked by the PDB violation; I guess DisruptionTerminating is the key event here. When I did a rollout restart of the existing deployment with one replica and a PDB, the pod was gracefully scheduled on another node, and after that the node got deleted.
So if this request gets accepted, we can look for the DisruptionTerminating event from our custom controller and then restart only those Deployments/StatefulSets that have one replica and a PDB (a rough sketch of that restart step follows the events below). Karpenter would automatically take care of the rest.
Not a nice solution, but an effective one.
Normal Launched 34m karpenter Status condition transitioned, Type: Launched, Status: Unknown -> True, Reason: Launched
Normal DisruptionBlocked 34m karpenter Cannot disrupt NodeClaim: state node doesn't contain both a node and a nodeclaim
Normal Registered 33m karpenter Status condition transitioned, Type: Registered, Status: Unknown -> True, Reason: Registered
Normal Initialized 33m karpenter Status condition transitioned, Type: Initialized, Status: Unknown -> True, Reason: Initialized
Normal Ready 33m karpenter Status condition transitioned, Type: Ready, Status: Unknown -> True, Reason: Ready
Normal DisruptionBlocked 27m (x3 over 31m) karpenter Cannot disrupt NodeClaim: state node is nominated for a pending pod
Normal Unconsolidatable 11m (x2 over 26m) karpenter SpotToSpotConsolidation is disabled, can't replace a spot node with a spot node
Normal DisruptionTerminating 9m14s karpenter Disrupting NodeClaim: Drifted/Replace
Warning FailedConsistencyCheck 4m karpenter can't drain node, PDB "cwe/nginx" is blocking evictions
Normal ConsistentStateFound 4m karpenter Status condition transitioned, Type: ConsistentStateFound, Status: True -> False, Reason: ConsistencyCheckFailed, Message: Consistency Check Failed
Normal DisruptionBlocked 35s (x6 over 10m) karpenter Cannot disrupt NodeClaim: state node is marked for deletion
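A hedged sketch of that restart step, i.e. rollout-restarting only Deployments that run a single replica and are covered by a PDB. It assumes client-go; PDB-to-Deployment matching is simplified to label-selector matching against the pod template, and StatefulSets would need the same treatment:

```go
// Sketch: rollout-restart single-replica Deployments covered by a PodDisruptionBudget,
// so their pod is rescheduled before the node drains. Uses the same mechanism as
// `kubectl rollout restart` (bumping a pod-template annotation).
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	deploys, err := client.AppsV1().Deployments("").List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	pdbs, err := client.PolicyV1().PodDisruptionBudgets("").List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}

	for _, d := range deploys.Items {
		if d.Spec.Replicas == nil || *d.Spec.Replicas != 1 {
			continue // only single-replica workloads need the pre-emptive restart
		}
		for _, pdb := range pdbs.Items {
			if pdb.Namespace != d.Namespace {
				continue
			}
			sel, err := metav1.LabelSelectorAsSelector(pdb.Spec.Selector)
			if err != nil || !sel.Matches(labels.Set(d.Spec.Template.Labels)) {
				continue
			}
			// Bump the restartedAt annotation to trigger a rolling restart.
			patch := fmt.Sprintf(
				`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":%q}}}}}`,
				time.Now().Format(time.RFC3339))
			_, err = client.AppsV1().Deployments(d.Namespace).Patch(
				ctx, d.Name, types.StrategicMergePatchType, []byte(patch), metav1.PatchOptions{})
			if err != nil {
				log.Printf("restart of %s/%s failed: %v", d.Namespace, d.Name, err)
			}
			break
		}
	}
}
```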
/assign
I will bring this issue to the community meeting and follow up on next steps.
Hey, can you control this at the pod level by setting maxUnavailable = 0 and maxSurge = 1, so that Kubernetes creates the new pod before removing the previous one?
@hdimitriou No, that's not possible. The Kubernetes eviction API (which Karpenter uses) does not respect maxUnavailable = 0 and maxSurge = 1; it deletes the existing pod before a new one is created.
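To make the point concrete: draining evicts individual pods through the Eviction subresource, which consults only PDBs, never the Deployment's rolling-update strategy. A minimal sketch assuming client-go (the pod name is a placeholder; the namespace is borrowed from the cwe/nginx example above):

```go
// Sketch of how a drain evicts a pod: the Eviction API is checked against
// PodDisruptionBudgets only; maxSurge / maxUnavailable are never consulted.
package main

import (
	"context"
	"log"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Placeholder pod; a real drain iterates over all pods on the node.
	err = client.CoreV1().Pods("cwe").EvictV1(context.Background(), &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: "nginx-xxxxx", Namespace: "cwe"},
	})
	if err != nil {
		// A PDB that would be violated makes the API return 429 here;
		// otherwise the pod is deleted immediately, before any replacement exists.
		log.Printf("eviction blocked or failed: %v", err)
	}
}
```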
@jwcesign, are there any updates on the community review of that? Current behaviour breaks zero downtime on the pod moving between nodes. Karpenter should wait until new pods are marked as healthy before destroying old ones.
@indra0007
> No, that's not possible. The Kubernetes eviction API (which Karpenter uses) does not respect maxUnavailable = 0 and maxSurge = 1; it deletes the existing pod before a new one is created.
Hi,
I have a similar situation where I have an application with a single replica and I can't perform rebalancing without downtime. Are there any hacks or workarounds? For example, using preStop or terminationGracePeriodSeconds so that Kubernetes runs a new pod in parallel (handled by the deployment) while the old one is still alive for the time needed to start the new one.
I've also seen similar issues, but they mentioned custom controllers and handlers:
- https://github.com/aws/karpenter-provider-aws/issues/500
- https://github.com/aws/karpenter-provider-aws/issues/6086
We are also thinking of creating a custom controller to handle this, but we currently need an event to trigger the controller's logic, which is what this thread is about.
Once this is closed, we also plan to open-source it.
I am facing the same issue. Karpenter currently breaks zero-downtime, which is a big deal for my use-case.
Any updates on this issue?
I am having the same issue running Karpenter in a development environment with single-replica pods. Any updates?
/assign jmdeal
/triage needs-investigation
same issues here with karpenter v1.0.9
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale