karpenter-provider-aws

Interruption handling: handle "Rebalance Recommendation" events

Open maximethebault opened this issue 3 years ago • 18 comments

Tell us about your request

Be able to handle "Rebalance Recommendation" events. This should be configurable at the Provisioner level.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Gracefully shutting down apps running on Spot instances that are slow to terminate.

The 2-minute Spot interruption warning is not always enough, for example when an application is exposed via an NLB in IP target mode. Deregistration in this case is known to be painfully slow: https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1834

In this case, you need more time to shut your application down gracefully, and handling Rebalance Recommendation events can help prevent such situations.
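
For context, the usual application-level workaround (independent of Karpenter) is to give the pod extra shutdown headroom: a preStop sleep that keeps the pod serving while the NLB target deregisters, plus a terminationGracePeriodSeconds large enough to cover it. A rough sketch, with purely illustrative names and durations:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                              # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 180    # must cover the preStop sleep plus the app's own shutdown
      containers:
        - name: app
          image: my-app:latest              # placeholder image
          lifecycle:
            preStop:
              exec:
                # keep serving while the NLB finishes deregistering the pod IP;
                # SIGTERM is only sent once this hook returns
                command: ["sh", "-c", "sleep 120"]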

However, handling "Rebalance Recommendation" events is not always desired. This should be configurable at the Provisioner level (or finer-grained).

Are you currently working around this issue?

Using NTH, but the behavior cannot be configured at the Provisioner level.

Additional Context

https://kubernetes.slack.com/archives/C02SFFZSA2K/p1667808766541129

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

maximethebault avatar Nov 07 '22 18:11 maximethebault

I guess NTH is totally fine in this case. It correctly handles REBALANCE_RECOMMENDATION events in both DaemonSet (IMDS) and queue-processor modes.

sergeyshevch avatar Nov 10 '22 10:11 sergeyshevch

NTH is fine, but it was less clear what Karpenter should do with these events since there is no user-configurable API surface for rebalanceRecommendation events.

Since rebalance recommendations don't actually mean that your capacity is definitely going away, we need some kind of user-defined behavior for this kind of event.

@maximethebault It sounds like what you are proposing is an application-based dimensionality for handling rebalance recommendations, i.e. some applications might need to react to rebalance recommendations while others might not?

jonathan-innis avatar Nov 10 '22 18:11 jonathan-innis

That's perfectly summed up indeed.

There are some applications that either have long grace periods or are directly served by an NLB, for which the 2-minute Spot Interruption Warning is not enough => we want to handle rebalance recommendations to minimize the probability of downtime.

There are other applications (typically batch jobs) for which interruption is not an issue and acting on rebalance recommendations is usually counter-productive, e.g. when workloads need a "rare" instance type that easily gets a rebalance recommendation and that Karpenter is going to spawn again (leading to a kind of infinite loop where Karpenter repeatedly re-spawns the same instance type that NTH is draining).

So yeah, theoretically we would like to be able to configure this at the application level, but since a rebalance recommendation affects a whole node, it would make sense to move this configuration up to the Provisioner level.

maximethebault avatar Nov 11 '22 00:11 maximethebault

What if we handled rebalance recommendations if a node contains a pod with a >2-minute terminationGracePeriod?

If you need to separate workloads with differing grace periods, you can do so with scheduling preferences/requirements.
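
For example (illustrative names and labels, using the v1alpha5 Provisioner shape that exists today), the two classes of workloads could be split across Provisioners and targeted with a nodeSelector:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: graceful                  # illustrative: workloads that need a long shutdown
spec:
  labels:
    workload-class: graceful      # illustrative label
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
  # providerRef omitted for brevity
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: batch                     # illustrative: interruption-tolerant workloads
spec:
  labels:
    workload-class: batch
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
  # providerRef omitted for brevity

Pods that need the longer shutdown path would then set nodeSelector: workload-class: graceful, and batch pods would select workload-class: batch.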

ellistarn avatar Nov 11 '22 01:11 ellistarn

Not sure if taking a decision automatically based on the value of terminationGracePeriod would cover all use cases. There are numerous workarounds that rely on preStop hooks rather than terminationGracePeriod.

Agreed on workload separation, we currently use two separate Provisioners, one for each kind of workload.

maximethebault avatar Nov 11 '22 01:11 maximethebault

IIUC preStop hooks are encapsulated by the TerminationGracePeriod
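
i.e. the grace period is the total shutdown budget and the preStop hook runs inside it; a minimal sketch of the timing (durations illustrative):

spec:
  terminationGracePeriodSeconds: 150               # total shutdown budget
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 120"]     # runs first, inside the grace period
# the kubelet sends SIGTERM only after preStop returns, leaving the app
# roughly 30 seconds to shut down before SIGKILL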

ellistarn avatar Nov 14 '22 22:11 ellistarn

You're totally right! Automation could indeed be an option then.

maximethebault avatar Nov 15 '22 00:11 maximethebault

EKS Managed Node Groups handle rebalance recommendation events safely with respect to capacity: the new node is first joined to the cluster, and only then is the notified node removed. This behavior relies on EC2 Auto Scaling. Karpenter should handle rebalance recommendations in the same way.

Especially when a Pod Disruption Budget is configured, Karpenter has to retry terminating the node, so if many nodes receive rebalance events, node termination may take a long time. If the Spot node is reclaimed during that window, it leads to a service disruption. On the other hand, if a new node is started as soon as a rebalance event is received, the pods move to the new node more quickly.

literalice avatar Dec 13 '22 07:12 literalice

Just want to leave my 2 cents. My understanding is that right now, Karpenter does not handle rebalance recommendations. We would prefer that when it does, it is configurable (read: we can disable it), as rebalance recommendations would cause us issues in production. To give you an idea of the numbers (we run NTH to get Slack messages/metrics): yesterday in Frankfurt, our clusters received, but did not act on, 803 REBALANCE_RECOMMENDATION events, versus only 139 SPOT_ITN events. Worth noting that rebalance recommendations didn't use to be this frequent. FYI, we manually disable capacity rebalance on our underlying ASGs.

We are currently migrating from managed node groups + cluster autoscaler to Karpenter, and we're worried that when Karpenter starts acting on rebalance recommendations, we won't be able to turn it off and will have to disable the built-in interruption handling and go back to using NTH. :cry:

jtnz avatar Feb 24 '23 02:02 jtnz

@jtnz When we surface rebalance recommendations as an option for Karpenter to act on, we will absolutely make this configurable, so don't worry about Karpenter not having an opt-out mechanism 👍 .

jonathan-innis avatar Feb 24 '23 17:02 jonathan-innis

One question: does Karpenter launch a new node for rebalance recommendations if we have NTH in place? And if yes, does the new node come up first before NTH proceeds with the cordon+drain of the recommended node?

chavan-suraj avatar May 11 '23 13:05 chavan-suraj

@chavan-suraj Not at the moment. You can find all up-to-date information here: https://karpenter.sh/v0.27.3/concepts/deprovisioning/#interruption
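
For reference, the native interruption handling described there covers Spot interruption warnings, scheduled maintenance events, and instance terminating/stopping state changes, but not rebalance recommendations, and it is enabled by pointing Karpenter at an SQS interruption queue. The exact setting name varies by version; at the version linked above the Helm values look roughly like:

settings:
  aws:
    interruptionQueueName: Karpenter-my-cluster   # placeholder queue name; see the linked docs for your version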

njtran avatar May 11 '23 20:05 njtran

I'm currently facing an issue using NTH to manage rebalance recommendation events and Karpenter at the same time. Karpenter is trying to launch an m5.4xlarge instance, and as soon as it joins the cluster, NTH receives a rebalance recommendation and drains the node. After the NTH drain, Karpenter again tries to launch an m5.4xlarge, and the loop goes on. Do you have any tips on what I should do in that case? Also, do you have any plans or a roadmap for Karpenter to handle those rebalance recommendation events?

ltellesfl avatar Jun 26 '23 21:06 ltellesfl

Do you have any tips on what I should do in that case

Karpenter has an ICE (insufficient capacity exception) cache that keeps track of the offerings that receive interruption events, so that it doesn't try to launch the same instance type it just received an interruption event for. Because Karpenter isn't currently aware of rebalance recommendations, it doesn't update this ICE cache, which is why it relaunches with the same offering.

Unfortunately, there's no way today to make Karpenter aware of these rebalance notifications (RBNs) so that it doesn't launch the same offering again.
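
One workaround on the NTH side, if the relaunch loop is the bigger problem, is to stop NTH from acting on rebalance recommendations altogether. In the aws-node-termination-handler Helm chart this is controlled by values along these lines (names may differ slightly between chart versions):

enableRebalanceMonitoring: false   # don't cordon nodes on rebalance recommendations
enableRebalanceDraining: false     # don't cordon and drain nodes on rebalance recommendations

That trades the loop for giving up early draining, which is why this issue asks for the behavior to be configurable per Provisioner instead.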

jonathan-innis avatar Jul 10 '23 21:07 jonathan-innis

Is there any progress on this feature? We sometimes receive Spot interruptions for multiple instances at the same time and are not able to spin up enough new instances within 2 minutes, leading to some pods being shut down forcefully and ungracefully, which results in disruption. We'd like to handle rebalance recommendations to ease this issue.

dogzzdogzz avatar Nov 13 '23 09:11 dogzzdogzz

I have produced and tested a WIP PR with this feature. Linked above.

As mentioned in this issue, there should be a way to enable this on a per-NodePool basis. That will require updates to the CRDs, which are not included in the above change (but some sample code is provided). It would be good to start some conversations around what the API for this feature should look like.

My PR proposes the following, but I highly doubt that it is the best design.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    handleRebalanceRecommendations: true
...

hintofbasil avatar Jul 12 '24 15:07 hintofbasil

Any progress on this feature? We’re running into similar issues as others with workarounds not really working:

  • Like @dogzzdogzz, we sometimes experience multiple spot interruptions at once and can’t spin up replacements within 2 minutes, which leads to ungraceful pod shutdowns and service disruptions.
  • Like @ltellesfl, we’re also facing challenges when using NTH alongside Karpenter. NTH drains nodes immediately after a rebalance recommendation, but because it doesn’t share Karpenter’s scheduling cache, Karpenter still launches a replacement node of the same type. This causes the new node to be drained again, creating a loop.

Linking #7949. There was some discussion there on how to handle your first point, @HiphopopotamusRhymenoceros.

gomesdigital avatar Oct 29 '25 21:10 gomesdigital