
Node Repair

Open jbouricius opened this issue 2 years ago • 49 comments

Tell us about your request Allow a configurable expiration of NotReady nodes.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? I am observing some behavior in my cluster where occasionally nodes fail to join the cluster, due to some transient error in the kubelet bootstrapping process. These nodes stay in NotReady status. Karpenter continues to assign pods to these nodes, but the k8s scheduler won't schedule to them, leaving pods in limbo for extended periods of time. I would like to be able to configure Karpenter with a TTL for nodes that failed to become Ready. The existing configuration spec.provider.ttlSecondsUntilExpiration doesn't really work for my use case because it will terminate healthy nodes.

Are you currently working around this issue? Manually deleting stuck nodes.
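For reference, a minimal client-go sketch of that manual workaround (find nodes that have been NotReady longer than some cutoff and delete them) might look like the following. The 15-minute cutoff and the label selector are assumptions, not anything Karpenter ships, and deleting the Node object assumes Karpenter's termination finalizer will clean up the underlying instance:

```go
// Hedged sketch of the manual workaround: delete Karpenter-launched nodes that
// have been NotReady for longer than a cutoff. The 15-minute cutoff is an
// assumption chosen for illustration.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	cutoff := 15 * time.Minute
	// Restrict to nodes launched by a Karpenter Provisioner (v1alpha5 label).
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
		LabelSelector: "karpenter.sh/provisioner-name",
	})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		for _, cond := range n.Status.Conditions {
			// Only look at a Ready condition that is not True (NotReady or Unknown).
			if cond.Type != corev1.NodeReady || cond.Status == corev1.ConditionTrue {
				continue
			}
			if time.Since(cond.LastTransitionTime.Time) > cutoff {
				fmt.Printf("deleting stuck node %s\n", n.Name)
				if err := client.CoreV1().Nodes().Delete(context.TODO(), n.Name, metav1.DeleteOptions{}); err != nil {
					fmt.Printf("failed to delete %s: %v\n", n.Name, err)
				}
			}
		}
	}
}
```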

Additional context Not sure if this is useful context, but I observed this error on one such stuck node. From /var/log/userdata.log:

Job for sandbox-image.service failed because the control process exited with error code. See "systemctl status sandbox-image.service" and "journalctl -xe" for details.

and then systemctl status sandbox-image.service:

  sandbox-image.service - pull sandbox image defined in containerd config.toml
   Loaded: loaded (/etc/systemd/system/sandbox-image.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2022-06-28 18:47:42 UTC; 2h 9min ago
  Process: 4091 ExecStart=/etc/eks/containerd/pull-sandbox-image.sh (code=exited, status=2)
 Main PID: 4091 (code=exited, status=2)

From reading other issues, it looks like this AMI script failed, possibly in the call to ECR: https://github.com/awslabs/amazon-eks-ami/blob/master/files/pull-sandbox-image.sh

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

jbouricius avatar Jun 29 '22 20:06 jbouricius

related: https://kubernetes.slack.com/archives/C02SFFZSA2K/p1656533825820419?thread_ts=1656532759.059699&cid=C02SFFZSA2K

ellistarn avatar Jun 30 '22 03:06 ellistarn

We recently started using Karpenter for some batch jobs and are running into this as well, where nodes that get stuck in NotReady cause their pods to never get scheduled. In our case the underlying cause turned out to be full subnets, which kept the CNI pods from ever coming up, but regardless of the cause, big +1 to having a configurable way in Karpenter to ensure bad nodes get terminated automatically if they never come up within some TTL.

htoo97 avatar Jul 29 '22 02:07 htoo97

I'd love to get this prioritized. It should be straightforward to implement in the node controller.

ellistarn avatar Jul 29 '22 16:07 ellistarn

If it's not being worked on internally yet, I can take a stab at this!

htoo97 avatar Jul 29 '22 17:07 htoo97

Related:

  • https://github.com/aws/karpenter/issues/2570
  • https://github.com/aws/karpenter/issues/2544
  • https://github.com/aws/karpenter/issues/2489
  • https://github.com/aws/karpenter/issues/2439
  • https://github.com/aws/karpenter/issues/1415

ellistarn avatar Dec 19 '22 16:12 ellistarn

@tzneal makes a good point here that this auto-repair feature can potentially get out of hand if every provisioned node becomes NotReady, for example when bad userdata is configured at the Provisioner / NodeTemplate level.

This could possibly also be an issue with EC2 service outages.

Maybe you would have to implement some sort of exponential backoff per Provisioner to prevent this endless cycle of provisioning nodes that will always come up as NotReady.
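As a rough illustration of that idea (none of these names exist in Karpenter; this is just a sketch of what a per-Provisioner exponential backoff could look like, not the project's design):

```go
// Illustrative only: track consecutive failed/NotReady launches per Provisioner
// and back off exponentially before launching more capacity for it.
package backoff

import (
	"sync"
	"time"
)

type LaunchBackoff struct {
	mu       sync.Mutex
	failures map[string]int // consecutive failures keyed by Provisioner name
}

func New() *LaunchBackoff {
	return &LaunchBackoff{failures: map[string]int{}}
}

// NextDelay returns how long to wait before launching another node for the
// given Provisioner: 0 after a success, then 1m, 2m, 4m, ... capped at 30m.
func (b *LaunchBackoff) NextDelay(provisioner string) time.Duration {
	b.mu.Lock()
	defer b.mu.Unlock()
	n := b.failures[provisioner]
	if n == 0 {
		return 0
	}
	if n > 6 {
		n = 6 // bound the exponent; the 30m cap below does the rest
	}
	d := time.Duration(1<<uint(n-1)) * time.Minute
	if d > 30*time.Minute {
		d = 30 * time.Minute
	}
	return d
}

// RecordFailure is called when a launched node never becomes Ready.
func (b *LaunchBackoff) RecordFailure(provisioner string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures[provisioner]++
}

// RecordSuccess resets the backoff once a node joins and becomes Ready.
func (b *LaunchBackoff) RecordSuccess(provisioner string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.failures, provisioner)
}
```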

korenyoni avatar Jan 05 '23 13:01 korenyoni

:+1: We're occasionally seeing cases where a node has been launched but never properly initialized (it never gets karpenter.sh/initialized=true). Because these nodes are treated as capacity that is already arranged to become available, they can prevent cluster expansion and cause pods to be permanently stuck, constantly nominated by Karpenter to run on a node that will never complete initialization.

wkaczynski avatar Jan 16 '23 16:01 wkaczynski

@wkaczynski, it's a bit tricky.

  • If we delete nodes that fail to initialize, users will have a hard time debugging.
  • If we ignore nodes that fail to initialize, you can get runaway scaling.

ellistarn avatar Jan 17 '23 19:01 ellistarn

We currently have a support ticket open with AWS for occasional Bottlerocket boot failures on our Kubernetes nodes. The failure rate is very low, and it's important that we are able to get logs off a node and potentially take a snapshot of its volumes. In this scenario it's vital that we can opt out of Karpenter auto-removing the node. I'd be in favor of this at least being configurable so users can decide.

dschaaff avatar Jan 18 '23 00:01 dschaaff

@njtran re: the behaviors API.

ellistarn avatar Jan 18 '23 00:01 ellistarn

it's vital that we can opt out of Karpenter auto removing the node

If we delete nodes that fail to initialize, users will have a hard time debugging

I also think that if we do decide to delete nodes that failed to initialize, there should be an opt-out option so we can debug (or, if we don't delete by default, an opt-in option to enable the cleanup).

The cleanup does not even need to be a provisioner config - initially, until there is a better way to address this issue, it could be enabled via a Helm chart value and exposed as a ConfigMap setting, a command-line option, or an env var.

Another thing: are these nodes considered in-flight indefinitely? If so, is there currently at least an option to time out the in-flight status? If there isn't a way for these nodes to stop being considered in-flight, do I understand correctly that a one-off node initialization failure (which we sometimes experience with AWS) can effectively block cluster expansion?

If we ignore nodes the fail to initialize, you can get runaway scaling.

There are cases in which runaway scaling is preferable to a service interruption; it would be good to have an opt-in cleanup option.

wkaczynski avatar Jan 24 '23 13:01 wkaczynski

Another thing: are these nodes considered in-flight indefinitely?

I like to think about this as ttlAfterNotReady. @wkaczynski, do you think this is reasonable? You could repurpose the same mechanism to cover cases where nodes fail to connect, or eventually disconnect. We'd need to be careful to not kill nodes that have any pods on them (unless they're stuck deleting), since we don't want to burn down the fleet during an AZ outage. I'd also like this to fall under our maintenance windows controls as discussed in aws/karpenter#1738.
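A hypothetical sketch of that check, following the rules described above (the name ttlAfterNotReady and this function are not Karpenter's implementation; a node is only reclaimed if it has been NotReady past the TTL and carries no pods other than ones already terminating):

```go
// Hypothetical ttlAfterNotReady check, paraphrasing the comment above; not
// Karpenter's actual code.
package repair

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// ShouldReclaim returns true if the node has been NotReady for longer than
// ttlAfterNotReady and has no pods except ones that are already deleting.
func ShouldReclaim(node *corev1.Node, pods []*corev1.Pod, ttlAfterNotReady time.Duration) bool {
	var notReadySince time.Time
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady && c.Status != corev1.ConditionTrue {
			notReadySince = c.LastTransitionTime.Time
		}
	}
	if notReadySince.IsZero() || time.Since(notReadySince) < ttlAfterNotReady {
		return false
	}
	for _, p := range pods {
		// Any pod that is not already terminating keeps the node alive, so a
		// fleet of busy-but-NotReady nodes isn't burned down during an AZ outage.
		if p.DeletionTimestamp == nil {
			return false
		}
	}
	return true
}
```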

ellistarn avatar Jan 24 '23 18:01 ellistarn

We recently ran into an issue where an EC2 node failed to become "Ready". We reached out to the AWS support team and they mentioned it was an OS level issue where the instance failed to talk to IMDS.

The end result was a bunch of pods waiting to be scheduled because Karpenter thought the node would eventually become "Ready". It was stuck that way for 5 hours before we manually terminated the node.

jalaziz avatar Jan 30 '23 07:01 jalaziz

Another thing: are these nodes considered in-flight indefinitely?

I like to think about this as ttlAfterNotReady. @wkaczynski, do you think this is reasonable?

This seems reasonable, but I guess the case where nodes fail to initialize is more clear-cut and probably a lot easier to address - these nodes are generally safe to delete as there is nothing running on them yet.

For clusters with spiky workloads, the lack of any solution for nodes that fail to initialize (combined with treating them as in-flight) can be a regular cause of outages, as the cluster will not expand until the nodes are manually removed.

I understand that in the worst case - if the node startup issues are permanent (for instance due to misconfiguration):

  • if we just time out their in-flight status, we may end up with Karpenter spinning up more and more uninitialized nodes
  • if we delete nodes failing to initialize, we may end up in a create-destroy loop (but at least the uninitialized instances will not pile up)

You could repurpose the same mechanism to cover cases where nodes fail to connect, or eventually disconnect. We'd need to be careful to not kill nodes that have any pods on them (unless they're stuck deleting), since we don't want to burn down the fleet during an AZ outage.

It makes sense to (eventually) have a unified way of addressing this, but I fear that if we aim to start with the unified solution it will take a lot more time (the Ready -> NotReady cases are not as clear-cut), while the failed-initialization cases tend to occur more frequently (at least in our case, where we have high node churn).

wkaczynski avatar Jan 30 '23 10:01 wkaczynski

The current plan is that Node Ownership will reduce the occurrence of nodes not registering, and introducing ttlAfterNotRegistered in https://github.com/aws/karpenter-core/pull/191 will also help with the node repair issue. You can track work for Node Ownership and enabling ttlAfterNotRegistered in https://github.com/aws/karpenter-core/pull/176.

billrayburn avatar Feb 23 '23 20:02 billrayburn

Another option that might help with debugging would be to introduce an annotation for nodes that blocks Karpenter from doing anything with them. We've added that to our internal autoscaler, and I've used termination protection in the past to block the OSS autoscaler's actions so we could keep nodes around for debugging.
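A hypothetical version of that opt-out could be as simple as a controller-side annotation check; the annotation key below is invented for illustration and is not something Karpenter ships:

```go
// Hypothetical opt-out annotation so operators can keep a broken node around
// for debugging; the key is illustrative only.
package repair

import corev1 "k8s.io/api/core/v1"

const doNotRepairAnnotation = "example.com/karpenter-do-not-repair"

// OptedOut reports whether the node has asked to be skipped by any
// auto-repair / auto-removal logic.
func OptedOut(node *corev1.Node) bool {
	return node.Annotations[doNotRepairAnnotation] == "true"
}
```

An operator would then annotate the node (e.g. with kubectl annotate) before digging into logs or snapshotting volumes, and remove the annotation when done.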

sidewinder12s avatar Mar 04 '23 01:03 sidewinder12s

I would like to mention that aws/karpenter#3428 also brought up the topic of unhealthy nodes, not only during initialization but also during normal operation.

So far we have usually run our workloads via autoscaling groups and a LoadBalancer with some kind of health check, or via autoscaling groups that run a custom health-check script; if the check fails, the ASG terminates the instance and replaces it with a new node. I currently see no way for Karpenter to handle this, right?

DaspawnW avatar Mar 20 '23 16:03 DaspawnW

Can we already use this setting?

ttlAfterNotRegistered: 15m

We did not find it in the current karpenter documentation...

runningman84 avatar Mar 22 '23 14:03 runningman84

Can we already use this setting?

ttlAfterNotRegistered: 15m

We did not find it in the current karpenter documentation...

https://github.com/aws/karpenter-core/pull/176 was only merged on March 8th, a day after 0.27.0 was released.

So I think you need to wait for 0.28.0.

@jonathan-innis can you please confirm?

korenyoni avatar Mar 23 '23 09:03 korenyoni

For example, you can see that 0.27.0 still handles only the Node resource and not both v1alpha5.Machine and Node:

https://github.com/aws/karpenter-core/blob/7d58c3cee0fa997750d09a43b2037c69437857e3/pkg/controllers/deprovisioning/consolidation.go#L189-L273

Cross-reference it with the equivalent unreleased code:

https://github.com/aws/karpenter-core/blob/62fe4ac537b8e381bbb11bd344bb2f05850cb619/pkg/controllers/deprovisioning/consolidation.go#L106-L190

korenyoni avatar Mar 23 '23 09:03 korenyoni

Can we already use this settings?

ttlAfterNotRegistered: 15m

We did not find it in the current karpenter documentation...

You won't be able to use this setting yet. The release of this setting is tied to the release of *v1alpha5.Machine, which captures the work needed to fix a lot of the issues tied to Node Ownership. Once the changes that utilize *v1alpha5.Machine go in, this feature will be released and the documentation will be updated to show how you can use it. For now, it's mentioned in the core chart only as a preparatory measure ahead of the Machine release.

For now, we don't have an estimate on when this work will get done but it is being actively worked on and progress should be made here relatively soon. I'll make sure to update this thread when the feature gets formally released.

jonathan-innis avatar Mar 27 '23 16:03 jonathan-innis

Looks like the Machine Migration PR went in! Fantastic work!

Does that mean we will be able to use the new setting as soon as the next release of Karpenter?

maximethebault avatar Apr 29 '23 12:04 maximethebault

@maximethebault Yes! The new timeout mechanism will be 15m for a node that doesn't register to the cluster and should be in the next minor version release.

jonathan-innis avatar May 01 '23 05:05 jonathan-innis

We've also been discussing extending this mechanism more broadly to surface a ttlAfterNotReady value in the Provisioner, with a reasonable default, and then tearing down a Machine if it hasn't been ready for that period of time and is empty (according to Karpenter's definition of emptiness). This should solve a lot of the auto-repair issues that users are hitting, whether from nodes that never become initialized and ready, or from nodes that sit in a NotReady state for an extended period of time.

jonathan-innis avatar May 03 '23 17:05 jonathan-innis

We've also been discussing extending this mechanism more broadly to surface a ttlAfterNotReady value in the Provisioner, with a reasonable default, and then tearing down a Machine if it hasn't been ready for that period of time and is empty (according to Karpenter's definition of emptiness). This should solve a lot of the auto-repair issues that users are hitting, whether from nodes that never become initialized and ready, or from nodes that sit in a NotReady state for an extended period of time.

Sounds good!

What would happen in the scenario described in this issue though? As the pods are in a terminating state but still there, I suppose the node will not be considered empty?

maximethebault avatar May 04 '23 10:05 maximethebault

As the pods are in a terminating state but still there, I suppose the node will not be considered empty

If the pods are terminating, then they will be considered for emptiness. You can see our pod filtering logic here
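For readers who don't want to chase the link, the gist of that filtering (a simplified paraphrase; the real implementation lives in karpenter-core and may differ in detail) is that terminating pods, terminal pods, and daemonset-owned pods don't count against emptiness:

```go
// Simplified paraphrase of the kind of pod filtering described above; not the
// actual karpenter-core code.
package emptiness

import corev1 "k8s.io/api/core/v1"

// IsEmpty reports whether a node is "empty" for deprovisioning purposes:
// pods that are terminating, already finished, or owned by a DaemonSet do
// not keep the node alive.
func IsEmpty(pods []*corev1.Pod) bool {
	for _, p := range pods {
		if p.DeletionTimestamp != nil {
			continue // terminating pods don't block emptiness
		}
		if p.Status.Phase == corev1.PodSucceeded || p.Status.Phase == corev1.PodFailed {
			continue // terminal pods don't block emptiness
		}
		if ownedByDaemonSet(p) {
			continue // daemonset pods are expected on every node
		}
		return false
	}
	return true
}

func ownedByDaemonSet(p *corev1.Pod) bool {
	for _, ref := range p.OwnerReferences {
		if ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}
```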

jonathan-innis avatar May 08 '23 17:05 jonathan-innis

@jonathan-innis Does the newly released v0.28.0 actually contain the 15m timeout? It didn't seem to be mentioned anywhere.

marksumm avatar Jun 15 '23 08:06 marksumm

Hello there, we also hit this problem and want to know when the ttlAfterNotRegistered configuration will be ready.

gaussye avatar Jun 27 '23 07:06 gaussye

I also searched for this config, and it seems that this PR https://github.com/aws/karpenter-core/pull/250/files changed it to a hardcoded value of 15 minutes, which appears in https://github.com/aws/karpenter-core/blob/main/pkg/controllers/machine/lifecycle/liveness.go#L38 (fine for my use case, just commenting for others who are searching for this).
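A simplified paraphrase of that check (the actual code is in the linked liveness.go and may differ): a machine that hasn't registered within the hardcoded 15-minute TTL is deleted so a replacement can be launched.

```go
// Simplified paraphrase of the hardcoded registration liveness check linked
// above; not the actual karpenter-core code.
package liveness

import "time"

const registrationTTL = 15 * time.Minute

// RegistrationExpired reports whether a machine created at createdAt that has
// not yet registered with the cluster should be deleted and replaced.
func RegistrationExpired(createdAt time.Time, registered bool, now time.Time) bool {
	return !registered && now.Sub(createdAt) > registrationTTL
}
```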

nirroz93 avatar Jun 27 '23 08:06 nirroz93

@nirroz93 Thanks for pointing it out. If I am reading the change history correctly, then the default 15m timeout should already have been included since v0.27.1 of Karpenter (which uses the same version of karpenter-core). This is bad news for us, since the last example of a stuck node post-dates an upgrade to v0.27.3 by almost a month.

marksumm avatar Jun 27 '23 09:06 marksumm