Drift Detection Without Deprovisioning
Tell us about your request
I would like to be able to enable drift detection and the consequent node labeling, but disable deprovisioning based on the label.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
My organization periodically builds new AMIs (security patches, etc.) and after testing, tags them as ready for production. Karpenter picks up the new AMI automatically, which is great. What I don't want to have happen is all of my production clusters reprovisioning their workers simultaneously when that happens. You can imagine, for instance, if there was a problem with that AMI that I'd lose my redundancy between regions at the worst possible time. I need more control over when and how that deprovisioning happens.
What I'd like is to be able to easily see (with something like kubectl get nodes -l) which of my nodes, if any, have drifted, but handle the deprovisioning myself.
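Something like this, for example (illustrative only; the label name here is a placeholder for whatever marker Karpenter would actually apply):

```sh
# Illustrative: list only the nodes Karpenter has flagged as drifted,
# assuming a hypothetical label such as karpenter.sh/drifted=true.
kubectl get nodes -l karpenter.sh/drifted=true

# Or show the marker as a column for every node.
kubectl get nodes -L karpenter.sh/drifted
```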
Are you currently working around this issue?
When I know a new AMI is available, I capture the current set of nodes and deprovision them with a script. This ends up including nodes that haven't actually drifted.
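Roughly something like the sketch below (illustrative only; the karpenter.sh/provisioner-name selector and drain flags are just how I happen to identify and drain the worker nodes, not anything the feature requires):

```sh
# Rough sketch of the manual recycle: capture the nodes that exist right now,
# then cordon and drain them one at a time.
NODES=$(kubectl get nodes -l karpenter.sh/provisioner-name -o name)

for node in $NODES; do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  # wait here for replacement capacity and for workloads to settle before moving on
done
```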
Additional Context
No response
Attachments
No response
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Thanks for the feature request! Drifted nodes should be deprovisioned serially, and we don't deprovision a node if the needed replacement node doesn't initialize and become ready. I have a couple of questions so I can better understand your use case.
- Are you seeing something other than this? If not, is the deprovisioning process too quick?
- Is there a reason that you'd like to tag AMIs as ready for production but not want to see those changes rolled out to your instances?
- In addition, what do you consider the "worst time possible"? Are there specific times that you don't want deprovisioning?
Thanks for the quick response.
- Are you seeing something other than this? If not, is the deprovisioning process too quick?
I haven't actually tried it yet. I'm only just starting to switch to Karpenter from Cluster Autoscaler, and in my initial configuration I have not enabled the drift feature. I have a very conservative recycle script that waits to ensure that all pods are ready, not just that the new nodes are. My concern is that even though Karpenter will act serially and ensure that the nodes are ready, it won't have any context about whether the applications are healthy beyond the restrictions from pod disruption budgets. Without checking pods, you can end up with a bunch of impaired applications that only have the minimum number of pods permitted by their PDB and no ability to scale up. I've had this happen in the past when recycling for new AMIs (not with Karpenter), and adding the pod readiness check to our recycle script has largely immunized us against that problem.
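For what it's worth, the readiness gate in that script boils down to something like this (a sketch only; the real version scopes the check to specific namespaces and tolerates pods that are never expected to become Ready):

```sh
# Sketch of the "wait until everything is ready again" gate run between node drains.
# Loops while any Running pod reports a Ready condition of False.
while kubectl get pods --all-namespaces --field-selector=status.phase=Running \
    -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
    | grep -q False; do
  echo "waiting for all running pods to become ready..."
  sleep 10
done
```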
- Is there a reason that you'd like to tag AMIs as ready for production but not want to see those changes rolled out to your instances?
I do want those changes rolled out, but only one cluster at a time, as conservatively as possible, and preferably during my team's working hours. I realize that newly spun-up capacity will use the new AMI, and that's fine. If it turns out that there's a defect in that AMI, I'd much rather find that out without having deliberately deprovisioned the working nodes while also lacking another working cluster to fall back to (see the next question).
- In addition, what do you consider the "worst time possible"? Are there specific times that you don't want deprovisioning?
I run multiple EKS clusters across regions, mostly running identical workloads and sharing the load / supporting failover. If something goes horribly wrong while recycling the cluster in us-west, the cluster in us-east can take over for all of the traffic. But if I'm recycling all of the clusters at the same time, and there's something wrong with that AMI, I'll have nowhere to send the traffic. So by "worst time possible" in this case I mean recycling a cluster without having the other cluster pristine for failover purposes.
The other scheduling element is that it's possible that the new AMI will be tagged by a team in another time zone and the recycles will kick off in the middle of the night for my team, and again if something goes wrong that will be very inconvenient.
So one workaround I can do for the scheduling is to keep the drift feature disabled until an operator is available to monitor the deprovisioning, redeploy Karpenter to one cluster with it enabled, wait for the process to complete, redeploy with it disabled again, and repeat for the other cluster. This solves the scheduling problem, but it doesn't allow me to use my more conservative script that waits for pods to become ready before moving onto the next node. It also retains the problem of not knowing which nodes actually need to be recycled, since I don't have the convenient drifted label as a selector.
Same here... I have already done some EKS cluster upgrades and I think that Karpenter replaces the nodes too fast. E.g. if some pods take longer to start, this can lead to a short outage, even though the pods are distributed across different nodes by affinities or topologies.
It would be nice to have more control over the drift detection, e.g. a parameter for sleeps between replacements, etc.
This seems like a reasonable ask. We've discussed in the past that we could mark a node with karpenter.sh/voluntary-disruption=drifted | expired | underutilized as soon as we detect it, and then have a secondary algorithm that processes the disrupted nodes in a safe way.
I think this should be covered by https://github.com/aws/karpenter-core/issues/753. If a machine is blocked from scaling down, we should still add the label when we detect the disruption. Similarly, if we detect that the disruption is no longer applicable (e.g. the amiselector was rolled back), we'd simply remove the label. These labels might also help with the visibility asked for here: https://github.com/aws/karpenter/issues/3638.
@njtran, @jonathan-innis WDYT?
Also -- just brainstorming here, but I think that might enable us to collapse our multi-node-consolidation analysis to be multi-node-voluntary-disruption.
Sorry for the late response on this; it must've slipped through my notifications. As a whole, I'd rather not have drift detection without deprovisioning built directly into the feature, as the feature is meant to ensure that Karpenter's nodes are deterministic representations of their desired configuration. I do think that aws/karpenter-core#753 would be a more robust way to ensure that users who have multiple clusters can control deprovisioning without thinking too hard about architecting their rollouts. I'm considering closing this in favor of aws/karpenter-core#753, as it's a time-based control we can implement. Thoughts? @SapientGuardian
There are a few different implementations discussed in aws/karpenter-core#753.
If you're referring to the idea in https://github.com/aws/karpenter-core/issues/753 of preventing all nodes from deprovisioning until a certain time, irrespective of drift detection (if I understood that correctly), that would be prohibitively expensive for very elastic workloads. We want autoscaling to work; we just don't want nodes that are in use to be automatically killed off because they have a stale AMI.
If you're referring to the idea of setting a time window in which drift detection and recycling would occur, that certainly solves the problem of all clusters deprovisioning at the same time. It doesn't address my concerns about how the deprovisioning occurs, though.
If supporting this mark-only approach, it'll be much more useful as a label (which allows use of a watch to observe newly marked nodes) than as an annotation.
I like the idea of being able to make my own controller that detects expired nodes and then removes them using a canary-like approach (remove a fraction, observe the outcome, proceed only if things look healthy). Putting all of that into one thing, Karpenter, would feel like too much complexity.
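For instance, with a label an external controller barely needs any machinery to notice newly marked nodes (illustrative; karpenter.sh/voluntary-disruption is the marker proposed above, not something Karpenter ships today):

```sh
# Illustrative: react to nodes as soon as they are marked, assuming the proposed
# karpenter.sh/voluntary-disruption label existed. A real controller would use a
# watch/informer; this is the kubectl equivalent.
kubectl get nodes -l karpenter.sh/voluntary-disruption=drifted --watch -o name
```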
If supporting this mark-only approach, it'll be much more useful as a label (which allows use of a watch to observe newly marked nodes) than as an annotation.
@sftim, we currently just periodically query the cluster for deprovisionable nodes (in this case, the ones with the disruption annotation), as we can't always act immediately on a node that needs deprovisioning (since the deprovisioning process is serial at the moment).
At the moment for aws/karpenter-core#753, we're thinking of including a nodeSelector, so you would be able to select, by label, the nodes you don't want deprovisioned, along with the list of reasons for which you don't want to see your nodes deprovisioned and the crontabs that describe when you don't want this deprovisioning to occur. We're also thinking of adding a sort of "DisruptionBudget", which could translate to "on weekdays during critical hours, for drift, for my team-label nodes, only allow 3 nodes to be deprovisioned at once".
@SapientGuardian, let me know if this fits your concerns of "how the deprovisioning occurs"
If I'm understanding correctly, I could create a DisruptionBudget configured to disallow deprovisioning from drift (I assume drift would be among the "reasons you wouldn't want to see your nodes deprovisioned") during certain hours while still permitting deprovisioning from consolidation, and of course I could vary these hours by cluster. That addresses the problem of all clusters killing off nodes at the same time (the "when" problem), but I don't think it addresses the "how" problem. Again, my concern is that Karpenter does not perform all of the same safety checks that my custom deprovisioning does (nor would I expect it to). To address the "how", I need Karpenter to indicate what should be deprovisioned for drift, but let me do the work.
Yes, we're also thinking that it may not make much sense to allow deprovisioning for the budget for different reasons. I have two questions:
- (I do realize this isn't what you opened the issue for, but) What's your use case for allowing more disruption for Drift, and less disruption for Consolidation? We're thinking that Karpenter shouldn't have to be aware of an enumerated action to know what to do for a node, and that users should drive this from their application requirements.
- It sounds like your safety concern about pod readiness is the main issue here for deprovisioning. I do agree that Karpenter probably can't take all of your safety concerns and build them into the deprovisioning process, but is it purely just waiting for pods to become ready? Is there any assumption that Karpenter makes in the deprovisioning process that we can improve?
Otherwise, I think it'd be an interesting use case to turn off the deprovisioning controller in favor of a user's own controller, but I haven't heard much demand beyond your use case for implementing this.
What's your use case for allowing more disruption for Drift, and less disruption for Consolidation?
Sorry, that's not what I meant to suggest. Consolidation should always be good, I would just be limiting drift to specific windows in that scenario.
is it purely just waiting for pods to become ready? Is there any assumption that Karpenter makes in the deprovisioning process that we can improve?
In my current process, that's really all I'm doing extra. I can imagine that's not a usable strategy for many clusters (e.g. multi-tenant clusters where you can't go in and fix workloads, or clusters where keeping non-ready pods around is normal), and even on my clusters it can introduce unnecessary delay (e.g. when new pods are spun up at the same time as a node recycle), which is why I haven't suggested that Karpenter implement it.
Really it's trying to address the question of "Did the pods that were ready before I evicted the node successfully become ready on the nodes to which they were rescheduled?" If Karpenter has a better way of doing that, e.g. tracking a pod from one node to another (which I imagine is tricky because the new pod is spawned with a different name), I could see relying on that instead of my own deprovisioning process.
Got it, that makes sense. Since I haven't heard other similar requests, I'm unsure of how likely it is for us to implement this in the near future. If you're looking to unblock yourself and are willing to contribute, please join the karpenter-dev slack channel! We'd love to discuss more on how this could be done.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
Hello, I have the exact same issue.
My use case is ephemeral pods that can run for a full week and should not be killed (they're labeled karpenter.sh/do-not-disrupt).
Sadly, Karpenter kills these pods because of drift, but it shouldn't, even if there is a new AMI and even if the K8s version no longer matches.
EDIT: I added a pod disruption budget; I hope that will be enough for Karpenter to not kill my pods :pray:
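For reference, the budget I added amounts to something like this (the name, selector, and threshold are placeholders, not my actual configuration):

```sh
# Illustrative only: block voluntary evictions of the long-running pods with a PDB.
kubectl create poddisruptionbudget long-running-jobs \
  --selector=app=my-ephemeral-job \
  --min-available=100%
```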
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".