
docs: RFC for disruption.terminationGracePeriod feature

Open wmgroot opened this pull request 1 year ago • 46 comments

Design for #743. Implementation PR: https://github.com/kubernetes-sigs/karpenter/pull/916

Description: RFC for support of disruption.forceDrainAfter (nodeDrainTimeout)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

wmgroot avatar Dec 01 '23 08:12 wmgroot

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: wmgroot (f66a230e183ee809f88585f493e63b37602c8c61)

Welcome @wmgroot!

It looks like this is your first PR to kubernetes-sigs/karpenter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/karpenter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

k8s-ci-robot avatar Dec 01 '23 08:12 k8s-ci-robot

Hi @wmgroot. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 01 '23 08:12 k8s-ci-robot

/test ls

jackfrancis avatar Dec 01 '23 21:12 jackfrancis

@jackfrancis: The specified target(s) for /test were not found. The following commands are available to trigger optional jobs:

  • /test pull-karpenter-test-1-26
  • /test pull-karpenter-test-1-27
  • /test pull-karpenter-test-1-28
  • /test pull-karpenter-test-1-29

Use /test all to run all jobs.

In response to this:

/test ls

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 01 '23 21:12 k8s-ci-robot

/test pull-karpenter-test-1-26

jackfrancis avatar Dec 01 '23 21:12 jackfrancis

/test pull-karpenter-test-1-29

jackfrancis avatar Dec 01 '23 21:12 jackfrancis

@wmgroot: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-karpenter-test-1-26 | 0ab6f4106cecc6b36bef89445fd7306ba3f670d8 | link | false | /test pull-karpenter-test-1-26 |
| pull-karpenter-test-1-29 | 0ab6f4106cecc6b36bef89445fd7306ba3f670d8 | link | false | /test pull-karpenter-test-1-29 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot avatar Dec 01 '23 21:12 k8s-ci-robot

@wmgroot thank you for letting me pollute your PR thread w/ some prow tests, just testing out some recent additions, nothing for you to worry about here :)

jackfrancis avatar Dec 01 '23 21:12 jackfrancis

Pull Request Test Coverage Report for Build 10116371278

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.02%) to 78.097%

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/controllers/provisioning/scheduling/nodeclaim.go | 2 | 89.13% |
| Total: | 2 | |

Totals Coverage Status

  • Change from base Build 10101395280: -0.02%
  • Covered Lines: 8850
  • Relevant Lines: 11332

💛 - Coveralls

coveralls avatar Dec 10 '23 05:12 coveralls

Overall looks good to me

garvinp-stripe avatar Dec 13 '23 21:12 garvinp-stripe

Following up on https://github.com/kubernetes-sigs/karpenter/pull/834#discussion_r1438324591

I can see the benefit of a NodePool-level setting to define grace behavior. I would have an enum with 2 values defined:

  • RespectNodeAndPodGracePeriods
  • TreatNodeClaimGracePeriodAsDefinitive

These are wordy but hard to misunderstand. Then, Karpenter uses this setting to influence what gracePeriodSeconds to specify for a DELETE of the Node, and also uses that for its internal drain-and-finalize behavior.

Cluster operators can serve different developer needs by having different node pools with different settings. If you set RespectNodeAndPodGracePeriods, you're allowing anything that can get a pod admitted to delay a node-level operation, and maybe that's OK. Separate mechanisms can limit which namespaces get to use which NodePools.

Whilst the NodePool API is still beta, we could limit the enum to just one value, and have a plan to implement the other enum option ahead of v1 / general availability.

To help people avoid scheduling Pods to nodes that will enforce termination before their app might be ready for that, I think we should advise them to prefer (or require) specific NodePools using nodeAffinity, if that outcome is important to them.
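For illustration only, pinning a Pod to a specific NodePool via nodeAffinity could look like the sketch below. The long-drain NodePool name is made up; karpenter.sh/nodepool is the label Karpenter applies to the nodes it provisions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: stateful-app
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              # Require nodes provisioned by a NodePool whose grace behavior
              # this workload is prepared for.
              - key: karpenter.sh/nodepool
                operator: In
                values:
                  - long-drain
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
```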

sftim avatar Dec 29 '23 17:12 sftim

Adding a NoExecute taint at the end of the terminationGracePeriod may be a better k8s-native way to handle this logic too.

Mmm, yes. I would implement that by defining a field on NodeClaim to specify the duration between:

  • the point in time at which this taint appears
  • the point at which Karpenter will enforce shutdown / termination

Then, you put that field in your NodePool's template. I think for this one we could have a default / heuristic behavior: the Pod is already supposed to have drained, and any pod that wants to tolerate that taint can specify how long it'd like to tolerate it for.

Does this mean two taints, then?

  • one taint to mean “Node is draining for termination, don't schedule here”
  • one taint to mean “Node is draining for termination, stop executing now”

or the same taint twice (NoSchedule and NoExecute)?

This also has potential to tie in to Declarative Node Maintenance if the KEP moves forward. If the declarative node maintenance API were available in a cluster, I think Karpenter should discover and use it (suppressing its built-in drain logic).

sftim avatar Dec 29 '23 17:12 sftim

I can see the benefit of a NodePool-level setting to define grace behavior. I would have an enum with 2 values defined:

Rather than an enum with implementation options, could we add a new separate timeout field later for the RespectNodeAndPodGracePeriods behavior, such as NodeClaim.spec.softTerminationGracePeriod?

I'm not sure I've seen any Karpenter users requesting this style of behavior, and I'm hesitant to complicate the design of this feature for a future addition that isn't something we think we're likely to need.

wmgroot avatar Dec 29 '23 21:12 wmgroot

I can see the benefit of a NodePool-level setting to define grace behavior

I'd like to avoid opening up too many options here and just enforce a single behavior that achieves the persona isolation and forceful drain logic that we think makes the most semantic sense. I'd argue that it's on the cluster admin who is configuring the terminationGracePeriod to ensure that this value is above the terminationGracePeriodSeconds of the pods it manages, OR that there is enough coordination between users and cluster admins to ensure that stateful pods aren't forcefully dropped before they can safely drain. I don't think we should give users tools to shoot themselves in the foot if they can avoid it, but we should also give users enough knobs to accept the risk of what they are doing and do what they need to do to fulfill their requirements. Practically, this terminationGracePeriod should be set high enough by cluster admins that, if it's ever reached, it's nearly certain that something has been misconfigured and is blocking the node from terminating.

Does this mean two taints, then

I think the way we are currently thinking about handling this (we could consider making this more cohesive with the Declarative Node Maintenance KEP going forward) would be to add the karpenter.sh/disruption=disrupting:NoSchedule taint like we currently do when we are actively disrupting the node (this is similar to the cordoning logic that currently happens during a standard node drain, except it allows us to drain DaemonSets as well). Then, once we hit our node terminationGracePeriod, we add the [out-of-service] taint, which already has force logic built into the control plane for rescheduling pods and detaching volumes based on NoExecute. Once we add this taint, we should respect any tolerationSeconds that's specified, including waiting for all pods to be removed from the node that either 1) don't tolerate the NoExecute taint or 2) tolerate the NoExecute taint, but only for a specified tolerationSeconds. After all of these pods are removed, and we are only left with pods that tolerate the taint, we force delete the Node.

This allows some nice flexibility since the default behavior just terminates the node in a forceful manner once it exceeds the grace period, but still allows standard taint toleration flexibility around NoExecute. This also (IMO) makes the most semantic sense since it will forcefully start draining the node past its grace period (though it will still allow one-off exceptions to tolerate taints a bit longer with tolerationSeconds).

The one thing we need to think about here, though: does this still provide the same persona-based isolation that this feature is trying to enforce? Respecting the tolerationSeconds of a pod toleration for out-of-service still means there needs to be some policy enforcement to ensure that users aren't setting this to an unnecessarily high value, or to ensure that they aren't tolerating this taint at all and only cluster admins can selectively tolerate it.
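To make the tolerationSeconds piece concrete, a rough sketch of a pod that buys itself a bounded extension past the grace period; the 300s value and the nodeshutdown taint value are illustrative (the value mirrors the upstream non-graceful node shutdown docs), not part of the proposal:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slow-cleanup-app
spec:
  tolerations:
    # Tolerate the out-of-service NoExecute taint for a bounded window after
    # it is applied; after 300s the taint manager evicts the pod.
    - key: node.kubernetes.io/out-of-service
      operator: Equal
      value: nodeshutdown
      effect: NoExecute
      tolerationSeconds: 300
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
```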

jonathan-innis avatar Jan 01 '24 20:01 jonathan-innis

I presume that any Pod that tries to tolerate everything gets shut down, ordered based on its priority, at the last possible moment (at least on Linux). That outcome matches our understanding of the intent of whoever defined that Pod. It might be a surprise to the app, but if people wanted more notice, they shouldn't tolerate all the taints that could have given them that extra notice.

sftim avatar Jan 02 '24 09:01 sftim

@wmgroot @sftim I was thinking more about the ordered drain problem with this force drain operation, and it seems like we can collapse the behavior that we want here with the force drain logic into the following tenets:

  1. A cluster admin can set terminationGracePeriod on their NodeClaims; this value is the maximum time that the NodeClaim can be alive after the deletionTimestamp is set. This effectively means the maximum amount of time that the Node is draining, though due to eventual consistency and processing there may be 1-2s of wiggle room here.
  2. Karpenter will drain pods after the deletion timestamp is set, using a requeue mechanism and the following batching:
    i. Any pod whose terminationGracePeriodSeconds is >= the remaining time left in the terminationGracePeriod of the NodeClaim gets force drained. We won't call the eviction API in this case, since once this condition holds we don't care about blocking PDBs; we just want to give the pod the maximum amount of time to spin down.
    ii. Non-critical non-DaemonSet pods are added to the set of evictable pods in this eviction loop
    iii. If there are no pods in ii, add non-critical DaemonSet pods to the set of evictable pods in this eviction loop
    iv. If there are no pods in iii, add critical non-DaemonSet pods to the set of evictable pods in the eviction loop
    v. If there are no pods in iv, add critical DaemonSet pods to the set of evictable pods in the eviction loop

Once there are no more pods, or the terminationGracePeriod of the NodeClaim is exhausted, we add the non-graceful termination taint to the Node. If any pods remain that tolerate this taint only for a bounded tolerationSeconds, we wait for those pods to be removed and terminated off of the node. Once only pods that tolerate this NoExecute taint indefinitely remain, we terminate the instance and remove the finalizer from the NodeClaim.
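As a sketch of the end state this describes (taint keys as named above; the nodeshutdown value mirrors the upstream non-graceful shutdown docs and is illustrative), the Node would carry something like:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: example-node
spec:
  taints:
    # Applied when Karpenter begins disrupting/draining the node.
    - key: karpenter.sh/disruption
      value: disrupting
      effect: NoSchedule
    # Applied once the NodeClaim's terminationGracePeriod is exhausted; the
    # control plane's NoExecute handling then evicts any remaining pods that
    # don't tolerate this taint (or tolerate it only for a bounded time).
    - key: node.kubernetes.io/out-of-service
      value: nodeshutdown
      effect: NoExecute
```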

jonathan-innis avatar Jan 15 '24 20:01 jonathan-innis

+1 to this flow @jonathan-innis, but as an addendum I think we need to shift all of this to before the deletion timestamp is placed on the NodeClaim. Deleting the NodeClaim is part of the CRD removal cascade-delete flow, which should clean up owned resources without blocking on draining. This is essential to support uninstall scenarios where alternative capacity is not available.

ellistarn avatar Jan 16 '24 01:01 ellistarn

@wmgroot @sftim I was thinking more about the ordered drain problem with this force drain operation, and it seems like we can collapse the behavior that we want here with the force drain logic into the following tenets: [full proposal quoted above]

I think anything smarter than just adding the official k8s taint can be added in a further update to this behavior if a need is determined for it. The PR I have open is dead simple and works in my testing. I think any kind of special handling of the order of force deletes on pods prior to setting the taint may be over-engineering.

Testing what happens when a pod tolerates all taints, to ensure it doesn't block node deletion, is worthwhile because that's a very common possibility.
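For reference, the tolerate-everything case is as simple as the sketch below; an empty key with operator: Exists matches every taint:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerates-everything
spec:
  tolerations:
    # An empty key with operator Exists matches every taint, including
    # NoExecute ones, with no time bound unless tolerationSeconds is set.
    - operator: Exists
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
```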

wmgroot avatar Jan 16 '24 04:01 wmgroot

but as an addendum I think we need to shift this all before the deletion timestamp is placed on the node claim

This one I'm less sure about. I agree that we need a mechanism to force the deletion of all NodeClaims for uninstall scenarios, but I'm not sure that removing the finalizer is the right way to go about it. I could likewise see orchestrating an update that sets the terminationGracePeriod of all NodeClaims to 0s and then performs the NodeClaim deletion, as another way to delete the entire CRD without hanging. It feels strange to me that we would be adding a terminationGracePeriod for the NodeClaim but then not actually be performing a deletion of the NodeClaim until after the grace period. This is counter to how grace period works elsewhere in K8s.

jonathan-innis avatar Jan 16 '24 05:01 jonathan-innis

I think anything smarter than just adding the official k8s taint can be added in a further update to this behavior if a need is determined for it

I disagree here. A lot of the ordering logic described above is already handled by the current termination controller. The only difference really being proposed with this change is to ensure that pods receive an appropriate heads-up before the terminationGracePeriod is met and the pods are force-drained due to the non-graceful node shutdown taint. If we don't add the ability to terminate pods while respecting their grace period, pods may never receive an eviction call due to fully blocking PDBs and may be interrupted with no time to clean up.

jonathan-innis avatar Jan 16 '24 05:01 jonathan-innis

I think anything smarter than just adding the official k8s taint can be added in a further update to this behavior if a need is determined for it

I disagree here. A lot of the ordering logic described above is already handled by the current termination controller. The only difference really being proposed with this change is to ensure that pods receive an appropriate heads-up before the terminationGracePeriod is met and the pods are force-drained due to the non-graceful node shutdown taint. If we don't add the ability to terminate pods while respecting their grace period, pods may never receive an eviction call due to fully blocking PDBs and may be interrupted with no time to clean up.

I think the extra eviction behavior you're asking for would be better implemented as a general enhancement to this feature instead of being limited to Karpenter behavior. https://kubernetes.io/docs/concepts/architecture/nodes/#non-graceful-node-shutdown

wmgroot avatar Jan 16 '24 05:01 wmgroot

I think the extra eviction behavior you're asking for would be better implemented as a general enhancement to this feature instead of being limited to Karpenter behavior

How would that function with the out-of-service taint? The pods need to be aware that they need to be removed before the out-of-service taint is added in this case. I agree that we shouldn't over-engineer the solution here but I also want to be cautious that pods with blocking PDBs that are force terminated still get the requisite time that they have requested to do their clean-up if we have the chance to be smarter about their termination behavior.

jonathan-innis avatar Jan 16 '24 06:01 jonathan-innis

How would that function with the out-of-service taint? The pods need to be aware that they need to be removed before the out-of-service taint is added in this case. I agree that we shouldn't over-engineer the solution here but I also want to be cautious that pods with blocking PDBs that are force terminated still get the requisite time that they have requested to do their clean-up if we have the chance to be smarter about their termination behavior.

How long is the "requisite time" that a pod needs to do cleanup? terminationGracePeriodSeconds is user-controlled and could be set to hours or days.

wmgroot avatar Jan 16 '24 06:01 wmgroot

How long is the "requisite time" that a pod needs to do cleanup? terminationGracePeriodSeconds is user-controlled and could be set to hours or days

Right, so the time that the pod gets to drain is equal to the following equation:

podTerminationTime = min(pod.terminationGracePeriodSeconds, nodeclaim.terminationGracePeriod)

This ensures that if the NodeClaim terminationGracePeriod is ever less than the pod's termination grace period, those pods will get force drained and start terminating immediately, since they have requested such a long grace period. This means they might not get their full grace period, but it ensures that the NodeClaim is fully deleted within the time the cluster admin asked for.

jonathan-innis avatar Jan 16 '24 06:01 jonathan-innis

Isn't that already how it works without any extra logic that references the pods' terminationGracePeriodSeconds?

When a Node or NodeClaim is deleted, it starts a drain on the node, which attempts to evict each pod and puts it into Terminating status.

A. If the pod's terminationGracePeriodSeconds is less than the NodeClaim's terminationGracePeriod, it will be forcibly terminated before the out-of-service taint is applied.
B. If it's greater (and the pod doesn't terminate until it reaches its maximum tGPS), the out-of-service taint will be applied first, forcibly removing the pod before it can reach its tGPS, but still giving the pod the full NodeClaim tGP to terminate.

I'm not understanding how what you're proposing differs from this behavior.

wmgroot avatar Jan 16 '24 06:01 wmgroot

I could likewise see that you orchestrate an update that sets the terminationGracePeriod of all NodeClaims to 0s

Giving/requiring installers to have write permissions to NodeClaims is likely a nonstarter. Alternatively, you could have the finalizer attempt to introspect the CRD to check if it's a cascading delete, but this is another permissions boundary issue. I don't really see a good alternative to cascading delete.

Further, triggering cordon/drain via a taint, as @wmgroot states, has precedent in k8s, though I think we have flexibility on whether or not we must use the upstream taint. I could see an argument that it doesn't meet our needs, but I would need to dig in a bit.

What's your concern with using taints as the driver of cordon/drain?

ellistarn avatar Jan 16 '24 07:01 ellistarn

I think the extra eviction behavior you're asking for would be better implemented as a general enhancement to this feature instead of being limited to Karpenter behavior.

I can see this as a long term approach, but I don't think we necessarily want to block on upstreaming in the interim.

ellistarn avatar Jan 16 '24 07:01 ellistarn

Another thing I'd like to discuss (maybe I'm missing some discussion before this): where did we land on the configuration surface?

My thought is that we shouldn't make this driftable for nodes, even if that's how deployments and pods work. Pod restarts are inherently much less expensive than node starts, and I think this is the place to draw the line. I think the nodePool.spec.disruption block is the best place for terminationGracePeriodDuration to live; we can infer the value from the NodePool's setting and punt on adding this as a field on the NodeClaim itself.
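To make that concrete, a rough sketch of the proposed surface using the field name and placement suggested here (not a final API; both were still under discussion). Everything under disruption except the new field already exists in v1beta1; only terminationGracePeriodDuration is hypothetical:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
    # Proposed field (name and placement per this comment, not final): the
    # maximum time a node may spend draining once disruption begins, before
    # Karpenter terminates it forcefully.
    terminationGracePeriodDuration: 48h
  # template and other required fields omitted for brevity
```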

njtran avatar Jan 16 '24 23:01 njtran

we can infer from the NodePool's setting what the value should be here, and punt on adding this as a field into the nodeclaim itself

After some discussion with @njtran, I'm +1 on this proposal. We don't have a ton of utility for standalone NodeClaims today, and we can always add this as an additional field on NodeClaims down the line.

jonathan-innis avatar Jan 17 '24 05:01 jonathan-innis