cluster-api
Support for DaemonSet eviction when draining nodes
(I'm not sure if this feature request is large enough to require the CAEP process. If it is please let me know.)
User Story
As a user, I would like some mechanism to have my DaemonSet pods gracefully terminated when draining nodes for deletion, so that those pods can complete their shutdown process.
Detailed Description
Currently Cluster API uses the standard kubectl drain behavior, ignoring all DaemonSets (link). I would like some way for my DaemonSet pods to also be gracefully terminated as part of the node deletion process.
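For context, this is roughly what that drain behavior looks like when driven from Go. A minimal sketch using the k8s.io/kubectl/pkg/drain helper that kubectl uses (the field names here come from my reading of that package, not from Cluster API's own code); IgnoreAllDaemonSets is what skips DaemonSet pods:

```go
package draindemo

import (
	"context"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// drainNode cordons and drains a node the way `kubectl drain` does.
// With IgnoreAllDaemonSets set, DaemonSet-managed pods are skipped entirely,
// so they never get a graceful termination as part of the drain.
func drainNode(ctx context.Context, client kubernetes.Interface, node *corev1.Node) error {
	helper := &drain.Helper{
		Ctx:                 ctx,
		Client:              client,
		IgnoreAllDaemonSets: true, // DaemonSet pods are left untouched
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1, // use each pod's own terminationGracePeriodSeconds
		Timeout:             10 * time.Minute,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		return err
	}
	return drain.RunNodeDrain(helper, node.Name)
}
```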
Anything else you would like to add:
While investigating whether this is currently possible I saw that Cluster Autoscaler provides a mechanism to control DaemonSet draining. I'm planning to make use of this in the interim, but it would be nice to also have the draining happen when nodes are not drained by Cluster Autoscaler (e.g. for cluster upgrades, etc.).
I also looked into the graceful node shutdown feature, but in my case the pod drain time is quite long (it could be 30 minutes or more) and I'm not sure the feature would work for such long termination times, especially on EC2. I don't think EC2 will let you stall instance termination for that long. It's hard to find documentation on how long an EC2 instance can inhibit shutdown, but I did see this saying 10 minutes is typically the maximum.
The other thing I saw while investigating is that Cluster API machine deletion has a pre-terminate hook. It seems like it might be possible to implement DaemonSet pod eviction with a custom Hook Implementing Controller (HIC). Is that the preferred way to implement something like this? If so, I can close this feature request and look into writing the HIC.
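To make the HIC idea concrete, here is a rough sketch of what such a controller might do. The pre-terminate annotation prefix is the one from Cluster API's machine deletion lifecycle hooks; everything else (the "daemonset-drain" hook name and the reconcileMachine / evictDaemonSetPods helpers) is hypothetical and just for illustration:

```go
package hicsketch

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// hookAnnotation uses the pre-terminate hook prefix defined by Cluster API's
// machine deletion lifecycle hooks; "daemonset-drain" is a made-up hook name.
const hookAnnotation = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/daemonset-drain"

// reconcileMachine holds up machine termination until the DaemonSet pods on
// the machine's node have been evicted, then releases the hook.
func reconcileMachine(ctx context.Context, c client.Client, machine *clusterv1.Machine) error {
	if _, ok := machine.Annotations[hookAnnotation]; !ok {
		return nil // hook not present (or already released); nothing to do
	}
	if machine.DeletionTimestamp.IsZero() || machine.Status.NodeRef == nil {
		return nil // only act while the machine is being deleted and has a node
	}

	// While this returns an error, the hook annotation stays in place and
	// Cluster API keeps waiting before terminating the infrastructure.
	if err := evictDaemonSetPods(ctx, machine.Status.NodeRef.Name); err != nil {
		return err
	}

	// All DaemonSet pods are gone: release the hook so deletion can proceed.
	patch := client.MergeFrom(machine.DeepCopy())
	delete(machine.Annotations, hookAnnotation)
	return c.Patch(ctx, machine, patch)
}

// evictDaemonSetPods is a placeholder for the actual eviction logic, e.g.
// listing DaemonSet-owned pods on the node, evicting them via the Eviction
// API, and returning an error until they have all terminated.
func evictDaemonSetPods(ctx context.Context, nodeName string) error {
	return nil
}
```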
/kind feature
Kind of sounds like https://github.com/kubernetes/kubernetes/issues/75482 :/
Yes, I think if https://github.com/kubernetes/kubernetes/issues/75482 were implemented it could potentially be used to implement this feature request.
/milestone Next
/kind proposal
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/lifecycle frozen
Based on experience, we are slowly surfacing knobs for machine deletion/drain, and this falls into that category. As documented above, this could require a small proposal.
/help
@fabriziopandini: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/lifecycle frozen
Based on experience, we are slowly surfacing knobs for machine deletion/drain, and this falls into that category. As documented above, this could require a small proposal.
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/triage accepted
This feature might be eventually supported with Declarative Node Maintenance: https://github.com/kubernetes/enhancements/pull/4213
/priority backlog
Do we know how cluster-autoscaler implemented this feature?
In general the DaemonSet controller will add a toleration for the Unschedulable taint to all DaemonSet Pods (https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#taints-and-tolerations).
So while it's possible to evict DaemonSet Pods, they will just be immediately re-created ("cordon" is basically ineffective for them because of that toleration).
I would guess they maybe added a cluster-autoscaler-specific taint to the Node?
In general it would be better if evicting DaemonSet Pods would be cleanly supported in core Kubernetes first.
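To illustrate the point, this is roughly what evicting a single DaemonSet pod looks like with client-go's Eviction API (a sketch; evictPod is a made-up helper), and why the eviction alone doesn't stick:

```go
package evictsketch

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictPod issues an Eviction for a single pod. For a DaemonSet-owned pod the
// eviction itself succeeds, but the DaemonSet controller immediately schedules
// a replacement on the same node: DaemonSet pods tolerate the
// node.kubernetes.io/unschedulable:NoSchedule taint, so cordoning the node
// does not keep the replacement away.
func evictPod(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	return client.PolicyV1().Evictions(namespace).Evict(ctx, &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Namespace: namespace, Name: name},
	})
}
```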
Took a quick look at the autoscaler's code.
It looks to me like they don't handle the fact that the DaemonSet controller schedules a new pod. They ignore that, evict the running pods once, and seem to have the following race (a rough sketch follows the link below):
- if the eviction was successful, it just continues with deletion (without listing pods again; only the pods that existed before the DaemonSet pods were evicted are gone)
- if it was not successful, it tries again, possibly with a different set of pods by then
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scaledown/actuation/group_deletion_scheduler.go#L100-L116
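Roughly, the pattern as I read it looks like the sketch below. This is a simplified paraphrase, not the actual autoscaler code; the function and helper names are mine:

```go
package autoscalersketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ownedByDaemonSet reports whether a pod is controlled by a DaemonSet.
func ownedByDaemonSet(pod *corev1.Pod) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}

// evictDaemonSetPodsOnce lists the node's pods a single time and evicts the
// DaemonSet-owned ones. Because the pod list is never refreshed, replacement
// pods created by the DaemonSet controller after this point are simply not
// seen by the rest of the deletion flow.
func evictDaemonSetPodsOnce(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		if !ownedByDaemonSet(pod) {
			continue
		}
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Namespace: pod.Namespace, Name: pod.Name},
		}
		if err := client.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction); err != nil {
			// On failure the caller retries later; by then the set of pods
			// on the node may already be different.
			return err
		}
	}
	return nil // on success, node deletion just continues from here
}
```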