Scale down a deployment by removing specific pods (PodDeletionCost)
Enhancement Description
- One-line enhancement description (can be used as a release note): allow users to decide which pods should be scaled first
- Kubernetes Enhancement Proposal: WIP
- Discussion Link: https://github.com/kubernetes/kubernetes/issues/45509
- Primary contact (assignee): @drbugfinder-work, @ahg-g
- Responsible SIGs: sig-apps
- Enhancement target (which target equals to which milestone):
  - Alpha release target (1.21)
  - Beta release target (1.22)
  - Stable release target (1.24)
- [x] Alpha
  - [x] KEP (`k/enhancements`) update PR(s): https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2255-pod-cost
  - [x] Code (`k/k`) update PR(s): https://github.com/kubernetes/kubernetes/pull/99163
  - [x] Docs (`k/website`) update PR(s): https://github.com/kubernetes/website/pull/26739
- [x] Beta
  - [x] KEP (`k/enhancements`) update PR(s): https://github.com/kubernetes/enhancements/pull/2619
  - [x] Code (`k/k`) update PR(s): https://github.com/kubernetes/kubernetes/pull/101080, https://github.com/kubernetes/kubernetes/pull/101003
  - [x] Docs (`k/website`) update(s): https://github.com/kubernetes/website/pull/28417
Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.
/sig apps
@annajung @JamesLaverack James, you mentioned in the sig-apps Slack channel that this enhancement is at risk; can you clarify why? It meets the criteria.
@ahg-g Just to follow up here too, we discussed in Slack and this was due to a delay in reviewing. We've now marked this as "Tracked" on the enhancements spreadsheet for 1.21.
Thank you for getting back to us. :)
Hi @ahg-g,
Since your Enhancement is scheduled to be in 1.21, please keep in mind the important upcoming dates:
- Tuesday, March 9th: Week 9 — Code Freeze
- Tuesday, March 16th: Week 10 — Docs Placeholder PR deadline
- If this enhancement requires new docs or modification to existing docs, please follow the steps in the Open a placeholder PR doc to open a PR against k/website repo.
As a reminder, please link all of your k/k PR(s) and k/website PR(s) to this issue so we can track them.
Thanks!
done.
Hi @ahg-g
Enhancements team is currently tracking the following PRs
- https://github.com/kubernetes/kubernetes/pull/99163
As this PR is merged, can we mark this enhancement complete for code freeze or do you have other PR(s) that are being worked on as part of the release?
Hi @JamesLaverack, yes the k/k code is merged; the docs PR is still open though.
/stage beta
/milestone v1.22
Hi @ahg-g,
1.22 release team here. After reviewing the KEP and the approved PRR, this enhancement is in good shape to meet the 1.22 Enhancements Freeze at 23:59:59 PST on Thursday, May 13.
I have one request to complete before Enhancements Freeze:
- In the kep.yaml, please update the value of `latest-milestone` to "v1.22"
Thank you!
@reylejano thanks, sent https://github.com/kubernetes/enhancements/pull/2705
Hi @ahg-g , Thank you for updating the kep.yaml. This enhancement is all set for the 1.22 Enhancements Freeze 🎉
Hello @ahg-g :wave:, 1.22 Docs Shadow here. This enhancement is marked as ‘Needs Docs’ for the 1.22 release.
Please follow the steps detailed in the documentation to open a PR against dev-1.22 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Fri July 9, 11:59 PM PDT.
Also, take a look at Documenting for a release to familiarize yourself with the docs requirement for the release.
Thank you! 🙏
Hi @ahg-g . Are there any code changes planned for the 1.22 release?
Only the doc PR is waiting to be merged, PR is already up for review.
Hi @ahg-g ,
Looks like only the following k/k PRs were needed for the 1.22 release. Just confirming since code freeze is about 2 weeks away on July 8, 2021.
- kubernetes/kubernetes#101080
- kubernetes/kubernetes#101003
Thank you!
Correct, we are just waiting on the docs PR.
Hi @ahg-g ,
Looks like you're all set for the 1.22 code freeze on July 8, 2021 🎉
I wanted to capture some discussion notes from twitter and others :)
A long time back I proposed a pod probe for this purpose. The pod serves HTTP with a known schema (e.g. JSON `{ cost uint32 }`). kubelet hits it periodically and publishes the cost into the pod status.
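A minimal sketch of what such a probe endpoint could look like; this is purely illustrative (the `/cost` path, the port, and the handler names are assumptions, and no such kubelet probe exists today):

```go
package main

import (
	"encoding/json"
	"net/http"
	"sync/atomic"
)

// currentCost holds whatever signal the application treats as its cost,
// e.g. in-flight requests or percentage of work completed.
var currentCost uint32

// costHandler serves the { "cost": uint32 } schema described above, so that
// an agent (the kubelet, in this proposal) could poll it periodically and
// publish the value into the pod status.
func costHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]uint32{"cost": atomic.LoadUint32(&currentCost)})
}

func main() {
	http.HandleFunc("/cost", costHandler)
	http.ListenAndServe(":8080", nil)
}
```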
The meaning of cost is opaque, but higher value = higher cost. Within a deployment or RS, we can assume all cost values are comparable (e.g. 100 means the same thing). We CAN NOT assume that across sets.
The pod is the thing that knows best whether it is 10% done or 90% done, or how many clients it has active. For things that don't know it themselves, a controller can do it.
Every time we build a feature on annotations we end up regretting it. I'd really like to reopen this discussion for a "real" API.
Also, as this constitutes an API, KEPs like this really need to have API approvers.
> Also, as this constitutes an API, KEPs like this really need to have API approvers.
It did, @liggitt was the reviewer, but I guess we forgot to mention it in the merged KEP.
> I wanted to capture some discussion notes from twitter and others :)
>
> A long time back I proposed a pod probe for this purpose. The pod serves HTTP with a known schema (e.g. JSON `{ cost uint32 }`). kubelet hits it periodically and publishes the cost into the pod status.

During the KEP review, there were concerns about updating the cost frequently and causing significant load on the api-server. We mention that explicitly in the Risks and Mitigations section, and warned users against doing it.

> The meaning of cost is opaque, but higher value = higher cost. Within a deployment or RS, we can assume all cost values are comparable (e.g. 100 means the same thing). We CAN NOT assume that across sets.

Sorry, I am not sure what you are proposing. But in the current implementation, which takes cost into account only on replicaset scale down, pods from the same replicaset are on the same scale. Cost has no meaning across replicasets.

> The pod is the thing that knows best whether it is 10% done or 90% done, or how many clients it has active. For things that don't know it themselves, a controller can do it.
>
> Every time we build a feature on annotations we end up regretting it. I'd really like to reopen this discussion for a "real" API.
ACK. Jordan very gently corrected me. I didn't mean to imply anything sneaky happened. :)
What you say about cost is the same as what I am saying. I am mostly arguing that annotations make really bad APIs and we almost always end up regretting it.
I came here to mirror a little discussion we had on twitter. @thockin beat me to it! I want to share additional notes for your consideration:
- I want to start by thanking you for this. I do appreciate the work, and I do appreciate the feature. I believe it is a step in the right direction.
- I second the view that APIs > annotations. APIs are versioned, validated, defaulted. Annotations OTOH are opaque string maps that are prone to typos and other errors. Yes, we can add hooks to validate them, but that comes with additional cost. I just want us to consider what will happen in some future where we or some other folks want to extend this feature to include a data-structure-like value. I am not saying it won't be possible, but it will be ugly to work with and even more prone to corner cases.
- The name "pod deletion cost" is a misnomer. Pods (especially of the same replica-set) have a more or less equal cost to *delete*. They do however have different costs to *start and bring to ready*, e.g. pods that have cached more data are likely to cost more to start elsewhere.
- Now as a *user*, I am thinking: OK, I need to tell the system that my pod is becoming more expensive to move or restart. I am not a Kubernetes developer, nor do I care about the SDK. But in order for me to use this feature I need to learn the entire stack (including various AuthN/AuthZ stuff + PATCH vs POST + state-management stuff such as resource versioning etc.) that I really don't use otherwise and that presents no value to me otherwise. This yields a higher barrier to entry/feature-use. I can maybe expose some metric so a cluster admin can query it and then use kubectl to update the annotation? But that is even more costly. A better approach for that is:
  - A unified metric somewhere that the system uses (but that implies that the software managing the metric is the same across all clusters and the metric itself is defined the same), and obviously it does mean that I need to learn more as a developer to use this feature. Again, the cost to use is too high.
  - A probe. I already know about these. That probe can either be as @thockin described, or the stdout of running a simple script that produces a versioned API instance serialized as json/yaml, not very different from the json/yaml I already use to create my app. As a developer I already know how to do these things. Hooking this to my `queue depth`, `# of inflight requests`, or `# of cached key value items` can be done using 100% know-how and expertise I already have (see the sketch after this list).
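As a rough illustration of the script variant: a tiny program that derives a cost from a signal the workload already has and prints it as JSON. Every name and the output shape here are made up for the sketch:

```go
package main

import (
	"encoding/json"
	"os"
)

// queueDepth stands in for whatever signal the workload already tracks
// (in-flight requests, cached items, ...); hard-coded here for illustration.
func queueDepth() int { return 42 }

func main() {
	// Emit a small, versioned-looking JSON document on stdout; a controller or
	// agent running this periodically could record the reported cost.
	report := map[string]interface{}{
		"apiVersion": "example.dev/v1alpha1", // hypothetical group/version
		"kind":       "ReportedCost",         // hypothetical kind
		"cost":       queueDepth(),
	}
	json.NewEncoder(os.Stdout).Encode(report)
}
```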
Now let us widen the scope a bit. The cost of delete (or move) is not just an RC thing. It is common information that is used by multiple users for a lot of use cases. For example:
- A cluster autoscaler trying to identify which node to scale down (nodes with lower pod-removal costs win).
- A cluster admin trying to make a choice between two nodes to take down for tests/maintenance/what-not.
- Kubelet trying to make a choice between two pods of the same QoS to evict in response to node memory pressure.

If you noticed, the deletion - or restart (we really need a better name) - cost is closely related to the reliability impact of voluntarily deleting or moving that pod elsewhere. But that cost is not scoped to the RC; it is scoped to all pods on the cluster, expressed as:
- Default cost. Again, this is not QoS. Two pods of the same QoS may have widely different costs to move. It allows the feature to be used even when pods don't report cost.
- Multiplier. This is a *positive* or *negative* number (or even a %) that multiplies the calculated cost. For example, as a cluster admin I know that the externally facing web tier has a lot of SLA impact and thus presents an additional reliability challenge when moved, so I will set its multiplier to `200%`. IOW, the multiplier differentiates pods on the same cluster by their nature.
- Reported cost (the probe thing discussed above). This overrides the default cost when reported.

That means `actual cost == default-cost * multiplier || reported-cost * multiplier`, favoring reported-cost when available (sketched just below).

As a user (consuming this info) I have more information to make a highly informed decision that presents the least surprises reliability-wise.
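A tiny sketch of that combination rule, using hypothetical names (none of these fields or helpers exist in Kubernetes):

```go
// effectiveCost combines the proposed values: the reported cost, when present,
// overrides the default cost, and the multiplier (expressed as a percentage,
// e.g. 200 for 200%) scales the result.
func effectiveCost(defaultCost int32, reportedCost *int32, multiplierPercent int32) int32 {
	cost := defaultCost
	if reportedCost != nil {
		cost = *reportedCost
	}
	return cost * multiplierPercent / 100
}
```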
From an API standpoint, it would look like this:

```
// add pod spec field for default cost
pod.spec.defaultCost = int
// add pod status field for reported-cost
pod.status.reportedCost = int
```
The multiplier can follow a cluster-wide config by introducing a new type that carries a selector and a multiplier value. This will help regulate teams/devs competing for a higher multiplier by RBACing the API.
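One possible shape for such a type, sketched with made-up names (this is not a proposed or accepted API):

```go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// CostMultiplier is a hypothetical cluster-scoped object: cluster admins create
// these, and RBAC on the type regulates who may raise a multiplier.
type CostMultiplier struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Selector picks the pods this multiplier applies to.
	Selector metav1.LabelSelector `json:"selector"`
	// MultiplierPercent scales the effective deletion/move cost, e.g. 200 for 200%.
	MultiplierPercent int32 `json:"multiplierPercent"`
}
```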
I do have one disagreement with this statement made above:
> During the KEP review, there were concerns about updating the cost frequently and causing significant load on the api-server. We mention that explicitly in the Risks and Mitigations section, and warned users against doing it.

Kubelet operates with status-update logic backed by a heavy cache. Kubelet is super pedantic about when it should patch pod status (i.e., call the api-server). So the risk above is not as high as portrayed. There is already logic there that avoids unnecessary patch calls and knows how to coalesce various status changes into one call.
Kal's going big :)
I strongly agree with @khenidak that "pod deletion cost" should be in the Pod status, not an annotation.
Thanks @khenidak for the great suggestions! I believe they can all be addressed as potential follow-ups. I am glad that there is a common understanding that this was a step in the right direction. The current implementation didn't require major changes (most importantly no kubelet changes), yet it already unblocked a number of use cases and started this conversation.
The status field vs annotation question was discussed during KEP review. We could revisit this decision and upgrade to a status field in the future; in fact it seems like using an annotation was a better decision in hindsight, since there seems to be appetite for something more advanced. Note that the api-server does validate the value of this annotation on update (it is considered a known annotation); not as good as what we can do with a proper status field, but it is something.
I am aware that Kubelet (and most core controllers for that matter) tries hard to optimize api-server calls. I just feel that this only adds yet another potential scalability concern; for example, consider the case where the pod publishes a fine-tuned cost that changes very frequently yet we rarely actually scale the deployment up or down (perhaps priority and fairness can help here by limiting the priority of such updates). In any case, this will be an interesting discussion during PRR review.
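For comparison with the proposals above, a minimal sketch of how the current annotation is set with client-go; the function name and the choice of a strategic-merge patch are assumptions, while `controller.kubernetes.io/pod-deletion-cost` is the actual annotation key:

```go
import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// setDeletionCost patches a single pod with the pod-deletion-cost annotation;
// within a ReplicaSet, pods with a lower cost are preferred for deletion on
// scale down. Per the KEP's risks-and-mitigations note, callers should update
// this rarely rather than on every small change.
func setDeletionCost(ctx context.Context, client kubernetes.Interface, namespace, pod string, cost int32) error {
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"%d"}}}`, cost))
	_, err := client.CoreV1().Pods(namespace).Patch(ctx, pod, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```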
@ahg-g I still want to see a couple of things:
1 - The ability for a Pod to (safely) update its own pod-deletion-cost:
Right now this is never going to be safe, as updating the annotation requires a PATCH API call, so running a sidecar container to update the pod-deletion cost would overload the apiserver.
2 - Some level of integration with HorizontalPodAutoscaler:
Right now, HPAs can't be used with pod-deletion-cost, as there is no way to intercept a scale-down event and annotate Pods before it happens.
3 - The possibility of non-ReplicaSet controllers taking pod-deletion-cost into account:
Like what @khenidak mentioned with cluster-autoscaler taking the Pod cost into account when deciding which Node to remove.
@ahg-g I would subscribe to iterating on this. But we have to start by moving the field from annotation to a status field :-)
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale