Scale down a deployment by removing specific pods (PodDeletionCost)
Enhancement Description
- One-line enhancement description (can be used as a release note): allow users to decide which pods should be scaled first
- Kubernetes Enhancement Proposal: WIP
- Discussion Link: https://github.com/kubernetes/kubernetes/issues/45509
- Primary contact (assignee): @drbugfinder-work, @ahg-g
- Responsible SIGs: sig-apps
- Enhancement target (which target equals to which milestone):
  - Alpha release target (1.21)
  - Beta release target (1.22)
  - Stable release target (1.24)
- [x] Alpha
  - [x] KEP (`k/enhancements`) update PR(s): https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2255-pod-cost
  - [x] Code (`k/k`) update PR(s): https://github.com/kubernetes/kubernetes/pull/99163
  - [x] Docs (`k/website`) update PR(s): https://github.com/kubernetes/website/pull/26739
- [x] Beta
  - [x] KEP (`k/enhancements`) update PR(s): https://github.com/kubernetes/enhancements/pull/2619
  - [x] Code (`k/k`) update PR(s): https://github.com/kubernetes/kubernetes/pull/101080, https://github.com/kubernetes/kubernetes/pull/101003
  - [x] Docs (`k/website`) update(s): https://github.com/kubernetes/website/pull/28417
Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.
/sig apps
@annajung @JamesLaverack James, you mentioned in the sig-apps Slack channel that this enhancement is at risk; can you clarify why? It meets the criteria.
@ahg-g Just to follow up here too, we discussed in Slack and this was due to a delay in reviewing. We've now marked this as "Tracked" on the enhancements spreadsheet for 1.21.
Thank you for getting back to us. :)
Hi @ahg-g,
Since your Enhancement is scheduled to be in 1.21, please keep in mind the important upcoming dates:
- Tuesday, March 9th: Week 9 — Code Freeze
- Tuesday, March 16th: Week 10 — Docs Placeholder PR deadline
- If this enhancement requires new docs or modification to existing docs, please follow the steps in the Open a placeholder PR doc to open a PR against k/website repo.
As a reminder, please link all of your k/k PR(s) and k/website PR(s) to this issue so we can track them.
Thanks!
done.
Hi @ahg-g
Enhancements team is currently tracking the following PRs
- https://github.com/kubernetes/kubernetes/pull/99163
As this PR is merged, can we mark this enhancement complete for code freeze or do you have other PR(s) that are being worked on as part of the release?
Hi @JamesLaverack, yes the k/k code is merged; the docs PR is still open though.
/stage beta
/milestone v1.22
Hi @ahg-g,
1.22 release team here. After reviewing the KEP and the approved PRR, this enhancement is in good shape to meet the 1.22 Enhancements Freeze at 23:59:59 PST on Thursday, May 13.
I have one request to complete before Enhancements Freeze:
- In the kep.yaml, please update the value of `latest-milestone` to "v1.22"
Thank you!
@reylejano thanks, sent https://github.com/kubernetes/enhancements/pull/2705
Hi @ahg-g , Thank you for updating the kep.yaml. This enhancement is all set for the 1.22 Enhancements Freeze 🎉
Hello @ahg-g :wave:, 1.22 Docs Shadow here. This enhancement is marked as ‘Needs Docs’ for the 1.22 release.
Please follow the steps detailed in the documentation to open a PR against dev-1.22 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Fri July 9, 11:59 PM PDT.
Also, take a look at Documenting for a release to familiarize yourself with the docs requirement for the release.
Thank you! 🙏
Hi @ahg-g . Are there any code changes planned for the 1.22 release?
Only the doc PR is waiting to be merged, PR is already up for review.
Hi @ahg-g ,
Looks like only the following k/k PRs were needed for the 1.22 release. Just confirming since code freeze is about 2 weeks away on July 8, 2021.
- kubernetes/kubernetes#101080
- kubernetes/kubernetes#101003
Thank you!
Correct, we are just waiting on the docs PR.
Hi @ahg-g ,
Looks like you're all set for the 1.22 code freeze on July 8, 2021 🎉
I wanted to capture some discussion notes from twitter and others :)
A long time back I proposed a pod probe for this purpose. The pod serves HTTP with a known schema (e.g. JSON `{ cost uint32 }`). kubelet hits it periodically and publishes the cost into the pod status.
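A minimal sketch of what such a probe endpoint could look like; this is purely illustrative (the `/cost` path, the port, and the handler names are assumptions, and no such kubelet probe exists today):

```go
package main

import (
	"encoding/json"
	"net/http"
	"sync/atomic"
)

// currentCost holds whatever signal the application treats as its cost,
// e.g. in-flight requests or percentage of work completed.
var currentCost uint32

// costHandler serves the { "cost": uint32 } schema described above, so that
// an agent (the kubelet, in this proposal) could poll it periodically and
// publish the value into the pod status.
func costHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]uint32{"cost": atomic.LoadUint32(&currentCost)})
}

func main() {
	http.HandleFunc("/cost", costHandler)
	http.ListenAndServe(":8080", nil)
}
```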
The meaning of cost is opaque, but higher value = higher cost. Within a deployment or RS, we can assume all cost values are comparable (e.g. 100 means the same thing). We CAN NOT assume that across sets.
The pod is the thing that knows best whether it is 10% done or 90% done, or how many clients it has active. For things that don't know it themselves, a controller can do it.
Every time we build a feature on annotations we end up regretting it. I'd really like to reopen this discussion for a "real" API.
Also, as this constitutes an API, KEPs like this really need to have API approvers.
> Also, as this constitutes an API, KEPs like this really need to have API approvers.
It did, @liggitt was the reviewer, but I guess we forgot to mention it in the merged KEP.
> I wanted to capture some discussion notes from twitter and others :)
>
> A long time back I proposed a pod probe for this purpose. The pod serves HTTP with a known schema (e.g. JSON `{ cost uint32 }`). kubelet hits it periodically and publishes the cost into the pod status.

During the KEP review, there were concerns about updating the cost frequently and causing significant load on the api-server. We mention that explicitly in the Risks and Mitigations section, and warned users against doing it.

> The meaning of cost is opaque, but higher value = higher cost. Within a deployment or RS, we can assume all cost values are comparable (e.g. 100 means the same thing). We CAN NOT assume that across sets.

Sorry, I am not sure what you are proposing. But in the current implementation, which takes cost into account only on replicaset scale down, pods from the same replicaset are on the same scale. Cost has no meaning across replicasets.

> The pod is the thing that knows best whether it is 10% done or 90% done, or how many clients it has active. For things that don't know it themselves, a controller can do it.
>
> Every time we build a feature on annotations we end up regretting it. I'd really like to reopen this discussion for a "real" API.
ACK. Jordan very gently corrected me. I didn't mean to imply anything sneaky happened. :)
What you say about cost is the same as what I am saying. I am mostly arguing that annotations make really bad APIs and we almost always end up regretting it.
I came here to mirror a little discussion we had on twitter. @thockin beat me to it! I want to share additional notes for your consideration:
- I want to start by thanking you for this. I do appreciate the work, and I do appreciate the feature. I believe it is a step in the right direction.
- I second the view that APIs > annotations. APIs are versioned, validated, defaulted. Annotations OTOH are opaque string maps that are prone to typos and other errors. Yes, we can add hooks to validate them, but that comes with additional cost. I just want us to consider what will happen in some future where we or some other folks want to extend this feature to include a data-structure-like value. I am not saying it won't be possible, but it will be ugly to work with and even more prone to corner cases.
- The name "pod deletion cost" is a misnomer. Pods (especially of the same replica-set) have a more or less equal cost to *delete*. They do however have different costs to *start and bring to ready*, e.g. pods that have cached more data are likely to cost more to start elsewhere.
- Now as a *user*, I am thinking: OK, I need to tell the system that my pod is becoming more expensive to move or restart. I am not a Kubernetes developer, nor do I care about the SDK. But in order for me to use this feature I need to learn the entire stack (including various AuthN/AuthZ stuff + PATCH vs POST + state-management stuff such as resource versioning etc.) that I really don't use otherwise and that presents no value to me otherwise. This yields a higher barrier to entry/feature-use. I can maybe expose some metric so a cluster admin can query it and then use kubectl to update the annotation? But that is even more costly. A better approach for that is:
  - A unified metric somewhere that the system uses (but that implies that the software managing the metric is the same across all clusters and the metric itself is defined the same), and obviously it does mean that I need to learn more as a developer to use this feature. Again, the cost to use is too high.
  - A probe. I already know about these. That probe can either be as @thockin described, or the stdout of running a simple script that produces a versioned API instance serialized as json/yaml, not very different from the json/yaml I already use to create my app. As a developer I already know how to do these things. Hooking this to my `queue depth`, `# of inflight requests`, or `# of cached key value items` can be done using 100% know-how and expertise I already have (see the sketch after this list).
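As a rough illustration of the script variant: a tiny program that derives a cost from a signal the workload already has and prints it as JSON. Every name and the output shape here are made up for the sketch:

```go
package main

import (
	"encoding/json"
	"os"
)

// queueDepth stands in for whatever signal the workload already tracks
// (in-flight requests, cached items, ...); hard-coded here for illustration.
func queueDepth() int { return 42 }

func main() {
	// Emit a small, versioned-looking JSON document on stdout; a controller or
	// agent running this periodically could record the reported cost.
	report := map[string]interface{}{
		"apiVersion": "example.dev/v1alpha1", // hypothetical group/version
		"kind":       "ReportedCost",         // hypothetical kind
		"cost":       queueDepth(),
	}
	json.NewEncoder(os.Stdout).Encode(report)
}
```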
Now let us widen the scope a bit. The cost of delete (or move) is not just an RC thing. It is common information that is used by multiple users for a lot of use cases. For example:
- A cluster autoscaler trying to identify which node to scale down (nodes with lower pod-removal costs win).
- A cluster admin trying to make a choice between two nodes to take down for tests/maintenance/what-not.
- Kubelet trying to make a choice between two pods of the same QoS to evict in response to node memory pressure.

If you noticed, the deletion - or restart (we really need a better name) - cost is closely related to the reliability impact of voluntarily deleting or moving that pod elsewhere. But that cost is not scoped to the RC; it is scoped to all pods on the cluster, expressed as:
- Default cost. Again, this is not QoS. Two pods of the same QoS may have widely different costs to move. It allows the feature to be used even when pods don't report cost.
- Multiplier. This is a *positive* or *negative* number (or even a %) that multiplies the calculated cost. For example, as a cluster admin I know that the externally facing web tier has a lot of SLA impact and thus presents an additional reliability challenge when moved, so I will set its multiplier to `200%`. IOW, the multiplier differentiates pods on the same cluster by their nature.
- Reported cost (the probe thing discussed above). This overrides the default cost when reported.

That means `actual cost == default-cost * multiplier || reported-cost * multiplier`, favoring reported-cost when available (sketched just below).

As a user (consuming this info) I have more information to make a highly informed decision that presents the least surprises reliability-wise.
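A tiny sketch of that combination rule, using hypothetical names (none of these fields or helpers exist in Kubernetes):

```go
// effectiveCost combines the proposed values: the reported cost, when present,
// overrides the default cost, and the multiplier (expressed as a percentage,
// e.g. 200 for 200%) scales the result.
func effectiveCost(defaultCost int32, reportedCost *int32, multiplierPercent int32) int32 {
	cost := defaultCost
	if reportedCost != nil {
		cost = *reportedCost
	}
	return cost * multiplierPercent / 100
}
```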
From an API standpoint, it would look like this:

```
// add pod spec field for default cost
pod.spec.defaultCost = int
// add pod status field for reported-cost
pod.status.reportedCost = int
```
The multiplier can follow a cluster-wide config by introducing a new type that carries a selector and a multiplier value. This will help regulate teams/devs competing for a higher multiplier by RBACing the API.
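One possible shape for such a type, sketched with made-up names (this is not a proposed or accepted API):

```go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// CostMultiplier is a hypothetical cluster-scoped object: cluster admins create
// these, and RBAC on the type regulates who may raise a multiplier.
type CostMultiplier struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Selector picks the pods this multiplier applies to.
	Selector metav1.LabelSelector `json:"selector"`
	// MultiplierPercent scales the effective deletion/move cost, e.g. 200 for 200%.
	MultiplierPercent int32 `json:"multiplierPercent"`
}
```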
I do have one disagreement with this statement made above:
> During the KEP review, there were concerns about updating the cost frequently and causing significant load on the api-server. We mention that explicitly in the Risks and Mitigations section, and warned users against doing it.

Kubelet operates with status-update logic backed by a heavy cache. Kubelet is super pedantic about when it should patch pod status (i.e., call the api-server). So the risk above is not as high as portrayed. There is already logic there that avoids unnecessary patch calls and knows how to coalesce various status changes into one call.
Kal's going big :)
I strongly agree with @khenidak that "pod deletion cost" should be in the Pod status, not an annotation.
Thanks @khenidak for the great suggestions! I believe they can all be addressed as potential follow-ups. I am glad that there is a common understanding that this was a step in the right direction. The current implementation didn't require major changes (most importantly no kubelet changes), yet it already unblocked a number of use cases and started this conversation.
The status field vs annotation question was discussed during KEP review. We could revisit this decision and upgrade to a status field in the future; in fact it seems like using an annotation was a better decision in hindsight, since there seems to be appetite for something more advanced. Note that the api-server does validate the value of this annotation on update (it is considered a known annotation); not as good as what we can do with a proper status field, but it is something.
I am aware that Kubelet (and most core controllers for that matter) tries hard to optimize api-server calls. I just feel that this only adds yet another potential scalability concern; for example, consider the case where the pod publishes a fine-tuned cost that changes very frequently yet we rarely actually scale the deployment up or down (perhaps priority and fairness can help here by limiting the priority of such updates). In any case, this will be an interesting discussion during PRR review.
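For comparison with the proposals above, a minimal sketch of how the current annotation is set with client-go; the function name and the choice of a strategic-merge patch are assumptions, while `controller.kubernetes.io/pod-deletion-cost` is the actual annotation key:

```go
import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// setDeletionCost patches a single pod with the pod-deletion-cost annotation;
// within a ReplicaSet, pods with a lower cost are preferred for deletion on
// scale down. Per the KEP's risks-and-mitigations note, callers should update
// this rarely rather than on every small change.
func setDeletionCost(ctx context.Context, client kubernetes.Interface, namespace, pod string, cost int32) error {
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"%d"}}}`, cost))
	_, err := client.CoreV1().Pods(namespace).Patch(ctx, pod, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```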
@ahg-g I still want to see a couple of things:
1 - The ability for a Pod to (safely) update its own pod-deletion-cost:
Right now this is never going to be safe, as updating the annotation requires a PATCH API call, so running a sidecar container to update the pod-deletion cost would overload the apiserver.
2 - Some level of integration with HorizontalPodAutoscaler:
Right now, HPAs can't be used with pod-deletion-cost, as there is no way to intercept a scale-down event and annotate Pods before it happens.
3 - The possibility of non-ReplicaSet controllers taking pod-deletion-cost into account:
Like what @khenidak mentioned with cluster-autoscaler taking the Pod cost into account when deciding which Node to remove.
@ahg-g I would subscribe to iterating on this. But we have to start by moving the field from annotation to a status field :-)
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale