
KEP-3189: Deployment/ReplicaSet Downscale Pod Picker

Open thesuperzapper opened this issue 3 years ago • 22 comments

  • One-line PR description: Initial PR for KEP-3189
  • Issue link: https://github.com/kubernetes/enhancements/issues/3189
  • Other comments: There are still some TODO sections, and I welcome help from reviewers in fixing any issues.

thesuperzapper avatar Jan 27 '22 08:01 thesuperzapper

CLA Signed

The committers are authorized under a signed CLA.

  • :white_check_mark: Mathew Wicks (49075a97204d9d263f119f0b25566523d41d0733)

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

:memo: Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please log a ticket with the Linux Foundation Helpdesk: https://support.linuxfoundation.org/
  • Should you encounter any issues with the Linux Foundation Helpdesk, send a message to the backup e-mail support address at: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot avatar Jan 27 '22 08:01 k8s-ci-robot

Welcome @thesuperzapper!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

k8s-ci-robot avatar Jan 27 '22 08:01 k8s-ci-robot

Hi @thesuperzapper. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 27 '22 08:01 k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: thesuperzapper. To complete the pull request process, please assign kow3ns after the PR has been reviewed. You can assign the PR to them by writing /assign @kow3ns in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Jan 27 '22 08:01 k8s-ci-robot

/assign

alculquicondor avatar Jan 27 '22 14:01 alculquicondor

/assign @ravisantoshgudimetla @atiratree

soltysh avatar Jan 27 '22 14:01 soltysh

I signed it

thesuperzapper avatar Jan 28 '22 00:01 thesuperzapper

/sig architecture

thesuperzapper avatar Jan 28 '22 00:01 thesuperzapper

@alculquicondor I have updated the PR based on your comments (commit https://github.com/kubernetes/enhancements/commit/8beae1b5f2f9e7ed7814ce7a711c8633fe6f46d3) and responded to others in the thread.

I have also added sig-architecture as a participating SIG, do you know who can review from their end?

thesuperzapper avatar Jan 28 '22 00:01 thesuperzapper

@dims can you take a look or assign someone?

alculquicondor avatar Jan 28 '22 14:01 alculquicondor

@alculquicondor - this definitely needs API review especially since it adds a "user-provided REST API"

/assign @liggitt @thockin

(API review please)

dims avatar Jan 28 '22 14:01 dims

This is the first time I have seen this proposal (but the idea is old :). I'm not super keen on the approach, but given that it's my first contact, I am open minded.

I dislike putting the cluster in the control path (yes, I dislike webhooks for the same reason). Forcing the user to run a helper-deployment per deployment just feels clunky. More than that, though, it feels like there HAS to be a simpler and more general way to do this.

My first instinct is a probe, but I guess before that I should ask - who knows which pod is best? Do the pods themselves have an idea of their own cost (e.g. number of connections open, or number of shards loaded)? Or do we have to assume only some external actor can possibly know?

As much as I dislike the annotation, it seems like a better answer than asking a 3rd party synchronously. The idea was, IIRC, that you could have your smart babysitter updating the annotations at whatever rate is appropriate. And if that is not scalable, we could break that out to a new resource or something (either 1:1 with Pod or 1:N like EndpointSlice).

Then, Deployment controller could simply say "get the PodCost of all pods labelled foo=bar". It's fundamentally async, which I see as a net positive.
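For concreteness, here is a rough Go sketch of that read path, assuming the existing controller.kubernetes.io/pod-deletion-cost annotation as the cost source; the label selector and function names are purely illustrative:

```go
package downscale

import (
	"context"
	"sort"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// costAnnotation is the existing pod-deletion-cost annotation (beta since 1.22).
const costAnnotation = "controller.kubernetes.io/pod-deletion-cost"

// podsByDeletionCost lists the Pods matching the selector and returns them
// cheapest-to-delete first; a missing or malformed annotation counts as cost 0,
// which matches the existing feature's behaviour.
func podsByDeletionCost(ctx context.Context, cs kubernetes.Interface, ns, selector string) ([]corev1.Pod, error) {
	list, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return nil, err
	}
	cost := func(p *corev1.Pod) int {
		c, _ := strconv.Atoi(p.Annotations[costAnnotation])
		return c
	}
	pods := list.Items
	sort.SliceStable(pods, func(i, j int) bool { return cost(&pods[i]) < cost(&pods[j]) })
	return pods, nil
}
```

Whatever writes the annotations can lag behind at its own pace; the downscale itself never blocks on an external call.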

The lack of an "Alternatives" section makes it hard to know where this is coming from. I am sure you've discussed some of these points already...


@thockin You are correct that we have discussed these issues quite a bit, in fact, my initial proposal https://github.com/kubernetes/kubernetes/issues/107598 included the concept of a "heuristic cost probe". However, implementing such a probe would be prohibitively complex, as the kube-controller is unable to make HTTP/exec probes on containers directly (this must be done by the Node's kubelet).

To help reviewers, I have listed all the alternatives I can think of here, and why they will not adequately solve the problem.

(I will update the KEP with these shortly)

Alternatives

ALTERNATIVE 1: Pod-Deletion-Cost annotation

Same as the current controller.kubernetes.io/pod-deletion-cost annotation.
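As a reference point, below is a minimal client-go sketch of how an external controller might keep this annotation up to date. The annotation key is the real one; the function name and cost value are illustrative. Each update is one PATCH against the kube-apiserver (the load concern in Usage Pattern 1 below), and clearing the annotation afterwards (the clean-up step in Usage Pattern 2) is another PATCH per Pod.

```go
package costwriter

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const costAnnotation = "controller.kubernetes.io/pod-deletion-cost"

// setDeletionCost patches one Pod's cost annotation. Passing clear=true sends
// a null value, which a JSON merge patch treats as "delete this key" (this is
// the post-downscale clean-up step discussed under Usage Pattern 2).
func setDeletionCost(ctx context.Context, cs kubernetes.Interface, ns, pod string, cost int, clear bool) error {
	value := fmt.Sprintf(`"%d"`, cost) // annotation values must be strings
	if clear {
		value = "null"
	}
	patch := fmt.Sprintf(`{"metadata":{"annotations":{%q:%s}}}`, costAnnotation, value)
	_, err := cs.CoreV1().Pods(ns).Patch(ctx, pod, types.MergePatchType, []byte(patch), metav1.PatchOptions{})
	return err
}
```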

USAGE PATTERN 1 - the annotation kept up-to-date with a controller:

  • PROBLEM 1: every update requires a PATCH call to the kube-apiserver (probably overloading it)
  • PROBLEM 2: this is wasteful for deployments that rarely scale down
    • calculating and updating the cost could be expensive, and that work is wasted if not actually used to downscale

USAGE PATTERN 2 - the annotation is updated only when downscaling:

  • PROBLEM 1: existing scaling tools like HorizontalPodAutoscaler can't be used
    • the annotation/status would need to be updated BEFORE downscaling, and we can't predict when the HorizontalPodAutoscaler will downscale
    • therefore, users must write their own scalers that update the annotation/status before downscaling (making it inaccessible for most users)
  • PROBLEM 2: after each downscale, any annotations added by the scaler will need to be cleared (they will become out of date)
    • related to this, we must wait to clear them before we can start the next downscale (or the old annotations will impact the new scale)
  • PROBLEM 3: costs may be out-of-date by the time the controller picks which pod to downscale
    • some apps like airflow/spark may have quite rapidly changing "best" pods to kill

ALTERNATIVE 2: Pod-Deletion-Cost http/exec probe

A new http/exec probe is created for Pods which returns their current pod-deletion-cost.
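No such probe type exists in Kubernetes today, so the following is only a sketch of what the Pod-side handler might look like if Alternative 2 were adopted; the path, port, and cost heuristic are all assumptions:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

// activeTasks would be maintained by the worker's own task loop; it stands in
// for whatever "how expensive am I to kill right now" signal the app has.
var activeTasks int64

func main() {
	http.HandleFunc("/deletion-cost", func(w http.ResponseWriter, r *http.Request) {
		// Return the current cost as a plain integer; the Node's kubelet (not
		// the kube-controller) would be the thing calling this, per PROBLEM 1 below.
		fmt.Fprintf(w, "%d", atomic.LoadInt64(&activeTasks))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```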

USAGE PATTERN 1 - probes are used every X seconds to update a Pod status field:

  • (Suffers from the same problems as "USAGE PATTERN 1" of "ALTERNATIVE 1")

USAGE PATTERN 2 - probes are ONLY used when downscaling:

  • PROBLEM 1: the controller cannot make a probe request to Pods (this must be done by the Node's kubelet)
    • to solve this you would need a complex system that has the controller "mark" the pods for probing by the kubelet
  • PROBLEM 2: to find the "best" Pod to kill, we would need to probe every single Pod, which is not scalable
    • to solve this, you would need to use a heuristic approach, e.g. only checking a sample of Pods and returning the lowest cost from the sample

ALTERNATIVE 3: Pod-Deletion-Cost API

A central user-managed API is queried by the controller and returns the pod-deletion-cost of each Pod. (Rather than returning a list of chosen_pods and tied_pods like in this KEP)
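To make the contrast concrete, here is a hedged Go sketch of the two response shapes. Only the chosen_pods / tied_pods field names come from this KEP; the request shape and every other name are assumptions for illustration:

```go
package pickerapi

// Alternative 3: every candidate Pod must be priced, so the service cannot
// exit early and does not know how many Pods will actually be removed.
type CostResponse struct {
	Costs map[string]int64 `json:"costs"` // pod name -> pod-deletion-cost, one entry per Pod
}

// This KEP's shape (request fields are assumptions; only the response field
// names chosen_pods / tied_pods come from the KEP text).
type PickRequest struct {
	Namespace string   `json:"namespace"`
	Pods      []string `json:"pods"`  // candidate Pods
	Count     int      `json:"count"` // how many the controller intends to remove
}

type PickResponse struct {
	ChosenPods []string `json:"chosen_pods"` // remove these first
	TiedPods   []string `json:"tied_pods"`   // equally acceptable alternatives
}
```

With the KEP-style response, the picker can stop as soon as it has found enough Pods it is happy to give up, which is exactly what the per-Pod cost API cannot do.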

  • PROBLEM 1: the user-managed API can't exit-early
    • because a pod-deletion-cost must be returned for each Pod, the API can't have the concept of a "free" Pod which, if found, can be returned immediately without checking the other Pods
  • PROBLEM 2: the user-managed API may calculate the pod-deletion-cost of more Pods than necessary
    • the API is unaware of how many Pods are actually planned to be removed, so will calculate the pod-deletion-cost of more Pods than is necessary

Other Comments:

I want to stress that people are very hungry for this feature: issue https://github.com/kubernetes/kubernetes/issues/4301 is still active after nearly 7 years, and issue https://github.com/kubernetes/kubernetes/issues/45509 has over 121 comments! (some advanced users have even forked kube-controller to implement this feature)

Questions:

  1. @thockin Do you have any other alternatives in mind?
  2. @thockin What are your specific concerns about synchronously calling the pod-picker API from the controller?

thesuperzapper avatar Feb 01 '22 06:02 thesuperzapper

Another option might be a probe that updates on-demand (e.g. a way to ask kubelets to update the cost metric "now-ish", plus a timestamp). E.g. HPA says "update costs for pods matching

Scaling concerns seem premature - SOMETHING is updating the control-plane for this new API, or else all the same arguments apply.

consider apps like airflow/spark where workers may accept new tasks on a second-to-second basis

If they have been updated O(seconds) ago, there isn't much cost to making a less-than-perfect decision here, right? How much additional work would we discard?

Do you have any other alternatives in mind?

Given that I have spent all of 2 hours today thinking about it, I don't have anything concrete, no. What I have are concerns about how this (fairly complex) proposal will stand over time. Don't get me wrong, I totally understand the need, but I don't know what literature exists on this topic, what other similar systems have done in the past, what works well and what doesn't FOR THIS SPECIFIC PROBLEM. It can't be the first time anyone has tried to solve it. Where's the exploration of those?

As another alternative (not a good one, but worth considering) - fork ReplicaSet and make an entirely new API that has the semantics you want, and show the community how well it works. Success is its own punishment :)

What are your specific concerns about synchronously calling the pod-picker API from the controller?

It seems that, if this is a successful API, users will need a babysitter Deployment for every app Deployment, running a little API which is almost as critical as their app. They will need monitoring and alerting and metrics to manage the babysitter. That seems like a poor bargain to me. I'd rather find a way to solve it for a whole class of users at once.

issue kubernetes/kubernetes#4301 is still active after nearly 7 years

Mostly for lack of a champion. I am glad to see someone step up to it - it's not an easy problem. The truth is that Kubernetes is becoming somewhat brittle with a million new things that solve one little use-case (not to minimize the need for this) and we NEED to be looking for places we can make broadly applicable fixes instead of small, domain-specific ones. And that takes TIME and EFFORT, which usually means people dry up and blow away before reaching a good answer.

and issue kubernetes/kubernetes#45509 has over 121 comments

121 is not a big number for a change of this magnitude! This is subtle and (with what very limited thinking I have put in today) feels ripe to break in bad ways.

thockin avatar Feb 01 '22 07:02 thockin

Scaling concerns seem premature - SOMETHING is updating the control-plane for this new API, or else all the same arguments apply.

@thockin I am confused about what you are saying here. In this Pod-Picker implementation, the request volume does not scale with the number of Pods: a single API request is made no matter how many Pods are running.

But all the other ideas ("probes", "annotations", "status fields") inherently mean more requests to the kube-apiserver for each extra Pod that is running.

consider apps like airflow/spark where workers may accept new tasks on a second-to-second basis

If they have been updated O(seconds) ago, there isn't much cost to making a less-than-perfect decision here, right? How much additional work would we discard?

This is true from a "work done" perspective, but losing work is not the only way a system can be impacted by a shard/worker being stopped. What I am trying to highlight with that problem is that patching annotations takes time, and using them to exchange pod-costs introduces an unnecessary delay compared with an approach that explicitly asks a system to pick which pods to kill.

Do you have any other alternatives in mind?

Given that I have spent all of 2 hours today thinking about it, I don't have anything concrete, no. What I have are concerns about how this (fairly complex) proposal will stand over time. Don't get me wrong, I totally understand the need, but I don't know what literature exists on this topic, what other similar systems have done in the past, what works well and what doesn't FOR THIS SPECIFIC PROBLEM. It can't be the first time anyone has tried to solve it. Where's the exploration of those?

It's not really that complex a problem; it's just that Kubernetes has never really tried to solve it. A simple statement of the problem is "I want to choose which Pods are removed when downscaling", and I think the Pod-Picker approach is just a clear way for people to do this.

As another alternative (not a good one, but worth considering) - fork ReplicaSet and make an entirely new API that has the semantics you want, and show the community how well it works. Success is its own punishment :)

I strongly believe this is not necessary, and that we will be able to design an elegant implementation that everyone will be happy with.

What are your specific concerns about synchronously calling the pod-picker API from the controller?

It seems that, if this is a successful API, users will need a babysitter Deployment for every app Deployment, running a little API which is almost as critical as their app. They will need monitoring and alerting and metrics to manage the babysitter. That seems like a poor bargain to me. I'd rather find a way to solve it for a whole class of users at once.

I am not sure that's something we need to worry about. Firstly, people who don't want to use this feature don't have to; secondly, if a Pod-Picker API is down, the kube-controller will just fall back to the current behavior; it's not like their containers will stop working.

We can prevent unnecessary dependence on Pod-Picker by stressing that Kubernetes will only use it on a "best-effort" basis. This also allows us to tell people that their Pod-Pickers can be "best-effort" themselves: for example, they don't always have to return a result (say, if 10,000 pods are requested), or they might return only a partial result.

issue kubernetes/kubernetes#4301 is still active after nearly 7 years

Mostly for lack of a champion. I am glad to see someone step up to it - it's not an easy problem. The truth is that Kubernetes is becoming somewhat brittle with a million new things that solve one little use-case (not to minimize the need for this) and we NEED to be looking for places we can make broadly applicable fixes instead of small, domain-specific ones. And that takes TIME and EFFORT, which usually means people dry up and blow away before reaching a good answer.

I don't believe a Pod-Picker API is domain-specific, it simply solves the problem "I want to choose which Pods are removed when downscaling", which is applicable across almost every type of app that runs on Kubernetes.

thesuperzapper avatar Feb 02 '22 09:02 thesuperzapper

Scaling concerns seem premature - SOMETHING is updating the control-plane for this new API, or else all the same arguments apply.

@thockin I am confused about what you are saying here. In this Pod-Picker implementation, the request volume does not scale with the number of Pods: a single API request is made no matter how many Pods are running.

A single request to this new API. But where is that API getting its information? It must be collecting information from each pod, or via some other mechanism that either polls or acts on demand. My point was that you shift the burden, but don't eliminate it.

Here's where I stand, the day before KEP freeze.

First, there's no approver or reviewer in kep.yaml.

I don't feel equipped to approve this (yet). I don't have industry context in this area, and I don't have time to do my own research before EOD tomorrow. Clearly some conversations must have happened (right?) about why this option is less bad than all other options, but the KEP doesn't really explain that. I don't see sig-apps leads or sig-autoscaling leads here (apologies if I am just missing their approvals).

What I see is (to my eyes) worryingly complex and sets a precedent I am not sure we want to set. It's using "scalability" as a way to dismiss other options without much detailed testing or analysis (was that done?).

I think the way to proceed here is to see how much of a proof-of-concept you can do out-of-tree to make a more convincing argument.

thockin avatar Feb 02 '22 23:02 thockin

Here's where I stand, the day before KEP freeze.

@thockin to be clear, I was not trying to get this in for 1.24, it just happened to be proposed around the freeze time!

First, there's no approver or reviewer in kep.yaml.

I think I need support from @alculquicondor or @soltysh in assigning specific reviewers to keep this progressing.

I don't feel equipped to approve this (yet). I don't have industry context in this area, and I don't have time to do my own research before EOD tomorrow. Clearly some conversations must have happened (right?) about why this option is less bad than all other options, but the KEP doesn't really explain that. I don't see sig-apps leads or sig-autoscaling leads here (apologies if I am just missing their approvals).

We have discussed this in the SIG-Apps meetings, and in those @soltysh and I came to the conclusion that raising a KEP for this proposal was the next step, so we can use the KEP process to nail down the correct solution (assuming there is one).

Don't worry, we aren't asking you to approve this all on your own! Instead, we need your expert input on the approach that is most likely to succeed so the KEP can be provisionally approved.

This is what makes KEPs great, they help us track complex changes, and allow for proposals to be killed at any phase, possibly even before alpha implementation (if a provisionally approved KEP fails to meet a provision).

What I see is (to my eyes) worryingly complex and sets a precedent I am not sure we want to set. It's using "scalability" as a way to dismiss other options without much detailed testing or analysis (was that done?).

I am all for simplicity and would love it if there were a "simpler" solution; however, I am not sure it's fair to describe this as "worryingly complex". To make app-aware decisions on downscale, the application must provide its "preferred pods" in some way. This KEP proposes that way as an API which the controller asks directly, and this seems like the most direct approach to me (but I may be wrong, hence this discussion).

I think the way to proceed here is to see how much of a proof-of-concept you can do out-of-tree to make a more convincing argument.

I think we should try to stay within the KEP framework, and shape this KEP into something that seems likely to succeed and be accepted (if implemented). Without that agreement, it seems unfair to ask people to start implementing something that has no clear path to approval.

For example, we may provisionally accept this KEP, on the condition that any proposed alpha implementation PR meets some requirements, like "at most X impact to controller SLO", "not impacting xxxx", etc.

But before then, we need to come to an agreement on some of the finer details like API specification, for example.

thesuperzapper avatar Feb 04 '22 10:02 thesuperzapper

Before proceeding with this proposal, I think it's worth exploring the area as Tim suggested and providing that research as a source for the proposal. Based on that, we can make a more informed decision.

soltysh avatar Feb 04 '22 13:02 soltysh

A working PoC is worth a hundred KEP reviews :)

thockin avatar Feb 04 '22 20:02 thockin

@sftim @thockin @alculquicondor I have pushed commit https://github.com/kubernetes/enhancements/pull/3190/commits/5086ff87ef00561292a2b0d612e3f06a7072451f, which addresses some of the currently outstanding comments, and makes some of the harder-to-read sections more clear.

thesuperzapper avatar Mar 10 '22 08:03 thesuperzapper

I appreciate the continued discussion, but it's less than a week to KEP time again (sigh) and this seems far from resolved - is it fair to assume it's not aiming to go into 1.25?

thockin avatar Jun 10 '22 23:06 thockin

@sftim @thockin @alculquicondor I think I have a better idea than what I proposed in this KEP, which builds on the existing pod-deletion-cost work rather than replacing it.

My initial thoughts can be found at https://github.com/kubernetes/kubernetes/issues/107598#issuecomment-1210060884. If we think the new approach is better than this KEP, let's shelve this one and I will write up a new KEP.


The gist of the idea is that we can make pod-deletion-cost a more transient value (rather than only storing it in annotations) by extending the /apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale API: when a caller sends a PATCH which reduces replicas, they can include the pod-deletion-cost of one or more Pods. These costs only affect the current down-scale (unlike the annotations, which must be manually cleared after scaling to remove their effect).
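To make that concrete, here is a sketch of the payload this would imply. The /scale subresource and spec.replicas are real; the podDeletionCosts field does not exist in any released Kubernetes API and is purely the shape suggested above:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// scaleDownPatch mirrors the payload the idea above implies. podDeletionCosts
// is NOT part of any released Kubernetes API; it exists here only to
// illustrate the proposal.
type scaleDownPatch struct {
	Spec struct {
		Replicas         int32            `json:"replicas"`
		PodDeletionCosts map[string]int64 `json:"podDeletionCosts,omitempty"`
	} `json:"spec"`
}

func main() {
	var p scaleDownPatch
	p.Spec.Replicas = 3 // e.g. scaling the Deployment from 5 down to 3
	p.Spec.PodDeletionCosts = map[string]int64{
		"worker-abc12": -100, // cheapest to remove right now
		"worker-def34": 500,
	}
	body, _ := json.MarshalIndent(p, "", "  ")
	// Hypothetically sent as:
	//   PATCH /apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale
	fmt.Println(string(body))
}
```

Because the costs travel with the replica change itself, they apply only to that one downscale and never need to be cleaned up afterwards.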

thesuperzapper avatar Aug 11 '22 23:08 thesuperzapper

Will this KEP be updated with the newer ideas, or will you open a new one? I'd suggest closing this and opening a new one, just to clear the history :)

thockin avatar Sep 21 '22 23:09 thockin

@thockin, I agree that a separate KEP for the proposal in https://github.com/kubernetes/kubernetes/issues/107598#issuecomment-1210060884 is a good idea.

I am currently a bit bogged down, but after that, I will look into making the new KEP.

(Ironically, what I am working on is an autoscaler for Apache Airflow Celery workers that uses the pod-deletion-cost feature, which is giving me lots of insight into how we should build the scale API changes.)

thesuperzapper avatar Sep 28 '22 07:09 thesuperzapper