
KEP-2021: Support scaling HPA to/from zero pods for object/external metrics

Open johanneswuerbach opened this issue 4 years ago • 43 comments

Enhancement issue: https://github.com/kubernetes/enhancements/issues/2021

Rendered version https://github.com/johanneswuerbach/enhancements/blob/kep-2021-alpha/keps/sig-autoscaling/2021-scale-from-zero/README.md

johanneswuerbach avatar Sep 26 '20 19:09 johanneswuerbach

Welcome @johanneswuerbach!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

k8s-ci-robot avatar Sep 26 '20 19:09 k8s-ci-robot

Hi @johanneswuerbach. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 26 '20 19:09 k8s-ci-robot

@kikisdeliveryservice thank you, folder structure fixed.

johanneswuerbach avatar Sep 28 '20 07:09 johanneswuerbach

/assign @gjtempleton

I hoped to get the current state documented first, before we start talking about how we could move towards beta. Does that make sense?

johanneswuerbach avatar Sep 30 '20 19:09 johanneswuerbach

Hey @johanneswuerbach

I think that's a good plan for now.

gjtempleton avatar Oct 01 '20 13:10 gjtempleton

@johanneswuerbach @gjtempleton how's this going so far? Can I be of any help?

jeffreybrowning avatar Oct 22 '20 16:10 jeffreybrowning

@jeffreybrowning yes. My assumption was that we could merge this KEP as-is to document the current state and then iterate on it towards beta. I'll try to present it at the next sig-autoscaling meeting to get some input and discuss next steps, but if you have any input I'm happy to incorporate it already.

johanneswuerbach avatar Oct 24 '20 19:10 johanneswuerbach

@johanneswuerbach missed autoscaling meeting today -- do you have clarity on next steps?

jeffreybrowning avatar Oct 26 '20 18:10 jeffreybrowning

Me too. I assumed those are bi-weekly, or are the meetings held on demand?

johanneswuerbach avatar Oct 26 '20 20:10 johanneswuerbach

In all honesty, it would have been my first one -- the work you started on this feature for beta has encouraged me to get involved and help you push this through.

It will really help when using an HPA + async job queue to scale down to 0 workers when no tasks are being processed.

jeffreybrowning avatar Oct 27 '20 14:10 jeffreybrowning

Hey, the meetings are held weekly at 14:00 UTC. You can see more info, including a link to the agenda if you want to raise it, here.

I've raised this at a previous meeting, so the community's already aware this work has started, but it would be good to have some more in-depth discussion.

gjtempleton avatar Oct 27 '20 14:10 gjtempleton

Thanks, I got confused by the mention of biweekly here https://github.com/kubernetes/community/tree/master/sig-autoscaling#meetings and assumed the calendar invite was wrong. I'll ask upfront next time :-)

johanneswuerbach avatar Oct 27 '20 14:10 johanneswuerbach

Pinging here. How's this enhancement coming along? What are the next steps?

jeffreybrowning avatar Nov 30 '20 17:11 jeffreybrowning

Pinging back before the holidays hit.

What are the next steps?

jeffreybrowning avatar Dec 19 '20 02:12 jeffreybrowning

Sorry for leaving this lingering, @jeffreybrowning. It wasn't forgotten, just work happened :-(

@gjtempleton during the SIG meeting you said that merging the KEP documenting only the alpha state would be okay. I think that is done, or do you see anything missing?

For the beta stage, the following needs to happen (an illustrative example of the current alpha behaviour follows the list):

  • Figure out how to handle k8s upgrades/downgrades when minReplicas: 0 is set. Currently this is worked around by the feature gate, but a solution could look like https://github.com/kubernetes/kubernetes/pull/74526#issuecomment-497549411
  • Resolve the issue that going from minReplicas: 0 to 1 seems to not trigger a scale-up, leaving the scaled resource at 0 replicas.
  • Figure out how to deal with the fact that the HPA can no longer be disabled, which was previously done by setting replicas: 0 on the scaled resource.
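
For context, a minimal sketch of what a scale-to-zero HPA looks like under the current alpha feature gate. The Deployment name and the external metric below are placeholders for illustration, not anything defined by the KEP:

```yaml
# Only accepted when the HPAScaleToZero feature gate is enabled, and only
# when at least one Object or External metric is configured.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa               # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker                 # hypothetical Deployment
  minReplicas: 0                 # scale to zero when the metric reports no load
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: queue_messages_ready   # placeholder external metric
        selector:
          matchLabels:
            queue: tasks
      target:
        type: AverageValue
        averageValue: "30"
```

The second and third points above are about exactly this shape of object: what should happen when minReplicas is later raised from 0 to 1, and how a workload can still be "paused" now that setting replicas: 0 on the scaled resource no longer disables the HPA.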

johanneswuerbach avatar Dec 20 '20 14:12 johanneswuerbach

No worries @johanneswuerbach, I didn't want to rush you. I just think your work is already important and want to see this feature reach beta!

Asking some questions to spin myself up:

  1. Is an HPA with minReplicas: 0 leading to the HPA being disabled an intended feature? Is there a known rationale behind wanting a way to disable an HPA rather than just removing it? It seems to me that not many k8s resources have intentional disabled states, and instead require deletion.
  2. Regarding "Resolve the issue that going from minReplicas: 0 to 1 seems to not trigger a scale-up and the scaled resource stays at 0 replicas": is there an issue for this already? Just to be clear, does this mean that the HPA increasing its recommended replica count does not cause the scaled resource to scale?

The recommendation on the upgrade to beta for how to handle downgrades makes sense: since downgrades would require the feature gate to be set manually, moving down to a version with the feature-gate requirement will not be a surprise.
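
For anyone wanting to try the current alpha behaviour, the HPAScaleToZero gate has to be enabled on the control plane. A kubeadm-style sketch of how that might look (the exact cluster setup here is an assumption on my side; enabling it on both components is the cautious option, and on managed clusters the gate may not be configurable at all):

```yaml
# kubeadm ClusterConfiguration snippet enabling the alpha HPAScaleToZero gate
# on the API server (which validates minReplicas: 0) and the controller
# manager (which runs the HPA controller).
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: "HPAScaleToZero=true"
controllerManager:
  extraArgs:
    feature-gates: "HPAScaleToZero=true"
```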

jeffreybrowning avatar Dec 30 '20 23:12 jeffreybrowning

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Mar 30 '21 23:03 fejta-bot

Can we remove the stale state on this? I think this is a very important and useful feature for Kubernetes.

MattJeanes avatar Mar 31 '21 00:03 MattJeanes

/remove-lifecycle stale

gjtempleton avatar Mar 31 '21 00:03 gjtempleton

Sorry for the lack of updates. I pushed a refined proposal around handling disabling and updating from 0 to 1.

Let me know what you think.

johanneswuerbach avatar Apr 11 '21 21:04 johanneswuerbach

@elmiko thanks for your review, I've updated the KEP now with a different proposal. Let me know what you think.

johanneswuerbach avatar May 11 '21 20:05 johanneswuerbach

Going to risk bumping the thread here since it's been a few weeks. @elmiko

jeffreybrowning avatar May 28 '21 15:05 jeffreybrowning

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 30 '21 15:08 k8s-triage-robot

/remove-lifecycle stale

Not sure if I have permission to do that but I'll give it a go.

I believe this feature remains a great addition to Kubernetes, and while this can be achieved with something like KEDA, I think something like this should be part of the core Kubernetes experience.

MattJeanes avatar Aug 30 '21 15:08 MattJeanes

Where does this feature stand, and how can the community help move it along? Currently my team is considering using https://github.com/dailymotion-oss/osiris as a workaround, but given it's "experimental", we'd prefer to use this feature, should it be released.

noahlz avatar Oct 19 '21 13:10 noahlz

@johanneswuerbach anything we can do to help you with this proposal?

JCMais avatar Nov 14 '21 21:11 JCMais

/assign @josephburnett

johanneswuerbach avatar Feb 07 '22 22:02 johanneswuerbach

Just want to add a note of support for completing this as soon as possible. I think you are saying the feature will be beta in 1.25, and many in the industry wish it were sooner. The last relevant code fix I could find was in 1.21. I assume we cannot declare this beta in 1.23?

c3ivodujmovic avatar Feb 11 '22 00:02 c3ivodujmovic

Sadly not: the existing design has some issues, which are planned to be addressed in the beta iteration. AFAIK 1.24 is already closed for new features, so 1.25 seems to be the earliest target.

Where can you help? First, this KEP needs a review to see whether there are any additional blockers; otherwise I guess it would also be good to start on the implementation, if you have time.

johanneswuerbach avatar Feb 11 '22 15:02 johanneswuerbach

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 12 '22 15:06 k8s-triage-robot