autoscaler
Adds feature to delete crashlooping pods after eviction fails
Which component this PR applies to?
vertical-pod-autoscaler
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds a new experimental feature, enabled via a command-line flag, which attempts a pod deletion should the eviction fail for whatever reason. The deletion is bound to several conditions so that a pod is not blindly deleted in cases where the eviction failure might be justified. In general, the deletion only occurs if at least one container was OOMKilled, is currently in CrashLoopBackOff, and its number of restarts exceeds the configured threshold (also behind a flag).
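Roughly, the deletion gate can be sketched like this (illustrative only; the actual helper name, flag wiring, and exact status checks in this PR may differ):

```go
package updater

import (
	corev1 "k8s.io/api/core/v1"
)

// canDeleteAfterFailedEviction is a sketch of the condition described above:
// only delete if at least one container was OOMKilled, is currently in
// CrashLoopBackOff, and has restarted more often than the configured threshold.
func canDeleteAfterFailedEviction(pod *corev1.Pod, restartThreshold int32) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		oomKilled := cs.LastTerminationState.Terminated != nil &&
			cs.LastTerminationState.Terminated.Reason == "OOMKilled"
		crashLooping := cs.State.Waiting != nil &&
			cs.State.Waiting.Reason == "CrashLoopBackOff"
		if oomKilled && crashLooping && cs.RestartCount > restartThreshold {
			return true
		}
	}
	return false
}
```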
Which issue(s) this PR fixes:
Fixes #4730
Special notes for your reviewer:
Does this PR introduce a user-facing change?
New feature to delete crashlooping pods after eviction fails
Welcome @RuriRyan!
It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.
You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.
You can also check if kubernetes/autoscaler has its own contribution guidelines.
You may want to refer to our testing guide if you run into trouble with your tests not passing.
If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!
Thank you, and welcome to Kubernetes. :smiley:
Thanks. I think it would be good to have an enhancement proposal for this.
In the enhancement proposal I'd like to see:
- Short explanation why we want this
- Explanation of why to do this in VPA instead of in PDB
- Description of the implementation
I was writing in a hurry. Let me try once more.
Thanks for sharing this. I'll take a look when I have some time (I expect next week).
Reading the PR will help me understand how much complexity implementing this would add to VPA (which is one concern).
I'd still like to understand in some more detail why implementing this as part of PDB is a problem. If we could have support for evicting these pods in PDB, that would be better. But that's an "if". On the other hand, if implementing this in PDB is not an option, I'd like to know why, but then I guess it makes sense to make this improvement in VPA.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: RuriRyan
Once this PR has been reviewed and has the lgtm label, please assign jbartosik for approval by writing /assign @jbartosik in a comment. For more information see: The Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/kind api-change
/kind documentation
LGTM. I think I need someone who knows more about the API to take a look too.
This PR may require API review.
If so, when the changes are ready, complete the pre-review checklist and request an API review.
Status of requested reviews is tracked in the API Review project.
I meant to leave those comments on the PR with KEP
@jbartosik I'm considering this "done" now and am patiently awaiting your review :D
/hold
While I'm working on VPA 0.12.0 I don't want to make any changes to VPA that are not related to the release (I think the code is good to go but the E2E tests are timing out).
I'll remove the hold after I release VPA 0.12.0.
/hold cancel
@RuriRyan I see this has a few open comments, mostly small things.
@avorima Can you take over? :D
FYI: I have left IONOS and no longer have access to this code.
@jbartosik AFAICT all open review comments have either been fixed or answered
We'd also love to see this PR move forward and eventually get merged.
/cc @jbartosik
@timoreimann looks good to me
There's one thing remaining - checking if this works end to end.
Did you run any (manual?) tests to verify this works as intended?
Can you provide an e2e test that will ensure this keeps working?
I thought that I could take the OOM test scenario and modify it:
- Same scenario but with PDB preventing VPA from evicting any pods
- Same scenario with PDB and VPA allowed to delete, verify that recommendation is applied
I'm not sure when I'll be able to do that. I'm worried about merging this without any verification that this works as intended end to end
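For the PDB in those scenarios, I'd assume something like this blocking budget (sketch only; the actual e2e fixture, names, and selector may be set up differently):

```go
package autoscaling

import (
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// blockingPDB builds a PodDisruptionBudget that disallows any voluntary
// eviction of the test pods, so VPA's eviction attempts fail and the delete
// fallback from this PR would be exercised. Names are illustrative.
func blockingPDB(namespace string) *policyv1.PodDisruptionBudget {
	maxUnavailable := intstr.FromInt(0)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "hamster-pdb", Namespace: namespace},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "hamster"},
			},
		},
	}
}
```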
@timoreimann I want to merge this but need some verification that this works e2e. I see the following options:
- Someone writes e2e tests and sends a PR so that we can merge them at the same time
- Someone verifies this works e2e manually; we merge, then add e2e tests later
I can contribute e2e tests but I'm not sure when. I don't think 2. is significantly faster than 1.
Writing e2e tests sounds like the right thing to do. I'm happy to look into that when I have some spare time unless someone beats me to it.
I just saw unhealthyPodEvictionPolicy was added for PDBs (see KEP and API change in PDB).
We also saw the original issue happen quite a few times and would want to get this fixed, but the change to PDB seems to solve the problem in a more "k8s native" way instead of working around it with the delete call.
@jbartosik @avorima @timoreimann wdyt?
@voelzmo thanks for sharing the KEP. I wasn't aware of it, this does sound like the right approach going forward.
That said, I think there's value in driving this PR to completion regardless: the solution in the KEP will only be available in Kubernetes 1.26 in alpha. Users on older versions will still want some kind of solution (at least we do, given that a large portion of our fleet is affected by the problem regularly). It also seems possible to transition from what this PR proposes to what the KEP outlines by checking for the new field in the PDB once available and honoring its value with higher precedence. At least, that's how I understand it should work.
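For illustration, that precedence check could look roughly like this (just a sketch; it assumes the updater has the relevant PDB at hand, and the field only exists in the policy/v1 types from Kubernetes 1.26 on):

```go
package updater

import (
	policyv1 "k8s.io/api/policy/v1"
)

// useDeleteFallback sketches the precedence idea: if the PDB already opts in
// to evicting unhealthy pods via unhealthyPodEvictionPolicy (Kubernetes 1.26+),
// defer to it and skip the VPA-side delete fallback proposed in this PR.
func useDeleteFallback(pdb *policyv1.PodDisruptionBudget, fallbackEnabled bool) bool {
	if pdb != nil && pdb.Spec.UnhealthyPodEvictionPolicy != nil &&
		*pdb.Spec.UnhealthyPodEvictionPolicy == policyv1.AlwaysAllow {
		return false // eviction of crashlooping pods is already allowed
	}
	return fallbackEnabled
}
```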
Just my 2 cents of course, final call is to be made by @jbartosik. Coincidentally, I have started working on extending the end-to-end tests as proposed just earlier today, so I could probably add the last missing piece fairly soon.
On second thought, the feature from this PR will likely only be backported so far (if at all), and / or more recent versions of VPA may be incompatible with older Kubernetes versions? So maybe there is only so much value users of older Kubernetes releases could get out of this PR?
I'm not super familiar with VPA's / CA's release policy and compatibility guarantees, so throwing out additional thoughts mostly. @jbartosik still to the rescue for more clarity and preferences. 🙂
EDIT: had to remind myself that VPA is versioned separately from CA and based on its own CRD, so perhaps my second thought concerns aren't as concerning after all, and continuing with the PR would still be beneficial.
Thanks for sharing @voelzmo, that looks promising.
This PR and the VPA KEP were just created because, at the time, it seemed like this was never going to make it into the PDB spec, so I would actually be fine with closing it now. The short time that it would be useful would probably be outweighed by the effort it takes to maintain it. Only speaking for myself of course.
I'm happy to use PDB features instead of working around PDB in VPA. AlwaysAllow policy sounds like it does what this was intended to do (from https://github.com/kubernetes/kubernetes/pull/113375/):
// AlwaysAllow policy means that all running pods (status.phase="Running"),
// but not yet healthy are considered disrupted and can be evicted regardless
// of whether the criteria in a PDB is met. This means perspective running
// pods of a disrupted application might not get a chance to become healthy.
// Healthy pods will be subject to the PDB for eviction.
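For reference, a PDB opting into that policy could be built roughly like this (illustrative sketch using the policy/v1 Go types; the field requires Kubernetes 1.26+, and the name, namespace, and selector are made up):

```go
package example

import (
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// examplePDB builds a PodDisruptionBudget that keeps one replica available but
// allows eviction of pods that are running yet unhealthy (e.g. crashlooping),
// which is what this PR tried to achieve on the VPA side via delete.
func examplePDB() *policyv1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(1)
	policy := policyv1.AlwaysAllow
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "example-pdb", Namespace: "default"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "example"},
			},
			UnhealthyPodEvictionPolicy: &policy,
		},
	}
}
```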
Ok, what about the KEP? I'm not familiar with the process, but I suppose it needs to be removed or updated to document this decision.
I added an item about this to the next SIG meeting
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/close
@avorima: Closed this PR.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.