autoscaler icon indicating copy to clipboard operation
autoscaler copied to clipboard

Adds feature to crashlooping delete pods after evition fails

Open RuriRyan opened this issue 3 years ago • 15 comments

Which component this PR applies to?

vertical-pod-autoscaler

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds a new experimental feature, which can be enabled via a command line flag which attempts a pod deletion should the eviction fail for whatever reason. The deletion is bound to various conditions to not blindly delete a pod where the eviction failure might be justified. In general the deletion only occurs if at least one container was OOMKilled and is currently in CrashLoopBackoff and the number of restarts exceeeds the configured threshold (also behind a flag).

Which issue(s) this PR fixes:

Fixes #4730

Special notes for your reviewer:

Does this PR introduce a user-facing change?

New feature to delete crashlooping pods after eviction fails

RuriRyan avatar May 19 '22 09:05 RuriRyan

Welcome @RuriRyan!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

k8s-ci-robot avatar May 19 '22 09:05 k8s-ci-robot

Thanks. I think it would be good to have an enhancement proposal for this.

In the enhancement proposal I'd like to see:

  • Short explanation why we want this
  • Explanation why do this in VPA instead of in PDB
  • Description of the implementation

jbartosik avatar May 19 '22 10:05 jbartosik

I was writing in hurry. Let me try once more.

Thanks for sharing this. I'll take a look when I have some time (I expect next week).

Reading the PR will help me understand how much complexity implementing this would add to VPA (which is one concern).

I still would like to understand in some more detail why implementing this as a part of PDB is a problem. If we could have support for evicting pods in PDB it would be better. But that's "if". On the other hand if implementing this in PDB is not an option I'd like to know why but I guess it makes sense to make an improvement in VPA.

jbartosik avatar May 19 '22 16:05 jbartosik

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: RuriRyan Once this PR has been reviewed and has the lgtm label, please assign jbartosik for approval by writing /assign @jbartosik in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Jul 25 '22 10:07 k8s-ci-robot

/kind api-change

jbartosik avatar Jul 25 '22 12:07 jbartosik

/kind documentation

jbartosik avatar Jul 25 '22 12:07 jbartosik

LGMT I think I need someone who knows more about API to take a look too

jbartosik avatar Jul 25 '22 12:07 jbartosik

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

k8s-triage-robot avatar Jul 25 '22 12:07 k8s-triage-robot

I meant to leave those comments on the PR with KEP

jbartosik avatar Jul 25 '22 12:07 jbartosik

@jbartosik I 'm considering this as "done" now and am patiently awaiting your review :D

RuriRyan avatar Sep 01 '22 15:09 RuriRyan

/hold

While I'm working on VPA 0.12.0 I don't want to make any changes to VPA that are not related to the release (I think code is good to go but E2E tests are timing out).

I'll remove the hold after I release VPA 0.12.0.

jbartosik avatar Sep 02 '22 13:09 jbartosik

/hold cancel

jbartosik avatar Sep 20 '22 11:09 jbartosik

@RuriRyan I see this has a few open comments, mostly small things.

jbartosik avatar Oct 10 '22 11:10 jbartosik

@RuriRyan I see this has a few open comments, mostly small things.

@avorima Can you take over? :D

Fyi: I have left ionos and no longer have access to this Code.

RuriRyan avatar Oct 10 '22 14:10 RuriRyan

@jbartosik AFAICT all open review comments have either been fixed or answered

avorima avatar Oct 10 '22 14:10 avorima

We'd also love to see this PR move forward and eventually get merged.

/cc @jbartosik

timoreimann avatar Oct 26 '22 12:10 timoreimann

@timoreimann looks good to me

There's one thing remaining - checking if this works end to end.

Did you run any (manual?) tests to verify this works as intended?

Can you provide an e2e test that will ensure this keeps working?

I thought that I could take OOM test scenario and modify it:

  1. Same scenario but with PDB preventing VPA from evicting any pods
  2. Same scenario with PDB and VPA allowed to delete, verify that recommendation is applied

I'm not sure when I'll be able to do that. I'm worried about merging this without any verification that this works as intended end to end

jbartosik avatar Oct 27 '22 12:10 jbartosik

@timoreimann I want to merge this but need some verification that this works e2e. I see the following options:

  1. Someone writes e2e sends PR so that we can merge them at the same time
  2. Someone verifies this works e2e manually, we merge then add e2e test later

I can contribute e2e tests but I'm not sure when. I don't think 2. is significantly faster than 1.

jbartosik avatar Oct 27 '22 12:10 jbartosik

Writing an e2e tests sounds like the right thing to do. I'm happy to look into that when I have some spare time unless someone beats me to it.

timoreimann avatar Oct 28 '22 19:10 timoreimann

I just saw unhealthyPodEvictionPolicy was added for PDBs (see KEP and API change in PDB).

We also saw the original issue happen quite a few times and would want to get this fixed, but the change to PDB seems to solve the problem in a more "k8s native" way instead of work-arounding with the delete call.

@jbartosik @avorima @timoreimann wdyt?

voelzmo avatar Nov 18 '22 10:11 voelzmo

@voelzmo thanks for sharing the KEP. I wasn't aware of it, this does sound like the right approach going forward.

That said, I think there's value in driving this PR to completion regardless: the solution in the KEP will only be available in Kubernetes 1.26 in alpha. Users on older versions will still want to have some kind of solution (at least we do given a large portion of our fleet is affected by the problem regularly). It also seems possible to transition from what this PR proposes to what the KEP outlines by checking for the new field in the PDB once available and honor its value with higher precedence. At least, that's how I understand it should work.

Just my 2 cents of course, final call is to be made by @jbartosik. Coincidentally, I have started working on extending the end-to-end tests as proposed just earlier today, so I could probably add the last missing piece fairly soon.

timoreimann avatar Nov 18 '22 10:11 timoreimann

On second thought, the feature from this PR will likely only be backported so far (if at all), and / or more recent versions of VPA may be incompatible with older Kubernetes versions? So maybe there is only so much value users of older Kubernetes releases could get out of this PR?

I'm not super familiar with VPA's / CA's release policy and compatibility guarantees, so throwing out additional thoughts mostly. @jbartosik still to the rescue for more clarity and preferences. 🙂

EDIT: had to remind myself that VPA is versioned separately from CA and based on its own CRD, so perhaps my second thought concerns aren't as concerning after all, and continuing with the PR would still be beneficial.

timoreimann avatar Nov 18 '22 11:11 timoreimann

Thanks for sharing @voelzmo, that looks promising.

This PR and the VPA KEP were just created because, at the time, it seemed like this was never going to make it into the PDB spec, so I would actually be fine with closing it now. The short time that it would be useful would probably be outweighed by the effort it takes to maintain it. Only speaking for myself of course.

avorima avatar Nov 18 '22 16:11 avorima

I'm happy to use PDB features instead of working around PDB in VPA. AlwaysAllow policy sounds like it does what this was intended to do (from https://github.com/kubernetes/kubernetes/pull/113375/):

// AlwaysAllow policy means that all running pods (status.phase="Running"), // but not yet healthy are considered disrupted and can be evicted regardless // of whether the criteria in a PDB is met. This means perspective running // pods of a disrupted application might not get a chance to become healthy. // Healthy pods will be subject to the PDB for eviction.

jbartosik avatar Nov 22 '22 14:11 jbartosik

Ok, what about the KEP? I'm not familiar with the process, but I suppose it needs to be removed or updated to document this decision.

avorima avatar Nov 22 '22 15:11 avorima

Ok, what about the KEP? I'm not familiar with the process, but I suppose it needs to be removed or updated to document this decision.

I added an item about this to the next SIG meeting

jbartosik avatar Nov 23 '22 13:11 jbartosik

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 21 '23 14:02 k8s-triage-robot

/close

avorima avatar Feb 21 '23 15:02 avorima

@avorima: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 21 '23 15:02 k8s-ci-robot