
Add a flag to rolling update to fail immediately on IG error

Open jandersen-plaid opened this issue 2 years ago • 6 comments

Fixes #14176

Add a flag to `kops rolling-update cluster` that exits the rolling update as soon as it encounters an error with an instance group that is normally updated serially (either APIServer or Node).

I have added a unit test which should fail if ExitOnFirstError is set to false, but please let me know if there is additional documentation or testing that I should add.
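For illustration, a minimal sketch of the behavior described above, not the kOps implementation. Only the name `ExitOnFirstError` comes from this PR description; `rollOptions`, `rollSerially`, and the group names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errUpdateFailed is a hypothetical error used only for this illustration.
var errUpdateFailed = errors.New("update failed")

// rollOptions is a hypothetical stand-in for the rolling-update options.
// Only the ExitOnFirstError name is taken from the PR description.
type rollOptions struct {
	ExitOnFirstError bool
}

// rollSerially updates the named groups one at a time by calling update.
// It returns the groups it attempted and every error it collected.
func rollSerially(groups []string, opts rollOptions, update func(string) error) ([]string, []error) {
	var attempted []string
	var errs []error
	for _, g := range groups {
		attempted = append(attempted, g)
		if err := update(g); err != nil {
			errs = append(errs, fmt.Errorf("instance group %q: %w", g, err))
			if opts.ExitOnFirstError {
				break // stop before touching the remaining groups
			}
		}
	}
	return attempted, errs
}

func main() {
	// Simulate an update that fails only for the second group.
	failSecond := func(g string) error {
		if g == "nodes-b" {
			return errUpdateFailed
		}
		return nil
	}
	groups := []string{"nodes-a", "nodes-b", "nodes-c"}

	attempted, errs := rollSerially(groups, rollOptions{ExitOnFirstError: true}, failSecond)
	fmt.Println(attempted, len(errs)) // [nodes-a nodes-b] 1

	attempted, errs = rollSerially(groups, rollOptions{ExitOnFirstError: false}, failSecond)
	fmt.Println(attempted, len(errs)) // [nodes-a nodes-b nodes-c] 1
}
```

With the option set, the third group is never attempted; without it, the loop records the error and moves on.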

jandersen-plaid avatar Aug 26 '22 14:08 jandersen-plaid

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: jandersen-plaid (40caf71d9d8dee8061aa13b5ee3dc9ac47b4114b)

Welcome @jandersen-plaid!

It looks like this is your first PR to kubernetes/kops 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kops has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

k8s-ci-robot avatar Aug 26 '22 14:08 k8s-ci-robot

Hi @jandersen-plaid. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 26 '22 14:08 k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign johngmyers for approval by writing /assign @johngmyers in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Aug 26 '22 14:08 k8s-ci-robot

I'd rather not add a flag for this. I think it is enough to inspect the returned error and return directly if it does not make sense to continue.

olemarkus avatar Aug 26 '22 18:08 olemarkus

@jandersen-plaid: PR needs rebase.


k8s-ci-robot avatar Sep 15 '22 23:09 k8s-ci-robot

I think I'd prefer this be the default or only option.

/hold
/kind office-hours

johngmyers avatar Nov 24 '22 19:11 johngmyers

I believe the history was that the update on the IG would previously wait forever, not fail.

johngmyers avatar Nov 24 '22 19:11 johngmyers

If a control plane IG fails, we already return an error directly. That is by far the most important behavior. For other IGs, it makes sense that when one fails, kOps moves on to the next one and continues to gracefully drain and terminate nodes.

But in the case of a validation error, it doesn't make sense to keep going as kOps won't succeed with the next IG either.
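As a rough sketch of the distinction drawn above, under stated assumptions: `validationError`, `errDrain`, and `shouldAbort` are hypothetical names and do not reflect the actual kOps types. The idea is to inspect the returned error and stop only when continuing cannot succeed.

```go
package main

import (
	"errors"
	"fmt"
)

// validationError is a hypothetical type marking a cluster validation failure.
type validationError struct{ reason string }

func (e *validationError) Error() string {
	return "cluster did not validate: " + e.reason
}

// errDrain is a hypothetical non-validation failure used for illustration.
var errDrain = errors.New("failed to drain node")

// shouldAbort reports whether the rolling update should stop after err.
// controlPlane says whether the failing instance group is a control plane group.
func shouldAbort(err error, controlPlane bool) bool {
	if err == nil {
		return false
	}
	if controlPlane {
		return true // a control plane failure always ends the update
	}
	// A validation failure will also affect the next group, so stop here.
	var ve *validationError
	return errors.As(err, &ve)
}

func main() {
	wrapped := fmt.Errorf("instance group nodes-a: %w", &validationError{reason: "node not ready"})

	fmt.Println(shouldAbort(errDrain, false)) // false
	fmt.Println(shouldAbort(errDrain, true))  // true
	fmt.Println(shouldAbort(wrapped, false))  // true
}
```

Using `errors.As` means the check still works when the validation error has been wrapped with extra context on its way up the call stack.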

olemarkus avatar Nov 25 '22 18:11 olemarkus

/ok-to-test

johngmyers avatar Dec 03 '22 23:12 johngmyers

/retest

johngmyers avatar Dec 03 '22 23:12 johngmyers

/retest

johngmyers avatar Dec 04 '22 02:12 johngmyers

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: olemarkus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Jan 09 '23 12:01 k8s-ci-robot

@johngmyers you still want to hold this one?

olemarkus avatar Jan 09 '23 12:01 olemarkus

/hold cancel

johngmyers avatar Jan 10 '23 05:01 johngmyers

/retest

johngmyers avatar Jan 10 '23 07:01 johngmyers