
🌱 clusterctl: add flag to skip lagging provider check in ApplyCustomPlan

Open · w21froster opened this pull request 1 year ago · 10 comments

What this PR does / why we need it:

Clusterctl runs a pre-check to verify that no other providers are lagging behind the target contract before creating an upgrade plan. In the current implementation of cluster-api-operator, a separate controller reconciles each provider type. Each of these controllers has no knowledge of the other providers and doesn't pass enough information to clusterctl to complete this check successfully. This PR adds a flag and an UpgradeOption that allow us to skip this pre-check and successfully upgrade the provider.
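To make the shape of the change concrete, here is a minimal sketch of how such an opt-out could be wired into an upgrade-apply call. All names here (`ApplyUpgradeOptions`, `SkipLaggingProviderCheck`, `checkAllProvidersAtContract`) are illustrative assumptions, not the actual clusterctl API:

```go
package main

import "fmt"

// ApplyUpgradeOptions is a hypothetical options struct for applying an
// upgrade plan. SkipLaggingProviderCheck disables the pre-flight check
// that every provider in the management cluster supports the target
// contract (the check this PR proposes making optional).
type ApplyUpgradeOptions struct {
	Contract                 string
	SkipLaggingProviderCheck bool
}

// ApplyUpgrade runs the pre-check unless the caller opted out, then
// proceeds with the upgrade.
func ApplyUpgrade(opts ApplyUpgradeOptions) error {
	if !opts.SkipLaggingProviderCheck {
		if err := checkAllProvidersAtContract(opts.Contract); err != nil {
			return err
		}
	}
	// ... create and apply the upgrade plan ...
	return nil
}

// checkAllProvidersAtContract is a stand-in for the real check; here it
// always fails, simulating a controller that cannot see the other providers.
func checkAllProvidersAtContract(contract string) error {
	return fmt.Errorf("provider X does not support contract %s", contract)
}

func main() {
	// With the check enabled, the upgrade is blocked.
	fmt.Println(ApplyUpgrade(ApplyUpgradeOptions{Contract: "v1beta1"}))
	// With the hypothetical skip option, the upgrade proceeds.
	fmt.Println(ApplyUpgrade(ApplyUpgradeOptions{Contract: "v1beta1", SkipLaggingProviderCheck: true}))
}
```

The point of the sketch is only that the check becomes conditional on a caller-supplied option, which is what per-provider controllers in cluster-api-operator would set.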

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): This fixes issue 570 in the cluster-api-operator repo.

w21froster avatar Sep 18 '24 23:09 w21froster

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign chrischdi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Sep 18 '24 23:09 k8s-ci-robot

Welcome @w21froster!

It looks like this is your first PR to kubernetes-sigs/cluster-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

k8s-ci-robot avatar Sep 18 '24 23:09 k8s-ci-robot

Hi @w21froster. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Sep 18 '24 23:09 k8s-ci-robot

/area clusterctl

w21froster avatar Sep 18 '24 23:09 w21froster

@JoelSpeed @Jont828 Please take a look when you are available 🙏

w21froster avatar Sep 23 '24 15:09 w21froster

@JoelSpeed @Jont828 Are you able to take a look? Let me know if you need more context on anything.

w21froster avatar Oct 07 '24 20:10 w21froster

Question: Could adding this and using it in the cluster-api-operator lead to issues?

Could it be possible to have providers then running in different contract versions which could maybe lead to issues?

Upgrading using clusterctl upgrades all providers at the same time instead of each one individually (otherwise some could still be running the old contract while others are already upgraded).

chrischdi avatar Oct 08 '24 18:10 chrischdi

> Question: Could adding this and using it in the cluster-api-operator lead to issues?
>
> Could it be possible to have providers then running in different contract versions which could maybe lead to issues?
>
> Upgrading using clusterctl upgrades all providers at the same time instead of each one individually (otherwise some could still be running the old contract while others are already upgraded).

I don't think this should be an issue. We talked about it in the cluster-api-operator office hours and determined that adding a flag in clusterctl to skip this check was probably the best way forward. We have a different CR for each provider, and when users upgrade their providers they typically move all versions at the same time. There could potentially be a delay between the reconciliations of each provider, but we haven't noticed any issues running this as a fork while upgrading the Azure CAPI/CAPBK/KCP providers.

Definitely open to better approaches though! I can stop by the CAPI office hours to discuss this issue we are having in more detail.

w21froster avatar Oct 11 '24 01:10 w21froster

I personally have some concerns about disabling this check, considering that the added value of clusterctl is to ensure the health of the management cluster as a whole.

TBH, I think that if someone asks the operator to upgrade a single provider, this operation must be put on hold if it could lead to an invalid cluster (leaning on "when users upgrade their providers they typically move all versions at the same time" seems weak).

The upgrade operation for the providers involved should unblock itself once the user has upgraded enough providers to reach a valid state.

The issue seems to be that "each one of these controllers doesn't have knowledge of the other providers, and doesn't pass in enough information to clusterctl to be able to complete this check successfully", but I think there are ways to work around this, since AFAIK each provider has a CR with a desired state/target version.
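The aggregation approach described above could be sketched roughly as follows: a controller reads the target version from every provider CR and holds the upgrade until all of them agree on the target contract. The type and function names (`providerCR`, `upgradeWouldBeValid`) are illustrative assumptions, not the operator's real API:

```go
package main

import "fmt"

// providerCR is a hypothetical, flattened view of one provider's custom
// resource: just its name and the contract implied by its desired version.
type providerCR struct {
	Name           string
	TargetContract string
}

// upgradeWouldBeValid reports whether upgrading now would leave the
// management cluster consistent: every provider must target the same
// contract. If any provider lags, the caller should hold the upgrade
// instead of skipping the check.
func upgradeWouldBeValid(crs []providerCR, target string) bool {
	for _, cr := range crs {
		if cr.TargetContract != target {
			return false
		}
	}
	return true
}

func main() {
	crs := []providerCR{
		{Name: "core", TargetContract: "v1beta1"},
		{Name: "infra-azure", TargetContract: "v1beta1"},
		{Name: "bootstrap-kubeadm", TargetContract: "v1alpha4"}, // lagging
	}
	fmt.Println(upgradeWouldBeValid(crs, "v1beta1")) // false: hold the upgrade
}
```

Under this sketch the operator never needs to disable the check; it only delays acting on a single CR until the full set of CRs describes a valid end state.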

fabriziopandini avatar Oct 18 '24 09:10 fabriziopandini

Hey @fabriziopandini, sorry for the delayed response, and thank you for providing more context on this check. We don't want users to be able to break their cluster through a misconfiguration, so I think a PR should be made against the CAPI operator instead of CAPI to get this check to pass. I will go ahead and close this PR.

w21froster avatar Nov 05 '24 19:11 w21froster