cluster-api
:seedling: Bubble up machines permanent failure as MS/MD conditions
What this PR does / why we need it: There's a UX gap when it comes to signaling permanent machine failures in MachineSets and MachineDeployments. This PR solves that by bubbling up permanent Machine failures (i.e. FailureMessage/FailureReason) as a MachineSet and MachineDeployment condition: MachinesSucceededCondition.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes https://github.com/kubernetes-sigs/cluster-api/issues/5635
Needs https://github.com/kubernetes-sigs/cluster-api/pull/6025 for the conditions to be set in updateStatus
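For illustration, a minimal sketch of how the MachineSet controller could derive this condition from the permanent failures reported by its owned Machines; the MachinesSucceededCondition name comes from this PR, while the helper and reason string are assumptions, not the actual diff:

```go
// Illustrative sketch only: derive the proposed MachinesSucceededCondition on a
// MachineSet from the permanent failures (failureReason/failureMessage) reported
// by its owned Machines. The helper and reason string are assumptions.
package machineset

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// MachinesSucceededCondition is the MS/MD condition proposed by this PR.
const MachinesSucceededCondition clusterv1.ConditionType = "MachinesSucceeded"

// setMachinesSucceeded marks the condition false as soon as any owned Machine
// reports a permanent failure, and true otherwise.
func setMachinesSucceeded(ms *clusterv1.MachineSet, machines []*clusterv1.Machine) {
	for _, m := range machines {
		if m.Status.FailureReason == nil && m.Status.FailureMessage == nil {
			continue
		}
		msg := ""
		if m.Status.FailureMessage != nil {
			msg = *m.Status.FailureMessage
		}
		conditions.MarkFalse(ms, MachinesSucceededCondition, "MachineHasFailure",
			clusterv1.ConditionSeverityError,
			"Machine %s reports a permanent failure: %s", m.Name, msg)
		return
	}
	conditions.MarkTrue(ms, MachinesSucceededCondition)
}
```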
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: (none). To complete the pull request process, please ask for approval from enxebre after the PR has been reviewed.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/hold
PTAL @vincepri @fabriziopandini @sbueringer @CecileRobertMichon @Arvinderpal If we are happy with the approach I'll try to get the other PR merged, then will polish this one.
@enxebre first of all thanks for giving this problem a go; users can really benefit from it if we can get to an agreement.
I'm personally in favor of simplifying the user experience as much as possible, and thus:
- Use conditions to represent the current state and deprecate/remove failureMessage and failureReason
- Merge the evidence from failureMessage and failureReason into existing conditions, under the assumption that the failure happens during a part of the process that is already surfaced by an existing condition
e.g. if in CAPA there is a permanent failure while provisioning the VPC, it should surface into the VPCReady condition, with status false and severity error.
I think that having a separate condition could be really confusing for the user because, in the example above, we would have MachinesSucceeded reporting a fatal error provisioning the VPC, but VPCReady would report something else.
However, unfortunately there is no agreement on this topic, as per https://github.com/kubernetes-sigs/cluster-api/issues/3692; if I got it right, this is mostly due to how we should treat these conditions as permanent.
A possible way out might be to introduce a new severity level called "fatal", and make sure that util/conditions treats it as immutable/highest priority...
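For illustration, the CAPA example above could look roughly like this, assuming a VPCReady-style condition on the infra object (the exact CAPA identifier may differ) and the existing util/conditions helpers:

```go
// Illustrative sketch only: the provider surfaces a permanent failure into the
// existing condition that already covers that phase, instead of into
// failureReason/failureMessage. The condition name is assumed for the example.
package example

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

const VPCReadyCondition clusterv1.ConditionType = "VPCReady"

func markVPCProvisioningFailed(infraCluster conditions.Setter, err error) {
	// Severity Error marks this as a non-recoverable problem; the hypothetical
	// "fatal" severity floated above would require changes to util/conditions.
	conditions.MarkFalse(infraCluster, VPCReadyCondition, "VPCProvisioningFailed",
		clusterv1.ConditionSeverityError,
		"permanent failure while provisioning the VPC: %v", err)
}
```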
I'm not seeing how it'd be feasible to have a different Machine condition for every single permanent failure of every single provider, nor how to signal that in MS/MD without a single, provider-agnostic condition for permanent failures. The common factor is that the failure is permanent, so having a common semantic to express that, i.e. failureMessage/Reason, seems reasonable to me. Now, to bubble permanent failures up to MachineSet/MachineDeployment we need a single known source of truth (failureMessage/Reason) and a single provider-agnostic condition in MS/MD to signal them, hence the introduction of MachinesSucceededCondition (I didn't choose MachinesFailureCondition, to align with the positive nuance of all the other conditions). Agree?
So my proposal would be:
- Keep failureMessage/Reason in infraMachines.
- Introduce a new condition for MS/MD to bubble up and signal permanent failures, i.e. MachinesSucceededCondition (a rough sketch of the bubbling-up follows this list).
- Now, if we want to revisit moving failureMessage/Reason to conditions in infraMachines, we can do so transparently at any time, as long as we satisfy the contract with MS/MD and the failures are bubbled up through MachinesSucceededCondition.
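To make the MachineDeployment part concrete, here is a minimal sketch of how the condition could be bubbled up from the MachineSets, again assuming the MachinesSucceededCondition name from this PR; the merge/aggregation semantics are not decided here:

```go
// Illustrative sketch only: bubble the proposed MachinesSucceededCondition up
// from the owned MachineSets to the MachineDeployment by copying the first
// failed instance; real merge/aggregation semantics are still to be decided.
package machinedeployment

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

const MachinesSucceededCondition clusterv1.ConditionType = "MachinesSucceeded"

func bubbleUpMachinesSucceeded(md *clusterv1.MachineDeployment, machineSets []*clusterv1.MachineSet) {
	for _, ms := range machineSets {
		if conditions.IsFalse(ms, MachinesSucceededCondition) {
			// Copy the failed MachineSet condition as-is onto the MachineDeployment.
			conditions.Set(md, conditions.Get(ms, MachinesSucceededCondition))
			return
		}
	}
	conditions.MarkTrue(md, MachinesSucceededCondition)
}
```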
Thoughts?
Discussed in the community meeting on Wed 2 Mar 2022: using particular ConditionSeverities would be preferred over perpetuating failureMessage in infraMachines.
The workflow could look as follows:
- Machine controller bubbles up conditions with severity x from infraMachines into Machines.
- The upper-level controller (MS) can then look up these conditions by severity and set a condition "c" with SetAggregate logic for all owned Machines (see the sketch after this list).
- The upper-level controller (MD) can bubble up "c" from the MachineSets.
- Deprecation of failureMessage in infraMachines while providers are given time to satisfy the severity contract.
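A rough sketch of the MachineSet step using the existing util/conditions helpers; note that SetAggregate as it exists today aggregates the source objects' Ready conditions, so the severity-based lookup mentioned above would still need extra merge logic or new options, and the condition name here is just a placeholder:

```go
// Rough sketch of the MachineSet aggregation step only, using the existing
// util/conditions helpers. Condition name and function are placeholders;
// severity-based filtering would need additional merge options.
package machineset

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// MachinesCondition stands in for the condition "c" mentioned above.
const MachinesCondition clusterv1.ConditionType = "MachinesCondition"

func aggregateMachineConditions(ms *clusterv1.MachineSet, machines []*clusterv1.Machine) {
	// Aggregate the owned Machines' Ready conditions into condition "c" on the
	// MachineSet; AddSourceRef adds info about the originating Machine to the
	// aggregated reason.
	getters := make([]conditions.Getter, 0, len(machines))
	for _, m := range machines {
		getters = append(getters, m)
	}
	conditions.SetAggregate(ms, MachinesCondition, getters, conditions.AddSourceRef())
}
```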
@fabriziopandini @yastij does this align with your thoughts or were you thinking something different?
A question for my understanding
Machine controller bubbles up conditions with severity x from infraMachines into Machines.
Does this mean we take the InfraMachine condition and use it 1:1 in the Machine, i.e. we would e.g. get an AWS InstanceReady condition into the conditions array of the Machine and then bubble the same condition up 1:1 to the MachineSet etc.?
Or, alternatively, do we look at the conditions from the InfraMachine and aggregate them into a single condition in the Machine, and then bubble that up directly / aggregate it further when bubbling up?
Could you maybe give an example of how a specific failure condition of a specific provider would be bubbled up / reflected in the Machine/MachineSet/MachineDeployment?
Apologies if that was already discussed and I just missed it. I'm just having a hard time figuring out how exactly this would work.
@enxebre @sbueringer I have written some notes about how terminal failures should work at the end of https://docs.google.com/document/d/1hBQnWWa5d16FOslNhDwYVOhcMjLIul4tMeUgh4maI3w/edit#
Happy to chat about it and to rally to make it an amendment to the condition proposal. With regard to the current PR/issue, I think it is a fair ask to have failureReason/failureMessage bubble up Machine failures (as long as those fields exist, they should behave properly). However, I would decouple this fix from the work described in the doc, which is more long-term.
this looks stalled. how can we move this forward?
this looks stalled. how can we move this forward?
Let's try to find consensus on the doc and move forward with https://github.com/kubernetes-sigs/cluster-api/pull/6025 first.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
@vincepri: Closed this PR.
In response to this:
Closing this due to inactivity
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.