cluster-api
:seedling: Bubble up machines permanent failure as MS/MD conditions
What this PR does / why we need it: There's a UX gap when it comes to signaling permanent machine failures in MachineSets and MachineDeployments. This PR solves that by bubbling up permanent Machine failures (i.e. FailureMessage/FailureReason) as a MachineSet and MachineDeployment condition: MachinesSucceededCondition.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes https://github.com/kubernetes-sigs/cluster-api/issues/5635
Needs https://github.com/kubernetes-sigs/cluster-api/pull/6025 for the conditions to be set in updateStatus
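For illustration, a minimal sketch of how the MachineSet controller could derive this condition from the permanent failures reported by its owned Machines; the MachinesSucceededCondition name comes from this PR, while the helper and reason string are assumptions, not the actual diff:

```go
// Illustrative sketch only: derive the proposed MachinesSucceededCondition on a
// MachineSet from the permanent failures (failureReason/failureMessage) reported
// by its owned Machines. The helper and reason string are assumptions.
package machineset

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// MachinesSucceededCondition is the MS/MD condition proposed by this PR.
const MachinesSucceededCondition clusterv1.ConditionType = "MachinesSucceeded"

// setMachinesSucceeded marks the condition false as soon as any owned Machine
// reports a permanent failure, and true otherwise.
func setMachinesSucceeded(ms *clusterv1.MachineSet, machines []*clusterv1.Machine) {
	for _, m := range machines {
		if m.Status.FailureReason == nil && m.Status.FailureMessage == nil {
			continue
		}
		msg := ""
		if m.Status.FailureMessage != nil {
			msg = *m.Status.FailureMessage
		}
		conditions.MarkFalse(ms, MachinesSucceededCondition, "MachineHasFailure",
			clusterv1.ConditionSeverityError,
			"Machine %s reports a permanent failure: %s", m.Name, msg)
		return
	}
	conditions.MarkTrue(ms, MachinesSucceededCondition)
}
```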
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: (none). To complete the pull request process, please ask for approval from enxebre after the PR has been reviewed.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/hold
PTAL @vincepri @fabriziopandini @sbueringer @CecileRobertMichon @Arvinderpal If we are happy with the approach I'll try to get the other PR merged, then will polish this one.
@enxebre first of all thanks for giving this problem a go; users can really benefit from it if we can get to an agreement.
I'm personally in favor of simplifying the user experience as much as possible, and thus:
- Use conditions to represent the current state and deprecate/remove failureMessage and failureReason
- Merge the evidence from failureMessage and failureReason into existing conditions, under the assumption that the failure happens during a part of the process that is already surfaced by an existing condition
e.g. if in CAPA there is a permanent failure while provisioning the VPC, it should surface into the VPCReady condition, with status false and severity error.
I think that having a separate condition could be really confusing for the user because, in the example above, we would have MachinesSucceeded reporting a fatal error provisioning the VPC, but VPCReady would report something else.
However, unfortunately there is no agreement on this topic, as per https://github.com/kubernetes-sigs/cluster-api/issues/3692; if I got it right, this is mostly due to how we should treat these conditions as permanent.
A possible way out might be to introduce a new severity level called "fatal", and make sure that util/conditions treats it as immutable/highest priority...
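For illustration, the CAPA example above could look roughly like this, assuming a VPCReady-style condition on the infra object (the exact CAPA identifier may differ) and the existing util/conditions helpers:

```go
// Illustrative sketch only: the provider surfaces a permanent failure into the
// existing condition that already covers that phase, instead of into
// failureReason/failureMessage. The condition name is assumed for the example.
package example

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

const VPCReadyCondition clusterv1.ConditionType = "VPCReady"

func markVPCProvisioningFailed(infraCluster conditions.Setter, err error) {
	// Severity Error marks this as a non-recoverable problem; the hypothetical
	// "fatal" severity floated above would require changes to util/conditions.
	conditions.MarkFalse(infraCluster, VPCReadyCondition, "VPCProvisioningFailed",
		clusterv1.ConditionSeverityError,
		"permanent failure while provisioning the VPC: %v", err)
}
```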
I'm not seeing how it'd be feasible to have a different Machine condition for every single permanent failure of every single provider, nor how to signal that in MS/MD without a single, provider-agnostic condition for permanent failures. The common factor is that the failure is permanent, so having a common semantic to express that, i.e. failureMessage/Reason, seems reasonable to me. Now, to bubble permanent failures up to MachineSet/MachineDeployment we need a single known source of truth (failureMessage/Reason) and a single provider-agnostic condition in MS/MD to signal them, hence the introduction of MachinesSucceededCondition (I didn't choose MachinesFailureCondition, to align with the positive nuance of all the other conditions). Agree?
So my proposal would be:
- Keep failureMessage/Reason in infraMachines.
- Introduce a new condition for MS/MD to bubble up and signal permanent failures, i.e. MachinesSucceededCondition (a rough sketch of the bubbling-up follows this list).
- Now, if we want to revisit moving failureMessage/Reason to conditions in infraMachines, we can do so transparently at any time, as long as we satisfy the contract with MS/MD and the failures are bubbled up through MachinesSucceededCondition.
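To make the MachineDeployment part concrete, here is a minimal sketch of how the condition could be bubbled up from the MachineSets, again assuming the MachinesSucceededCondition name from this PR; the merge/aggregation semantics are not decided here:

```go
// Illustrative sketch only: bubble the proposed MachinesSucceededCondition up
// from the owned MachineSets to the MachineDeployment by copying the first
// failed instance; real merge/aggregation semantics are still to be decided.
package machinedeployment

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

const MachinesSucceededCondition clusterv1.ConditionType = "MachinesSucceeded"

func bubbleUpMachinesSucceeded(md *clusterv1.MachineDeployment, machineSets []*clusterv1.MachineSet) {
	for _, ms := range machineSets {
		if conditions.IsFalse(ms, MachinesSucceededCondition) {
			// Copy the failed MachineSet condition as-is onto the MachineDeployment.
			conditions.Set(md, conditions.Get(ms, MachinesSucceededCondition))
			return
		}
	}
	conditions.MarkTrue(md, MachinesSucceededCondition)
}
```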
Thoughts?
Discussed in the community meeting on Wed 2 Mar 2022: using particular ConditionSeverities would be preferred over perpetuating failureMessage in infraMachines.
The workflow could look as follows:
- Machine controller bubbles up conditions with severity x from infraMachines into Machines.
- The upper-level controller (MS) can then look up these conditions by severity and set a condition "c" with SetAggregate logic for all owned Machines (see the sketch after this list).
- The upper-level controller (MD) can bubble up "c" from the MachineSets.
- Deprecation of failureMessage in infraMachines while providers are given time to satisfy the severity contract.
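A rough sketch of the MachineSet step using the existing util/conditions helpers; note that SetAggregate as it exists today aggregates the source objects' Ready conditions, so the severity-based lookup mentioned above would still need extra merge logic or new options, and the condition name here is just a placeholder:

```go
// Rough sketch of the MachineSet aggregation step only, using the existing
// util/conditions helpers. Condition name and function are placeholders;
// severity-based filtering would need additional merge options.
package machineset

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// MachinesCondition stands in for the condition "c" mentioned above.
const MachinesCondition clusterv1.ConditionType = "MachinesCondition"

func aggregateMachineConditions(ms *clusterv1.MachineSet, machines []*clusterv1.Machine) {
	// Aggregate the owned Machines' Ready conditions into condition "c" on the
	// MachineSet; AddSourceRef adds info about the originating Machine to the
	// aggregated reason.
	getters := make([]conditions.Getter, 0, len(machines))
	for _, m := range machines {
		getters = append(getters, m)
	}
	conditions.SetAggregate(ms, MachinesCondition, getters, conditions.AddSourceRef())
}
```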
@fabriziopandini @yastij does this align with your thoughts or were you thinking something different?
A question for my understanding
Machine controller bubbles up conditions with severity x from infraMachines into Machines.
Does this mean we take the InfraMachine condition and use it 1:1 in the Machine, i.e. we would e.g. get an AWS InstanceReady condition into the conditions array of the Machine and then bubble the same condition up 1:1 to the MachineSet etc.?
Or, alternatively, do we look at the conditions from the InfraMachine and aggregate them into a single condition in the Machine, and then bubble that up directly / aggregate it further when bubbling up?
Could you maybe give an example of how a specific failure condition of a specific provider would be bubbled up / reflected in the Machine/MachineSet/MachineDeployment?
Apologies if that was already discussed and I just missed it. I'm just having a hard time figuring out how exactly this would work.
@enxebre @sbueringer I have written some notes about how terminal failures should work at the end of https://docs.google.com/document/d/1hBQnWWa5d16FOslNhDwYVOhcMjLIul4tMeUgh4maI3w/edit#
Happy to chat about it and to rally to make it an amendment to the condition proposal. With regard to the current PR/issue, I think it is a fair ask to have failureReason/failureMessage bubble up Machine failures (as long as those fields exist, they should behave properly). However, I would decouple this fix from the work described in the doc, which is more long-term.
this looks stalled. how can we move this forward?
this looks stalled. how can we move this forward?
Let's try to find consensus on the doc and move forward with https://github.com/kubernetes-sigs/cluster-api/pull/6025 first.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
@vincepri: Closed this PR.
In response to this:
Closing this due to inactivity
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.