MachinePool phase stuck in Failed state even if InfrastructureRef is healthy
What steps did you take and what happened:
When reconciling a MachinePool, if the InfrastructureRef object encounters an error during reconciliation in reconcileExternal(), the machine pool controller sets the MP fields Status.FailureReason and Status.FailureMessage here:
https://github.com/kubernetes-sigs/cluster-api/blob/main/exp/internal/controllers/machinepool_controller_phases.go#L149
Later, when the controller calls reconcilePhase(), it checks whether either of these fields is non-nil. If so, the MP phase is set to Failed:
https://github.com/kubernetes-sigs/cluster-api/blob/main/exp/internal/controllers/machinepool_controller_phases.go#L81
However, if the underlying InfrastructureRef object later reconciles successfully and reaches a healthy/running state, the machine pool controller never clears the MP fields Status.FailureReason / Status.FailureMessage in reconcileExternal(), so reconcilePhase() continues to erroneously mark the phase as Failed. Because these fields are never cleared, the MP can never leave phase=Failed even when all of the underlying infrastructure is running fine.
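The stuck state can be sketched with a few lines of Go. This is a deliberately simplified model (the types and the `reconcileExternalHealthy` helper are hypothetical stand-ins, not the real cluster-api API), but it captures the mechanism: the phase check forces Failed whenever either failure field is non-nil, and nothing on the success path ever nils them out.

```go
package main

// Simplified sketch of the MachinePool status fields involved.
type MachinePoolStatus struct {
	FailureReason  *string
	FailureMessage *string
	Phase          string
}

type MachinePool struct {
	Status MachinePoolStatus
}

// reconcilePhase mirrors the check described above: any non-nil
// failure field sets the phase to Failed, otherwise Running.
func reconcilePhase(mp *MachinePool) {
	if mp.Status.FailureReason != nil || mp.Status.FailureMessage != nil {
		mp.Status.Phase = "Failed"
		return
	}
	mp.Status.Phase = "Running"
}

// reconcileExternalHealthy stands in for a reconcileExternal() call
// that succeeds. Today it leaves the failure fields untouched, so a
// pool that failed once keeps reporting Failed on every reconcile.
func reconcileExternalHealthy(mp *MachinePool) {
	// Bug: Status.FailureReason / Status.FailureMessage are not cleared here.
}
```

Once `FailureMessage` is set by a transient error, repeated calls to `reconcileExternalHealthy` followed by `reconcilePhase` keep producing `Failed`.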
What did you expect to happen:
The MachinePool controller should clear the Status.FailureReason and Status.FailureMessage fields so that machine pools can recover from phase=Failed once the underlying infrastructure is no longer returning errors during reconcileExternal().
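A minimal sketch of that expected behavior, again with hypothetical simplified types and a hypothetical `clearFailureFields` helper: if a successful reconcileExternal() reset both fields, the next reconcilePhase() would move the pool back to Running.

```go
package main

// Simplified sketch of the MachinePool status fields involved.
type MachinePoolStatus struct {
	FailureReason  *string
	FailureMessage *string
	Phase          string
}

type MachinePool struct {
	Status MachinePoolStatus
}

// clearFailureFields resets the failure markers; calling this from a
// successful reconcileExternal() would let the pool recover.
func clearFailureFields(mp *MachinePool) {
	mp.Status.FailureReason = nil
	mp.Status.FailureMessage = nil
}

// reconcilePhase is the same check as in the controller: Failed while
// either failure field is set, Running otherwise.
func reconcilePhase(mp *MachinePool) {
	if mp.Status.FailureReason != nil || mp.Status.FailureMessage != nil {
		mp.Status.Phase = "Failed"
		return
	}
	mp.Status.Phase = "Running"
}
```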
Anything else you would like to add:
Confirmed that the MP status fields FailureReason/FailureMessage not being cleared is the issue. When I manually removed the Status.FailureMessage field, the subsequent reconciliation completed successfully and set the MP phase to "Running".
Environment:
- Cluster-api version: v1.0.0
- minikube/kind version:
- Kubernetes version (use `kubectl version`): 1.22
- OS (e.g. from /etc/os-release):
/kind bug
/remove-kind bug
/kind support
/triage accepted
Status.FailureReason / Status.FailureMessage must be used for terminal failures only, so it is not expected/supported for this error to go away. See the discussion on https://github.com/kubernetes-sigs/cluster-api/issues/7191
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/close due to inactivity
@fabriziopandini: Closing this issue.
In response to this:
/close due to inactivity
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.