MachinePool phase stuck in Failed state even if InfrastructureRef is healthy
What steps did you take and what happened:
When reconciling a MachinePool, if the InfrastructureRef object encounters an error during reconciliation in reconcileExternal(), the machine pool controller sets the MP fields Status.FailureReason and Status.FailureMessage here:
https://github.com/kubernetes-sigs/cluster-api/blob/main/exp/internal/controllers/machinepool_controller_phases.go#L149
Later, when the controller calls reconcilePhase(), it checks whether either of these fields is non-nil. If so, the MP phase is set to Failed:
https://github.com/kubernetes-sigs/cluster-api/blob/main/exp/internal/controllers/machinepool_controller_phases.go#L81
However, if the underlying InfrastructureRef object later reconciles successfully and reaches a healthy/running state, the machine pool controller never clears the MP fields Status.FailureReason / Status.FailureMessage in reconcileExternal(), so reconcilePhase() continues to erroneously mark the phase as Failed. Because these fields are never cleared, the MP can never leave phase=Failed even when all of the underlying infrastructure is running fine.
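The stuck state can be sketched with a few lines of Go. This is a deliberately simplified model (the types and the `reconcileExternalHealthy` helper are hypothetical stand-ins, not the real cluster-api API), but it captures the mechanism: the phase check forces Failed whenever either failure field is non-nil, and nothing on the success path ever nils them out.

```go
package main

// Simplified sketch of the MachinePool status fields involved.
type MachinePoolStatus struct {
	FailureReason  *string
	FailureMessage *string
	Phase          string
}

type MachinePool struct {
	Status MachinePoolStatus
}

// reconcilePhase mirrors the check described above: any non-nil
// failure field sets the phase to Failed, otherwise Running.
func reconcilePhase(mp *MachinePool) {
	if mp.Status.FailureReason != nil || mp.Status.FailureMessage != nil {
		mp.Status.Phase = "Failed"
		return
	}
	mp.Status.Phase = "Running"
}

// reconcileExternalHealthy stands in for a reconcileExternal() call
// that succeeds. Today it leaves the failure fields untouched, so a
// pool that failed once keeps reporting Failed on every reconcile.
func reconcileExternalHealthy(mp *MachinePool) {
	// Bug: Status.FailureReason / Status.FailureMessage are not cleared here.
}
```

Once `FailureMessage` is set by a transient error, repeated calls to `reconcileExternalHealthy` followed by `reconcilePhase` keep producing `Failed`.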
What did you expect to happen:
The MachinePool controller should clear the Status.FailureReason and Status.FailureMessage fields so that machine pools can recover from phase=Failed once the underlying infrastructure is no longer returning errors during reconcileExternal().
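A minimal sketch of that expected behavior, again with hypothetical simplified types and a hypothetical `clearFailureFields` helper: if a successful reconcileExternal() reset both fields, the next reconcilePhase() would move the pool back to Running.

```go
package main

// Simplified sketch of the MachinePool status fields involved.
type MachinePoolStatus struct {
	FailureReason  *string
	FailureMessage *string
	Phase          string
}

type MachinePool struct {
	Status MachinePoolStatus
}

// clearFailureFields resets the failure markers; calling this from a
// successful reconcileExternal() would let the pool recover.
func clearFailureFields(mp *MachinePool) {
	mp.Status.FailureReason = nil
	mp.Status.FailureMessage = nil
}

// reconcilePhase is the same check as in the controller: Failed while
// either failure field is set, Running otherwise.
func reconcilePhase(mp *MachinePool) {
	if mp.Status.FailureReason != nil || mp.Status.FailureMessage != nil {
		mp.Status.Phase = "Failed"
		return
	}
	mp.Status.Phase = "Running"
}
```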
Anything else you would like to add:
Confirmed that the MP status fields FailureReason/FailureMessage not being cleared is the issue. When I manually removed the Status.FailureMessage field, the subsequent reconciliation completed successfully and set the MP phase to "Running".
Environment:
- Cluster-api version: v1.0.0
- minikube/kind version:
- Kubernetes version (use `kubectl version`): 1.22
- OS (e.g. from /etc/os-release):
/kind bug
/remove-kind bug
/kind support
/triage accepted
Status.FailureReason / Status.FailureMessage must be used for terminal failures only, so it is not expected/supported for this error to go away. See the discussion on https://github.com/kubernetes-sigs/cluster-api/issues/7191
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/close due to inactivity
@fabriziopandini: Closing this issue.
In response to this:
/close due to inactivity
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.