MCM does not reset `.status.failedMachines` of MachineDeployment
What happened:
MCM does not update the .status.failedMachines of the MachineDeployment after the .status.lastOperation of a Machine changes, e.g. from Failed to Processing once the credentials have been fixed:
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2020-04-29T06:51:38Z"
    lastUpdateTime: "2020-04-29T06:51:38Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  failedMachines:
  - lastOperation:
      description: 'Failed to list VMs while deleting the machine "shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5"
        AuthFailure: AWS was not able to validate the provided access credentials
        status code: 401, request id: 6e99231c-654e-4b05-8801-310e3532b4e9'
      lastUpdateTime: "2020-04-29T06:53:33Z"
      state: Failed
      type: Delete
    name: shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5
    ownerRef: shoot--foo--bar-cpu-worker-z1-5cdcb46f64
  observedGeneration: 2
The corresponding Machine object, however, already reports:

spec:
  class:
    kind: AWSMachineClass
    name: shoot--foo--bar-cpu-worker-z1-ff76e
  nodeTemplate:
    metadata:
      creationTimestamp: null
      labels:
        node.kubernetes.io/role: node
        worker.garden.sapcloud.io/group: cpu-worker
        worker.gardener.cloud/pool: cpu-worker
    spec: {}
  providerID: aws:///eu-west-1/i-05f4737c3ef646f89
status:
  currentStatus:
    lastUpdateTime: "2020-04-29T07:41:44Z"
    phase: Pending
    timeoutActive: true
  lastOperation:
    description: Creating machine on cloud provider
    lastUpdateTime: "2020-04-29T07:41:44Z"
    state: Processing
    type: Create
  node: ip-10-250-9-55.eu-west-1.compute.internal
(compare the timestamps: the failedMachines entry is from 06:53:33, while the Machine's lastOperation was last updated at 07:41:44 and is no longer Failed)
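To make the mismatch concrete, here is the kind of staleness check one could run against these two objects. This is only a sketch; the types below are simplified, hypothetical stand-ins for the fields shown in the YAML above, not the real machine-controller-manager API structs:

```go
package staleness

import "time"

// Simplified, hypothetical stand-ins for the fields shown in the YAML above;
// they do not match the real machine-controller-manager API types.
type LastOperation struct {
	State          string // "Processing", "Successful", "Failed"
	LastUpdateTime time.Time
}

type FailedMachineEntry struct {
	Name          string
	LastOperation LastOperation
}

// isStale reports whether a failedMachines entry no longer reflects the
// machine's current lastOperation: the machine has moved on (newer update
// time) and is not in the Failed state anymore.
func isStale(entry FailedMachineEntry, current LastOperation) bool {
	return current.LastUpdateTime.After(entry.LastOperation.LastUpdateTime) &&
		current.State != "Failed"
}
```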
What you expected to happen:
The .status.failedMachines field is updated whenever the .status.lastOperation of a Machine changes.
Dupe of https://github.com/gardener/machine-controller-manager/issues/476. Closing in favor of the other.
/close
#476 is about the failed machine metric counter, whereas this issue is about resetting the status field. These appear to be different issues, and hence are not duplicates.
Reopening this.
@rfranzke currently the .status.failedMachines field in the machine deployment is reset only when all the machines controlled by the machine deployment are healthy (running and have joined the cluster). Until then it contains the latest set of failed machine operations.
Refer:
https://github.com/gardener/machine-controller-manager/blob/e21b931ce25cc57b4267fce80e69839179854c20/pkg/controller/deployment_machineset_util.go#L128-L135
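In other words, the reset logic is roughly the following. This is a simplified sketch with made-up types that only model the fields needed here; it is not the actual code behind the link above:

```go
package currentbehaviour

// Made-up, minimal stand-ins for the MCM objects; only the fields needed to
// illustrate the reset behaviour are modelled here.
type Machine struct {
	Name  string
	Phase string // "Pending", "Running", "Failed", ...
}

type MachineSummary struct {
	Name string
}

// syncFailedMachines mirrors the current behaviour described above:
// .status.failedMachines is cleared only once every owned machine is Running;
// until then the previously recorded failures are carried over unchanged,
// even if the corresponding machines have long recovered.
func syncFailedMachines(owned []Machine, previous []MachineSummary) []MachineSummary {
	for _, m := range owned {
		if m.Phase != "Running" {
			return previous // not fully healthy yet: keep the old entries as-is
		}
	}
	return nil // all machines healthy: reset the field
}
```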
Otherwise, for some failure scenarios it can become difficult to catch the failed operations and their reasons, because the create and delete operations happen very quickly. As a quick experiment, return a random failure here: https://github.com/gardener/machine-controller-manager/blob/e21b931ce25cc57b4267fce80e69839179854c20/pkg/controller/machine_bootstrap_token.go#L56
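The experiment means something along these lines. The helper below is hypothetical and does not exist in the repository; where exactly it would be called inside the bootstrap-token path is up to whoever runs the experiment:

```go
package faultinject

import (
	"errors"
	"math/rand"
)

// maybeFail returns an error roughly half of the time. Calling it from the
// bootstrap-token code path referenced above makes machine creation fail
// randomly, which is enough to observe how quickly failed operations come
// and go in .status.failedMachines.
func maybeFail() error {
	if rand.Intn(2) == 0 {
		return errors.New("injected random failure (test only)")
	}
	return nil
}
```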
In the case reported in this issue, once the machine joins and the machine deployment becomes healthy (i.e. all machines have joined the cluster), the failedMachines field will be cleared.
Please let me know if the current behaviour causes any concrete problems; we can take corrective measures accordingly. Thanks.
I still think this should be improved, i.e., the status should reflect the actual state of the world (which it currently does not: the machine is in the process of joining the cluster and is no longer failed). It's not critical in the sense that we need it tomorrow, but it would improve user and ops experience/productivity to not show stale, misleading failure information.
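One possible direction, sketched with the same kind of simplified stand-in types as above (not a patch against the real code, just an illustration of the requested behaviour): recompute .status.failedMachines on every sync from the machines' current lastOperation, so that an entry disappears as soon as its machine is no longer Failed.

```go
package proposal

// Simplified, hypothetical stand-ins for the relevant MCM fields.
type LastOperation struct {
	State       string // "Processing", "Successful", "Failed"
	Description string
}

type Machine struct {
	Name          string
	LastOperation LastOperation
}

type MachineSummary struct {
	Name          string
	LastOperation LastOperation
}

// recomputeFailedMachines rebuilds the failedMachines list from the current
// state of the owned machines instead of carrying stale entries forward.
// A machine whose last operation moved from Failed back to Processing
// (e.g. after the credentials were fixed) is dropped immediately.
func recomputeFailedMachines(owned []Machine) []MachineSummary {
	var failed []MachineSummary
	for _, m := range owned {
		if m.LastOperation.State == "Failed" {
			failed = append(failed, MachineSummary{
				Name:          m.Name,
				LastOperation: m.LastOperation,
			})
		}
	}
	return failed
}
```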