MCM does not reset `.status.failedMachines` of MachineDeployment
What happened:
MCM does not update the .status.failedMachines of the MachineDeployment after the .status.lastOperation of a Machine changes, e.g. from Failed to Processing once the credentials have been fixed:
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2020-04-29T06:51:38Z"
    lastUpdateTime: "2020-04-29T06:51:38Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  failedMachines:
  - lastOperation:
      description: 'Failed to list VMs while deleting the machine "shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5"
        AuthFailure: AWS was not able to validate the provided access credentials
        status code: 401, request id: 6e99231c-654e-4b05-8801-310e3532b4e9'
      lastUpdateTime: "2020-04-29T06:53:33Z"
      state: Failed
      type: Delete
    name: shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5
    ownerRef: shoot--foo--bar-cpu-worker-z1-5cdcb46f64
  observedGeneration: 2
The corresponding Machine object, however, already reports:

spec:
  class:
    kind: AWSMachineClass
    name: shoot--foo--bar-cpu-worker-z1-ff76e
  nodeTemplate:
    metadata:
      creationTimestamp: null
      labels:
        node.kubernetes.io/role: node
        worker.garden.sapcloud.io/group: cpu-worker
        worker.gardener.cloud/pool: cpu-worker
    spec: {}
  providerID: aws:///eu-west-1/i-05f4737c3ef646f89
status:
  currentStatus:
    lastUpdateTime: "2020-04-29T07:41:44Z"
    phase: Pending
    timeoutActive: true
  lastOperation:
    description: Creating machine on cloud provider
    lastUpdateTime: "2020-04-29T07:41:44Z"
    state: Processing
    type: Create
  node: ip-10-250-9-55.eu-west-1.compute.internal
(compare the timestamps: the failedMachines entry is from 06:53:33, while the Machine's lastOperation was last updated at 07:41:44 and is no longer Failed)
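To make the mismatch concrete, here is the kind of staleness check one could run against these two objects. This is only a sketch; the types below are simplified, hypothetical stand-ins for the fields shown in the YAML above, not the real machine-controller-manager API structs:

```go
package staleness

import "time"

// Simplified, hypothetical stand-ins for the fields shown in the YAML above;
// they do not match the real machine-controller-manager API types.
type LastOperation struct {
	State          string // "Processing", "Successful", "Failed"
	LastUpdateTime time.Time
}

type FailedMachineEntry struct {
	Name          string
	LastOperation LastOperation
}

// isStale reports whether a failedMachines entry no longer reflects the
// machine's current lastOperation: the machine has moved on (newer update
// time) and is not in the Failed state anymore.
func isStale(entry FailedMachineEntry, current LastOperation) bool {
	return current.LastUpdateTime.After(entry.LastOperation.LastUpdateTime) &&
		current.State != "Failed"
}
```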
What you expected to happen:
The .status.failedMachines field is updated whenever the .status.lastOperation of a Machine changes.
Dupe of https://github.com/gardener/machine-controller-manager/issues/476. Closing in favor of the other.
/close
#476 is about the failed machine metric counter, whereas this issue is about resetting the status field. These appear to be different issues, and hence are not duplicates.
Reopening this.
@rfranzke currently the .status.failedMachines field in the machine deployment is reset only when all the machines controlled by the machine deployment are healthy (running and have joined the cluster). Until then it contains the latest set of failed machine operations.
Refer:
https://github.com/gardener/machine-controller-manager/blob/e21b931ce25cc57b4267fce80e69839179854c20/pkg/controller/deployment_machineset_util.go#L128-L135
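In other words, the reset logic is roughly the following. This is a simplified sketch with made-up types that only model the fields needed here; it is not the actual code behind the link above:

```go
package currentbehaviour

// Made-up, minimal stand-ins for the MCM objects; only the fields needed to
// illustrate the reset behaviour are modelled here.
type Machine struct {
	Name  string
	Phase string // "Pending", "Running", "Failed", ...
}

type MachineSummary struct {
	Name string
}

// syncFailedMachines mirrors the current behaviour described above:
// .status.failedMachines is cleared only once every owned machine is Running;
// until then the previously recorded failures are carried over unchanged,
// even if the corresponding machines have long recovered.
func syncFailedMachines(owned []Machine, previous []MachineSummary) []MachineSummary {
	for _, m := range owned {
		if m.Phase != "Running" {
			return previous // not fully healthy yet: keep the old entries as-is
		}
	}
	return nil // all machines healthy: reset the field
}
```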
Otherwise, for some failure scenarios it can become difficult to catch the failed operations and their reasons, because the create and delete operations happen very quickly. As a quick experiment, return a random failure here: https://github.com/gardener/machine-controller-manager/blob/e21b931ce25cc57b4267fce80e69839179854c20/pkg/controller/machine_bootstrap_token.go#L56
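The experiment means something along these lines. The helper below is hypothetical and does not exist in the repository; where exactly it would be called inside the bootstrap-token path is up to whoever runs the experiment:

```go
package faultinject

import (
	"errors"
	"math/rand"
)

// maybeFail returns an error roughly half of the time. Calling it from the
// bootstrap-token code path referenced above makes machine creation fail
// randomly, which is enough to observe how quickly failed operations come
// and go in .status.failedMachines.
func maybeFail() error {
	if rand.Intn(2) == 0 {
		return errors.New("injected random failure (test only)")
	}
	return nil
}
```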
In the case reported in this issue, once the machine joins and the machine deployment becomes healthy (i.e. all machines have joined the cluster), the failedMachines field will be cleared.
Please let me know if the current behaviour causes any concrete problems; we can take corrective measures accordingly. Thanks.
I still think this should be improved, i.e., the status should reflect the actual state of the world (which it currently does not: the machine is in the process of joining the cluster and is no longer failed). It's not critical in the sense that we need it tomorrow, but it would improve user and ops experience/productivity to not show stale, misleading failure information.
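One possible direction, sketched with the same kind of simplified stand-in types as above (not a patch against the real code, just an illustration of the requested behaviour): recompute .status.failedMachines on every sync from the machines' current lastOperation, so that an entry disappears as soon as its machine is no longer Failed.

```go
package proposal

// Simplified, hypothetical stand-ins for the relevant MCM fields.
type LastOperation struct {
	State       string // "Processing", "Successful", "Failed"
	Description string
}

type Machine struct {
	Name          string
	LastOperation LastOperation
}

type MachineSummary struct {
	Name          string
	LastOperation LastOperation
}

// recomputeFailedMachines rebuilds the failedMachines list from the current
// state of the owned machines instead of carrying stale entries forward.
// A machine whose last operation moved from Failed back to Processing
// (e.g. after the credentials were fixed) is dropped immediately.
func recomputeFailedMachines(owned []Machine) []MachineSummary {
	var failed []MachineSummary
	for _, m := range owned {
		if m.LastOperation.State == "Failed" {
			failed = append(failed, MachineSummary{
				Name:          m.Name,
				LastOperation: m.LastOperation,
			})
		}
	}
	return failed
}
```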