machine-controller-manager icon indicating copy to clipboard operation
machine-controller-manager copied to clipboard

Error when updating Machine status

Open timuthy opened this issue 4 years ago • 6 comments

How to categorize this issue?

/area robustness /kind bug /priority 3

What happened: The machine status cannot be updated for the following reason:

Update failed for machine "shoot--dev--shoot-cpu-worker-z1-ff5c8-48vbl". Retrying, error: "Machine.machine.sapcloud.io \"shoot--dev--shoot-cpu-worker-z1-ff5c8-48vbl\" is invalid: status.conditions.lastHeartbeatTime: Invalid value: \"null\": status.conditions.lastHeartbeatTime in body must be of type string: \"null\""

The issue disappears as soon as the lastHeartbeatTime of the respective node conditions are maintained.

What you expected to happen: The status update should work from the beginning.

How to reproduce it (as minimally and precisely as possible):

  • Apply the CRD definition of the main branch (e096125464b441d9ff403803c285ed206877d08c)
  • Create a new Machine

Possible soln Patch() seems to solve the issue rather than UpdateStatus() or Update() . Look at https://github.com/gardener/gardener/issues/5376

timuthy avatar Aug 24 '21 13:08 timuthy

@timuthy Label area/todo does not exist.

gardener-robot avatar Aug 24 '21 13:08 gardener-robot

@himanshu-kun Label area/todo does not exist.

gardener-robot avatar Feb 02 '22 11:02 gardener-robot

Btw I managed to find this issue today that seems loosely relevant https://github.com/kubernetes/kubernetes/issues/99804 Also can we try to reproduce this error for the MCM on kubernetes 1.22.2 (as I could not reproduce the error from #5376 on it), then we know that this depends on the kubernetes version

plkokanov avatar Feb 02 '22 12:02 plkokanov

Shouldn't the metav1.Time fields in MachineStatus be pointers if they are tagged with omitempty?

timebertt avatar Feb 03 '22 08:02 timebertt

Dug up this issue as well: https://github.com/kubernetes/kubernetes/issues/86811 which I guess is the same problem. This comment suggest the same as @timebertt

As far as I understand from this comment the issue should be fixed in k/k 1.20 (I only managed to reproduce this issue on 1.18 and 1.19) Although, I don't think its good enough for us, If this indeed is only fixed from 1.20 onwards :/

Update: Read a bit more and it seems like we can add the // +nullable to the metav1.Time fields, going to test it out real quick

plkokanov avatar Feb 03 '22 09:02 plkokanov

Before updating the machine crd with one generated with // +nullable for the metav1.Time fields the following call returns the error from the issue (btw when metav1.Time is not set, it is represented as null when marshalled to json):

k gardener patch-status \
  -n shoot--garden--migration3 machine shoot--garden--migration3-gardenlinux-z1-6fb56-q4hmx \
  --type=json --patch='[{ "op": "replace", "path": "/status", "value": {"lastOperation":{"lastUpdateTime":null}}}]'

After updating the crd so that metav1.Time fields are nullable the above call successfully modifies the status.

I could not reproduce the issue in both cases when using --type=merge for the patch. E.g.:

k gardener patch-status \
  -n shoot--garden--migration3 machine shoot--garden--migration3-gardenlinux-z1-6fb56-q4hmx  \
  --type=merge --patch='{"status":{"lastOperation":{"lastUpdateTime":null}}}'

plkokanov avatar Feb 03 '22 12:02 plkokanov

/close The issue is only seen in control cluster with k8s version < 1.20 (https://github.com/kubernetes/kubernetes/issues/86811) . For g/g we have already moved to much latest releases of k8s. So we won't be fixing this. If you are using mcm without gardener , then switch you control cluster to latest k8s version.

himanshu-kun avatar Feb 15 '23 05:02 himanshu-kun