machine-controller-manager
machine-controller-manager copied to clipboard
Error when updating Machine status
How to categorize this issue?
/area robustness /kind bug /priority 3
What happened: The machine status cannot be updated for the following reason:
Update failed for machine "shoot--dev--shoot-cpu-worker-z1-ff5c8-48vbl". Retrying, error: "Machine.machine.sapcloud.io \"shoot--dev--shoot-cpu-worker-z1-ff5c8-48vbl\" is invalid: status.conditions.lastHeartbeatTime: Invalid value: \"null\": status.conditions.lastHeartbeatTime in body must be of type string: \"null\""
The issue disappears as soon as the lastHeartbeatTime of the respective node conditions are maintained.
What you expected to happen: The status update should work from the beginning.
How to reproduce it (as minimally and precisely as possible):
- Apply the CRD definition of the main branch (e096125464b441d9ff403803c285ed206877d08c)
- Create a new
Machine
Possible soln Patch() seems to solve the issue rather than UpdateStatus() or Update() . Look at https://github.com/gardener/gardener/issues/5376
@timuthy Label area/todo does not exist.
@himanshu-kun Label area/todo does not exist.
Btw I managed to find this issue today that seems loosely relevant https://github.com/kubernetes/kubernetes/issues/99804 Also can we try to reproduce this error for the MCM on kubernetes 1.22.2 (as I could not reproduce the error from #5376 on it), then we know that this depends on the kubernetes version
Shouldn't the metav1.Time fields in MachineStatus be pointers if they are tagged with omitempty?
Dug up this issue as well: https://github.com/kubernetes/kubernetes/issues/86811 which I guess is the same problem. This comment suggest the same as @timebertt
As far as I understand from this comment the issue should be fixed in k/k 1.20 (I only managed to reproduce this issue on 1.18 and 1.19) Although, I don't think its good enough for us, If this indeed is only fixed from 1.20 onwards :/
Update:
Read a bit more and it seems like we can add the // +nullable to the metav1.Time fields, going to test it out real quick
Before updating the machine crd with one generated with // +nullable for the metav1.Time fields the following call returns the error from the issue (btw when metav1.Time is not set, it is represented as null when marshalled to json):
k gardener patch-status \
-n shoot--garden--migration3 machine shoot--garden--migration3-gardenlinux-z1-6fb56-q4hmx \
--type=json --patch='[{ "op": "replace", "path": "/status", "value": {"lastOperation":{"lastUpdateTime":null}}}]'
After updating the crd so that metav1.Time fields are nullable the above call successfully modifies the status.
I could not reproduce the issue in both cases when using --type=merge for the patch. E.g.:
k gardener patch-status \
-n shoot--garden--migration3 machine shoot--garden--migration3-gardenlinux-z1-6fb56-q4hmx \
--type=merge --patch='{"status":{"lastOperation":{"lastUpdateTime":null}}}'
/close The issue is only seen in control cluster with k8s version < 1.20 (https://github.com/kubernetes/kubernetes/issues/86811) . For g/g we have already moved to much latest releases of k8s. So we won't be fixing this. If you are using mcm without gardener , then switch you control cluster to latest k8s version.