machine-controller-manager
machine-controller-manager copied to clipboard
Update CrashLoopBackOff condition as soon as timeout expires
How to categorize this issue?
/area robustness /kind bug /priority 2
What happened:
Currently MCM doesn't turn CrashLoopBackoff(CLBF) machines to Failed as soon as creationTimeout expires , but delays have been observed which can range to 2min to any time.
There are 2 parts of the problem:
- timeout check for CLBF machine is done after making
CreateMachine()driver call .CreateMachine()itself could take any amount of time as it provisions VM on the cloudprovider (in Azure big delays like 30min or more have been seen before) - even if the 1) doesn't exist to contribute to the delay, the retry/re-push to the queue can also introduce the delay of
ShortRetry(3min). This happens because the retry period is not calculated considering thetime left before timeoutbut is just a constant value ofShortRetry(3min)
What you expected to happen:
Turn to Failed as soon as timeout expires.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
We could take inspiration from machineDeployment logic which turns Progressing condition to False with reason ProgressDeadlineExceeded as soon as the deadline exceeds.
Environment:
- Kubernetes version (use
kubectl version): - Cloud provider or hardware configuration:
- Others:
Grooming Discussion Results
- Our retry handling in machine creation is present in
machineCreateErrorHandlerinmachine_util.go. This method specifies a medium retry (3min) for enqueuing the machine object if the VM instance failed to be created. - The error handler method mentioned above leverages the helper method
getCreateFailurePhasewhich transitions the the machine phase:CrashLoopBackoff➡Failedon creation timeout. - We should ideally transition the machine phase to
Failedfaster if we are very close to the creation timeout and if the next retry added to the current time exceeds the creation timeout. - To ensure that the above works fine, we should establish that
Driver.CreateMachinereturns within a bounded time. We can do this by making sure there is a deadline on theContextpassed to theCreateMachinemethod. The deadline should be the time pending for us to reach machine creation timeout.
On further testing we saw that although this change is fine if MCM runs standalone, there is a scenario that when CA runs alongside this change can cause CAs backoff mechanism to not work.
The case we focussed on was CA backing off from a MachineDeployment if the Machine does not get ready and CA knows that there is another MachineDeployment whose Machine spec is sufficient and adequate for incoming workload. In normal circumstances, CA would backoff, scale down the MachineDeployment it had originally scaled up, and scale up a different MachineDeployment whose Machines would then be used.
With this change however (assuming MCMs --machine-creation-timeout and CAs --max-node-provision-time are the same), MCM will replace the Machine before CA has a chance to look at it and calculate it's backoff parameters. So from CAs point of view it won't see a node not in the Running state for longer than required for it to backoff from the MachineDeployment
Solution:
~* Set the --machine-creation-timeout in MCM to 0 by default so that unless set, MCM will not have a timeout for machine creation. Field can be set by the user in the shoot spec~
~* Document this behaviour and advice that MCMs --machine-creation-timeout should be a value greater then CAs --max-node-provision-time~
Updating the solution to be more specific and descriptive
-
Update
MCMto be able to interpret a negative or zero timeout value for--machine-creation-timeout. This means that if this flag is set to 0 or some negative value, thenMCMwill not have a timeout for machine creation. This does not change the default value for this flag, and if no value is passed,MCMwill still use it's current default -
Update
gardenletcode to set a negative value to--machine-creation-timeoutforMCMbe default if no value is set via the shoot spec. If a value is specified in the shoot spec, then the value specified is always used This behaviour needs to be documented and there needs to be a recommendation to setMCMs--machine-creation-timeoutvalue to be greater thanCAs--max-node-provision-timevalue. -
Validation needs to be added to the shoot to not allow
machineCreationTimeoutto be greater thanmaxNodeProvisionTime -
No additional change from CA is needed
Regarding Set the --machine-creation-timeout in MCM to 0 by default - is this OK for standalone consumers of the MCM ? Do we have any stakeholders where folks use the MCM but not our fork of the CA?
We should just add a doc that the default value is set to 0. There are 2 different recommendations:
- If MCM is used standalone without using CA then consumers should use a value > 0.
- If MCM is used along with CA then:
- It is recommended that you leave the default value of
machine-creation-timeoutto 0. - If there is a real need to set this value then
machine-creation-timeoutshould be > CA'smax-node-provision-time
- It is recommended that you leave the default value of