machine-controller-manager Move drain logic into a separate controller

How to categorize this issue?

/area dev-productivity /kind enhancement /priority 3

What would you like to be added: We would like to move out the drain logic of the provider machine controller into a separate controller. With this draining logic would no longer be a part of the core machine reconcile logic It would rather be a different controller taking care of this logic. The machine controller and drain controller can use the machine CRD status/labels to coordinate the handover.

[ ] Separate out the drain logic into a new controller.
[ ] Switch to use of volume attachments resource while detaching volumes https://github.com/gardener/machine-controller-manager/pull/608/files#r644498475
[ ] Eliminate the use of goroutines to coordinate event handlings. Preferably, convert long reconciliations with goroutines and waiting, into short reconciliations spread over time co-ordinated using some additional fields in the status of the Machine resource.
[x] Handle scenarios for erratic serialized drain - https://github.com/gardener/machine-controller-manager/issues/468
[ ] Enable force-deletion label to trigger force delete even if drain is ongoing. Preferably delete the VM itself

Why is this needed: To make the machine reconcile loop more efficient.

Jun 14 '21 06:06 prashanth26

/remove internal @prashanth26 I hope it's OK if I remove the roadmap label. With internal we rather mean of central importance to Gardener as a whole. All teams have (team-) internal tasks, but those are not reported in the global roadmap or else it becomes too large. Things like improved-drain-of-pods-with-volumes-v2 is internal and rightly so, because it has such a huge impact to everybody. That's kind of the differentiation. I hope, that's fine.

Jun 15 '21 13:06 vlerenc

/remove internal @prashanth26 I hope it's OK if I remove the roadmap label. With internal we rather mean of central importance to Gardener as a whole. All teams have (team-) internal tasks, but those are not reported in the global roadmap or else it becomes too large. Things like improved-drain-of-pods-with-volumes-v2 is internal and rightly so, because it has such a huge impact to everybody. That's kind of the differentiation. I hope, that's fine.

Sure Vedran. No problem, sounds good to me.

Jun 15 '21 13:06 prashanth26

This is also important if the machineDrainTimeout is changed during an ongoing drain e.g in Gardener for a worker pool. The only thing that helps is currently is to edit the machine and restart the MCM.

Nov 04 '21 15:11 danielfoehrKn

Another case to handle - If a machine drain is ongoing and a force-deletion:true label is added on the machine obj, then force deletion of machine should happen. Currently on shoot deletion force deletion label is added , but it doesn't force delete the machine, if machine is already draining or stuck in draining , and thus shoot deletion is also stuck (until drain timeout occurs which could be of days)

Dec 30 '21 05:12 himanshu-kun

Grooming decision: Too much refactoring needed in current code. Agree on drain being a separate controller. We should target this for controller-runtime port.

Feb 14 '23 08:02 elankath

machine-controller-manager machine-controller-manager copied to clipboard

Move drain logic into a separate controller

machine-controller-manager
machine-controller-manager copied to clipboard