
Erratic serialized drain if there is a large number of volumes attached per node

amshuman-kr opened this issue 5 years ago • 6 comments

What happened: In a provider like GCP or Azure (where a relatively large number of volumes are allowed to be attached per node), serialized eviction of pods with volumes while draining a node shows erratic behaviour. Most pod evictions (and the corresponding volume detachments) take between 4s and 15s. But if a large number of volumes is attached to a node (>= 40), sometimes (unpredictably) a bunch of pods are deleted (and their corresponding volumes detached) within a matter of 5ms-10ms.

Though the drain logic thinks that the pods' volumes are detached in a matter of milliseconds, in reality these volumes are not fully detached, and this causes disproportionate delays in the attachment of the volumes and the startup of the replacement pods.

What you expected to happen: The serialized eviction of pods should proceed normally irrespective of the number of pods with volumes per node.

How to reproduce it (as minimally and precisely as possible): Steps:

  1. Choose a Kubernetes cluster with nodes hosted in GCP
  2. Deploy a large number of pods with volumes (>=40) into a single node (e.g. using a combination of nodeAffinity, taints and tolerations).
  3. Delete the MCM Machine object backing the node on which the pods are hosted.
  4. Monitor the pod status, node status (especially, node.Status.VolumesAttached) and MCM logs.
  5. For the most part, the serialized eviction proceeds as designed, with an interval of anywhere between 4s and 15s per pod with volume. But sometimes a bunch of pods are evicted and their volumes are detached in a matter of milliseconds. This happens rarely and unpredictably. The erratic behaviour can be reproduced more reliably with an even larger number of pods with volumes (50 or more) per node. I have never seen this happen with <=20 volumes per node.

Anything else we need to know: MCM watches node.Status.VolumesAttached to check whether a volume has been detached after the corresponding pod has been evicted. But I have noticed inconsistencies in the updates to node.Status.VolumesAttached when there is a large number of volumes attached per node. Sometimes, after eviction of the pod, the corresponding volume gets removed too quickly from node.Status.VolumesAttached, but then it reappears in the array, only to disappear again. Sometimes it even makes a few such disappearances and reappearances before going away for good. In this case, MCM considers the volume to be detached at the first disappearance and moves on to the next pod eviction.
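For reference, here is a minimal Go sketch (not the actual MCM drain code) of the kind of check described above against node.Status.VolumesAttached; the helper name isVolumeAttached and the example volume name are made up for illustration:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// isVolumeAttached reports whether the given unique volume name still appears
// in the node's status. A single "false" here is what the drain logic treats
// as "detached" -- which is exactly the signal that flaps in this issue.
func isVolumeAttached(node *corev1.Node, volumeName corev1.UniqueVolumeName) bool {
	for _, av := range node.Status.VolumesAttached {
		if av.Name == volumeName {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical node status with one attached volume.
	node := &corev1.Node{
		Status: corev1.NodeStatus{
			VolumesAttached: []corev1.AttachedVolume{
				{Name: "kubernetes.io/csi/pd.csi.storage.gke.io^projects/p/zones/z/disks/disk-1"},
			},
		},
	}
	fmt.Println(isVolumeAttached(node, "kubernetes.io/csi/pd.csi.storage.gke.io^projects/p/zones/z/disks/disk-1"))
}
```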

Environment: provider: GCP or Azure

Approaches for resolution:

  1. Identify the race condition in upstream kubernetes or cloud provider controllers and contribute a fix there.
  2. Add an additional timeout to the drain logic in MCM, to confirm that a volume reported as detached stays detached in the node status (see the sketch after this list).
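A rough sketch of what approach 2 could look like, assuming a callback that re-reads the node object; the names waitForStableDetach, stabilizationWindow, and pollInterval are hypothetical and not part of MCM's actual drain code:

```go
package drain

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
)

func volumeListed(node *corev1.Node, name corev1.UniqueVolumeName) bool {
	for _, av := range node.Status.VolumesAttached {
		if av.Name == name {
			return true
		}
	}
	return false
}

// waitForStableDetach returns nil only if the volume stays absent from
// node.Status.VolumesAttached for the whole stabilization window; if it
// reappears (the flapping described in this issue), an error is returned
// and the caller can keep waiting instead of moving on to the next pod.
func waitForStableDetach(
	ctx context.Context,
	getNode func(context.Context) (*corev1.Node, error),
	volume corev1.UniqueVolumeName,
	stabilizationWindow, pollInterval time.Duration,
) error {
	deadline := time.Now().Add(stabilizationWindow)
	for time.Now().Before(deadline) {
		node, err := getNode(ctx)
		if err != nil {
			return err
		}
		if volumeListed(node, volume) {
			return fmt.Errorf("volume %s reappeared in node.Status.VolumesAttached", volume)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(pollInterval):
		}
	}
	return nil
}
```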

amshuman-kr avatar Jun 08 '20 12:06 amshuman-kr

/priority critical

hardikdr avatar Sep 07 '20 07:09 hardikdr

@ggaurav10 do you see any challenge in adding a minor delay after evicting the pods with volumes, to confirm that a detached volume is not flapping but gone for good? Also, how do you see the approach in general?

hardikdr avatar Oct 09 '20 18:10 hardikdr

TL;DR: Generally, the approach looks good.

Just thinking out loud: in the absence of the upstream fix, I think introducing a configurable delay should be helpful in controlling the eviction. It could even be enabled only when more than a certain number of volumes are attached, so that eviction on nodes with fewer volumes is not slowed down. This will also help in testing once k8s finally fixes the apparent race issue.

Just wondering if MCM should apply that delay only when it sees that the volume got detached "too quickly" (say, within 1 second).
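Purely as an illustration of that idea (none of these names or thresholds exist in MCM; volumeThreshold and fastDetachCutoff are placeholders), the conditional wait could look roughly like this:

```go
package drain

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

const (
	volumeThreshold  = 40              // only nodes with many volumes showed the flapping
	fastDetachCutoff = 1 * time.Second // "too quickly", per the comment above
)

// needsStabilizationWait decides whether to re-confirm the detach before
// moving on to the next pod eviction: only for nodes with many attached
// volumes, and only when the detach was observed suspiciously fast.
func needsStabilizationWait(node *corev1.Node, detachObservedAfter time.Duration) bool {
	return len(node.Status.VolumesAttached) >= volumeThreshold &&
		detachObservedAfter < fastDetachCutoff
}
```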

ggaurav10 avatar Oct 13 '20 12:10 ggaurav10

We discussed today that we will pick this up later, after the OOT for Azure is out. cc @AxiomSamarth.

hardikdr avatar Oct 14 '20 13:10 hardikdr

Right, @hardikdr. Now with kupid we can steer where we want to have our ETCDs and how many of them.

/priority normal

vlerenc avatar Oct 14 '20 17:10 vlerenc

To be fixed with https://github.com/gardener/machine-controller-manager/issues/621

prashanth26 avatar Jul 21 '21 05:07 prashanth26

This problem is solved now since, in the current drain code, we don't just wait for the volume detach, but also wait for the volume to attach to another node. So even if a volume transiently disappears from oldNode.Status.VolumesAttached, it doesn't matter much, since we also wait until it arrives in newNode.Status.VolumesAttached.
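In other words, a simplified sketch of the behaviour described above (not the literal drain code; hasVolume and volumeMoved are illustrative names):

```go
package drain

import corev1 "k8s.io/api/core/v1"

func hasVolume(node *corev1.Node, name corev1.UniqueVolumeName) bool {
	for _, av := range node.Status.VolumesAttached {
		if av.Name == name {
			return true
		}
	}
	return false
}

// volumeMoved returns true only when the volume has left the old node's status
// and appeared on the node hosting the replacement pod, so a transient
// disappearance on the old node alone is not treated as a completed detach.
func volumeMoved(oldNode, newNode *corev1.Node, volume corev1.UniqueVolumeName) bool {
	return !hasVolume(oldNode, volume) && hasVolume(newNode, volume)
}
```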

elankath avatar Feb 14 '23 08:02 elankath

/close as per the explanation given by Tarun above

himanshu-kun avatar Feb 14 '23 08:02 himanshu-kun