AKS Effect Node Drain when AKS worker node is unable to re-establish storage connectivity due to underlying platform issue.

Open sprab opened this issue 1 year ago • 5 comments

Is your feature request related to a problem? Please describe.

One of our AKS cluster nodes, which runs StatefulSets with PVCs configured, was unable to re-establish connectivity to storage after a platform issue, in this case an Event 17 between the physical host and the storage. As a result, the application pods went into a restart loop and could not mount their PVCs.

We chose to delete the deployment completely and redeploy the pods to get this working.

Describe the solution you'd like

When a platform-level failure impacts the Kubernetes layer, there should be a mechanism to detect it and drain the node(s) where the CSI/Blobfuse driver pods are running, so that the affected pods run on a healthy AKS node and storage connectivity is re-established.
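To make the request more concrete, here is a minimal sketch of what such a detect-and-drain check might look like if run from outside the cluster. It is not an existing AKS feature. The function names, the restart threshold, and the "at least two affected pods per node" rule are all assumptions chosen for illustration; the pod dictionaries are assumed to have the shape of the items returned by `kubectl get pods -o json`.

```python
from collections import defaultdict

# Both values are hypothetical and would need tuning for a real cluster.
RESTART_THRESHOLD = 5   # restarts before a pod counts as "restart-looping"
MIN_AFFECTED_PODS = 2   # PVC-backed looping pods needed to suspect the node


def uses_pvc(pod):
    """True if any volume in the pod spec is backed by a PersistentVolumeClaim."""
    volumes = pod.get("spec", {}).get("volumes", [])
    return any("persistentVolumeClaim" in v for v in volumes)


def max_restarts(pod):
    """Highest restartCount across the pod's containers (0 if none reported)."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return max((s.get("restartCount", 0) for s in statuses), default=0)


def nodes_to_drain(pods, restart_threshold=RESTART_THRESHOLD,
                   min_affected=MIN_AFFECTED_PODS):
    """Return the sorted names of nodes with enough restart-looping PVC pods."""
    affected = defaultdict(int)
    for pod in pods:
        node = pod.get("spec", {}).get("nodeName")
        if node and uses_pvc(pod) and max_restarts(pod) >= restart_threshold:
            affected[node] += 1
    return sorted(n for n, count in affected.items() if count >= min_affected)


def drain_commands(node):
    """Build the kubectl commands to cordon and then drain a node.

    The commands are only returned, not executed, so a human or another
    tool can review them first.
    """
    return [
        ["kubectl", "cordon", node],
        ["kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data"],
    ]
```

In practice the `pods` list could come from parsing the `items` array of `kubectl get pods --all-namespaces -o json`. Restart counts alone cannot distinguish a storage outage from an application bug, so a real implementation would likely also need a platform-side signal before draining.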

Describe alternatives you've considered

Deleted the PVCs and redeployed the Pods.

Additional context

AKS Version: 1.29.4
CSI Driver Pod Image Version:
mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.29.6
mcr.microsoft.com/oss/kubernetes-csi/blob-csi:v1.23.4

Jul 23 '24 02:07 sprab

Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure