Effect Node Drain when AKS worker node is unable to re-establish storage connectivity due to underlying platform issue.

Open sprab opened this issue 1 year ago • 5 comments

Is your feature request related to a problem? Please describe.

One of the AKS cluster nodes, which was running StatefulSets with PVCs configured, was unable to re-establish connectivity to storage after a platform issue, in this case an Event 17 between the physical host and storage. As a result, the application pods went into a restart loop and could not get their PVCs mounted.

We chose to completely delete the deployment and re-deploy the pods to get this working.

Describe the solution you'd like

When there is a platform level failure impacting the Kubernetes layer, there should be a mechanism to detect and drain the node(s) where the CSI/Blobfuse driver pods are running, so as to run them on a healthy AKS node and re-establish the connectivity.
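To make the request concrete, here is a minimal sketch of the kind of detection logic I have in mind. It is not an existing AKS or CSI driver feature. `PodMountStatus`, `nodes_to_drain`, `drain_commands`, the thresholds, and the node names are all hypothetical; the only real pieces are the `FailedMount` event reason and the `kubectl drain` flags. It assumes something else has already collected the per-pod mount failure counts.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PodMountStatus:
    """Hypothetical per-pod summary; not a Kubernetes API type."""
    name: str
    node: str
    uses_pvc: bool
    failed_mount_events: int  # e.g. number of FailedMount events seen for the pod


def nodes_to_drain(pods, min_failing_pods=2, min_events=3):
    """Return sorted node names where enough PVC-backed pods keep failing to mount.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    failing = Counter(
        p.node
        for p in pods
        if p.uses_pvc and p.failed_mount_events >= min_events
    )
    return sorted(node for node, count in failing.items() if count >= min_failing_pods)


def drain_commands(nodes):
    """Render kubectl commands for an operator to review; nothing is executed."""
    return [
        f"kubectl drain {node} --ignore-daemonsets --delete-emptydir-data"
        for node in nodes
    ]


# Hypothetical sample data: two PVC-backed pods on aks-node-a keep failing to mount.
pods = [
    PodMountStatus("db-0", "aks-node-a", True, 5),
    PodMountStatus("db-1", "aks-node-a", True, 4),
    PodMountStatus("web-0", "aks-node-b", True, 0),
    PodMountStatus("cache-0", "aks-node-b", False, 9),
]
print(drain_commands(nodes_to_drain(pods)))
```

With this sample data only `aks-node-a` is flagged, because `aks-node-b` has no PVC-backed pod with mount failures. The sketch only prints the commands rather than running them, since an automatic drain on a false positive could be disruptive.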

Describe alternatives you've considered

Deleted the PVCs and redeployed the Pods.

Additional context

AKS Version: 1.29.4
CSI Driver Pod Image Version:
mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.29.6
mcr.microsoft.com/oss/kubernetes-csi/blob-csi:v1.23.4

sprab · Jul 23 '24 02:07

Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure

Issue needing attention of @Azure/aks-leads

This issue has been automatically marked as stale because it has not had any activity for 180 days. It will be closed if no further activity occurs within 7 days of this comment. @yuemlu

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. @sprab feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question, issue, or suggestion.