Failed to setup network for sandbox Failed to create endpoint: TransparentEndpointClient Error : operation not supported

Open andriktr opened this issue 2 years ago • 13 comments

After a node restart, pods do not start and hang in Unknown status. The kubelet logs the following message:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "0bc75e4c9f785c657c13bf1ad27a6e8f652ec01c8b48042acf2a27067edfa2e7": plugin type="azure-vnet" failed (add): Failed to create endpoint: TransparentEndpointClient Error : operation not supported
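For anyone scanning kubelet logs or pod events for this failure, here is a minimal Python sketch that pulls the sandbox ID, plugin type, operation, and reason out of a message shaped like the one above. The regular expression is modeled only on that single quoted message, and the function names are hypothetical; other CNI plugins or kubelet versions may word the error differently.

```python
import re

# Pattern modeled on the single kubelet message quoted above (an assumption);
# other plugins or kubelet versions may format this differently.
SANDBOX_ERROR = re.compile(
    r'failed to setup network for sandbox "(?P<sandbox>[0-9a-f]+)": '
    r'plugin type="(?P<plugin>[^"]+)" failed \((?P<op>\w+)\): (?P<reason>.+)'
)


def parse_sandbox_error(message):
    """Return a dict with sandbox, plugin, op and reason, or None if no match."""
    match = SANDBOX_ERROR.search(message)
    return match.groupdict() if match else None


def is_transparent_endpoint_error(message):
    """True if the message is an azure-vnet TransparentEndpointClient failure."""
    info = parse_sandbox_error(message)
    return (
        bool(info)
        and info["plugin"] == "azure-vnet"
        and "TransparentEndpointClient" in info["reason"]
    )
```

Messages that are truncated before the plugin details (for example, combined events) will not match and are reported as `None` / `False` rather than guessed at.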

Here is how the pods look (all DaemonSets):

image

Pods which I tried to recreate are stuck in ContainerCreating status.

I noticed that the problematic node has a newer kernel version: image

Restarting the node, the kubelet service, or the azure-cns pods does not help.

Any thoughts?

andriktr avatar Oct 07 '23 20:10 andriktr

The issue seems to be related to https://github.com/Azure/azure-container-networking/issues/2156

andriktr avatar Oct 08 '23 11:10 andriktr

Same here! Some deployments have worked, but others have not.

Update: restarting the cluster node seems to have fixed it, and everything now works fine.

rmovieira avatar Oct 25 '23 15:10 rmovieira

The same issue also exists with kernel version 6.2.0-1015-azure.

bartoszpyrek avatar Nov 10 '23 11:11 bartoszpyrek

For me the error was: [(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox]

I was running AKS 1.26.3 with Azure CNI.

After restarting the cluster from the portal, the issue was resolved.

arpanD93 avatar Nov 10 '23 14:11 arpanD93

Same issue with AKS 1.26.10 and Azure CNI.

For me, re-imaging the node solved the issue: https://learn.microsoft.com/en-us/cli/azure/vmss?view=azure-cli-latest#az-vmss-reimage

andriktr avatar Dec 22 '23 12:12 andriktr

I have encountered the same issue. Pods scheduled on a particular node (which had a 6.x kernel) kept crashing, and the whole cluster was unstable.

The solution to the problem was to upgrade the VMSS to the latest scale set model (stackoverflow). After upgrading to latest, the kernel version changed to 5.15.0-1057-azure and all works OK now.
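To spot nodes that match the pattern reported in this thread, a small Python sketch like the following can classify a kernel release string. The "major version 6 or higher" rule is only a heuristic drawn from the versions mentioned here (6.2.0-1015-azure reported as failing, 5.15.0-1057-azure reported as working), not an official compatibility rule, and the kernel version alone may not be the root cause. The function names are hypothetical.

```python
import re

# Matches release strings shaped like "6.2.0-1015-azure" (the form seen in this thread).
KERNEL_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)-azure$")


def parse_azure_kernel(release):
    """Return (major, minor, patch, build) as ints, or None if the format differs."""
    match = KERNEL_RE.match(release.strip())
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def reported_as_affected(release):
    """Heuristic (assumption): thread reports point at 6.x kernels only."""
    parsed = parse_azure_kernel(release)
    return parsed is not None and parsed[0] >= 6
```

On a node, the input would be the output of `uname -r` (or `platform.release()` in Python); a `True` result is only a hint to check whether the node is running an outdated image or CNI version.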

melkamar avatar Mar 21 '24 13:03 melkamar

A CNI version with the fix was released several months ago. For clusters using nodesubnet, you may need to upgrade your VMSS to get the new CNI. The issue is tracked on the CNI side: https://github.com/Azure/azure-container-networking/issues/2156

tamilmani1989 avatar Apr 23 '24 20:04 tamilmani1989

Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure

Issue needing attention of @Azure/aks-leads

Closing the issue per comment from @tamilmani1989

AllenWen-at-Azure avatar Aug 29 '24 15:08 AllenWen-at-Azure