azure-container-networking
Broken CNI state leaves node marked ready but unusable
Is this an ISSUE or FEATURE REQUEST? (choose one):
FEATURE REQUEST
Which release version?:
1.0.17
Which component (CNI/IPAM/CNM/CNS):
CNI?
Which Operating System (Linux/Windows):
Windows
For windows: provide output of "$(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion"
10.0.17763.194 (WinBuild.160101.0800)
Which Orchestrator and version (e.g. Kubernetes, Docker)
Kubernetes 1.13.3
What happened:
The CNI plugin can get into a bad state where any new pod attempting to run on the node fails. See, for example, #321, though we've hit multiple other bugs in the past (hopefully all fixed by now). When the node gets into this bad state, it still looks healthy to Kubernetes, so the scheduler keeps trying to place pods on it.
What you expected to happen:
When the CNI plugin gets into a broken state, it marks the node as NotReady so that mechanisms such as the cluster autoscaler can automatically remedy the situation.
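To illustrate the kind of remediation I mean (purely a sketch, not anything Azure CNI does today), an external per-node agent could taint its own node when some CNI health probe fails, so scheduling stops and the cluster autoscaler can replace the node. The cniHealthy() probe, the taint key, and the NODE_NAME wiring below are all made up for illustration:

```go
// Hypothetical per-node watchdog: taints its own node when a CNI health probe
// fails, so new pods stop scheduling there. cniHealthy() and the taint key are
// placeholders, not part of Azure CNI.
package main

import (
	"context"
	"log"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// cniHealthy is a placeholder probe, e.g. check the CNI state file or attempt
// a test network attach.
func cniHealthy() bool { return true }

func hasTaint(node *corev1.Node, key string) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == key {
			return true
		}
	}
	return false
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	nodeName := os.Getenv("NODE_NAME") // injected via the downward API

	taint := corev1.Taint{
		Key:    "example.com/cni-unhealthy", // made-up key
		Effect: corev1.TaintEffectNoSchedule,
	}

	for {
		if !cniHealthy() {
			node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
			if err == nil && !hasTaint(node, taint.Key) {
				node.Spec.Taints = append(node.Spec.Taints, taint)
				if _, err := client.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err != nil {
					log.Printf("failed to taint node: %v", err)
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}
```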
How to reproduce it (as minimally and precisely as possible):
Let the CNI get into a broken state due to some bug, and then try to schedule a new pod on the node (including via auto-scaling).
Anything else we need to know:
This is probably the biggest problem we currently have running a Windows AKS-Engine workload in production - nodes get stuck and don't auto-recover. If errors from the network stack propagated to node health, I think we'd have significantly better cluster availability.
This may happen if the lock file is leaked by the CNI. There were fixes in aks-engine v0.31.2 and Azure CNI v1.0.18 to address that. Can you please check whether the issue still reproduces with those versions?
This issue/feature request is specifically about the fact that if, for any reason, the CNI stops working on a node, the node isn't marked NotReady, so new pods still get scheduled onto it. We've seen this behavior with multiple bugs (the lock file is just one specific example).
@davidmatson are you seeing this issue with v1.0.18?
This issue isn't specific to v1.0.18, but yes, I am still seeing problems there. This issue is about the fact that multiple different CNI bugs can cause the node to stop working, but when that happens, the node isn't reported as NotReady, which would let the cluster autoscaler remove it.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
/remove-stale
@davidmatson I'm not sure that this is something we can reasonably address from the CNI. We already return errors to the CRI when we encounter them (and the CRI should return them to kubelet, etc.). The CNI doesn't/shouldn't know how to talk to Kubernetes directly, and so can't update Node statuses. As far as I know, this isn't something any other CNI does either - do you have an example of the desired behavior you've described?
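For context, the error path at the CNI layer looks roughly like this (a sketch using the upstream containernetworking skel/types packages, not the actual Azure CNI source): the plugin returns a typed error, skel writes it to stdout as the spec's error object, and the runtime reports the failed sandbox to kubelet. Nothing in this path can touch the Node object:

```go
// Rough sketch of how a CNI plugin reports failure to the container runtime.
// attachEndpoint is a placeholder for the plugin's real work.
package main

import (
	"errors"

	"github.com/containernetworking/cni/pkg/skel"
	"github.com/containernetworking/cni/pkg/types"
	"github.com/containernetworking/cni/pkg/version"
)

func attachEndpoint(args *skel.CmdArgs) error { return errors.New("placeholder failure") }

func cmdAdd(args *skel.CmdArgs) error {
	if err := attachEndpoint(args); err != nil {
		// skel serializes this as the CNI spec error object, e.g.
		// {"code": 999, "msg": "failed to set up network", "details": "..."}.
		return types.NewError(types.ErrInternal, "failed to set up network", err.Error())
	}
	return nil
}

func cmdCheck(args *skel.CmdArgs) error { return nil }
func cmdDel(args *skel.CmdArgs) error   { return nil }

func main() {
	skel.PluginMain(cmdAdd, cmdCheck, cmdDel, version.All, "example CNI error-path sketch")
}
```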
@rbtr - I'm not familiar with the details of the implementation that would be needed here. This bug is just "when it knows it's broken, that knowledge doesn't propagate far enough to be useful." Perhaps the problem is in how the errors are returned to the CRI, or in how the CRI returns them to kubelet, etc. - someone with more knowledge of how the various pieces relate may be able to identify the right way to ensure such errors get propagated (i.e., whether there's a problem at this layer or a gap in an upstream component).
@paulgmiller wdyt about this ask?
When the CNI plugin gets into a broken state, it marks the node as NotReady so that mechanisms such as the cluster autoscaler can automatically remedy the situation.
AFAIK this is not something that the CNI spec accommodates, and I don't think any other CNIs handle it internally. I think we are doing what we should by returning errors to the CRI; this seems like something that some other monitoring component would have to catch. I'm opposed to letting the CNI talk to k8s itself.
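To make the "other monitoring component" idea concrete, here is a rough sketch (not something we ship; the event reason, window, and NODE_NAME wiring are assumptions) of a per-node agent that counts recent FailedCreatePodSandBox events emitted by this node's kubelet and could then cordon/taint the node or publish a condition:

```go
// Sketch of an out-of-CNI "sandbox health" check that could run as a per-node
// agent: count recent FailedCreatePodSandBox events from this node's kubelet.
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	nodeName := os.Getenv("NODE_NAME") // downward API

	events, err := client.CoreV1().Events(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
		FieldSelector: "reason=FailedCreatePodSandBox",
	})
	if err != nil {
		log.Fatal(err)
	}

	cutoff := time.Now().Add(-10 * time.Minute) // arbitrary window
	count := 0
	for _, ev := range events.Items {
		if ev.Source.Host == nodeName && ev.LastTimestamp.Time.After(cutoff) {
			count++
		}
	}
	fmt.Printf("recent sandbox creation failures on %s: %d\n", nodeName, count)
	// A real agent would cordon/taint the node or set a condition above some threshold.
}
```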
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
/remove-stale
Azure CNI conforms to the CNI spec and returns errors to the CRI, and the CRI propagates them to kubelet. If you want this reflected in the k8s Node object, you probably have to open a ticket with Kubernetes upstream (https://github.com/kubernetes/kubernetes) and follow up there. Let me know if you have any other asks; otherwise we can close this ticket.
kubernetes/kubernetes#118118
Per kubernetes/kubernetes#118118, this should be handled by the plugin, if I've understood correctly.
@davidmatson this is not functionality that we will add to AzCNI, for the previously mentioned reasons.
In AKS, it may be handled by CNS in the future and is partially handled already for failure to initialize the node.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
Issue closed due to inactivity.