
Broken CNI state leaves node marked ready but unusable

Open davidmatson opened this issue 5 years ago • 17 comments

Is this an ISSUE or FEATURE REQUEST? (choose one):

FEATURE REQUEST

Which release version?:

1.0.17

Which component (CNI/IPAM/CNM/CNS):

CNI?

Which Operating System (Linux/Windows):

Windows

For windows: provide output of "$(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion"

10.0.17763.194 (WinBuild.160101.0800)

Which Orchestrator and version (e.g. Kubernetes, Docker)

Kubernetes 1.13.3

What happened:

The CNI plugin can get into a bad state where any new pod attempting to run on the node fails. See, for example, #321, though we've hit multiple other bugs in the past (hopefully all since fixed). When the node gets into this bad state, it still looks healthy to Kubernetes, so the scheduler doesn't know to stop placing pods on it.

What you expected to happen:

When the CNI plugin gets into a broken state, the node is marked NotReady so that other mechanisms, such as the cluster autoscaler, can automatically remedy the situation.

How to reproduce it (as minimally and precisely as possible):

Let the CNI get into a broken state due to some bug, and then try to schedule a new pod on the node (including via auto-scaling).

Anything else we need to know:

This is probably the biggest issue we're currently having running a Windows AKS-Engine workload in production: nodes get stuck and don't auto-recover. If errors from the network stack propagated to node health, I think we'd have significantly better cluster availability.
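For illustration only (a hypothetical sketch, not anything Azure CNI or AKS ships): one way to see this state from the cluster side is that the node keeps reporting Ready while new pods scheduled to it accumulate FailedCreatePodSandBox events carrying the CNI error the runtime received.

```go
// Hypothetical detection sketch (assumes client-go and a kubeconfig at the
// default location): list recent FailedCreatePodSandBox events, which are
// what a failed CNI ADD typically surfaces as on the Kubernetes side.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Events with this reason usually include the error message the runtime
	// got back from the CNI plugin when sandbox setup failed.
	events, err := client.CoreV1().Events(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
		FieldSelector: "reason=FailedCreatePodSandBox",
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range events.Items {
		fmt.Printf("%s/%s (node %s): %s\n", e.InvolvedObject.Namespace, e.InvolvedObject.Name, e.Source.Host, e.Message)
	}
}
```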

davidmatson avatar Mar 27 '19 23:03 davidmatson

This may happen if the lock file is leaked by the CNI. There were fixes in aks-engine v0.31.2 and Azure CNI v1.0.18 to address that. Can you please check whether the issue still reproduces with those?

ashvindeodhar avatar Mar 28 '19 00:03 ashvindeodhar

This issue/feature request is specifically about the fact that if the CNI stops working on a node for any reason, the node isn't marked NotReady, so new pods still get scheduled onto it. We've seen this behavior with multiple bugs (the lock file is just one specific example).

davidmatson avatar Mar 28 '19 00:03 davidmatson

@davidmatson are you seeing this issue with v1.0.18?

ashvindeodhar avatar Apr 10 '19 14:04 ashvindeodhar

This issue isn't specific to v1.0.18, but yes, I am still seeing problems there. This issue is about the fact that multiple different CNI bugs can cause the node to stop working, but when that happens the node isn't reported as NotReady, so the cluster autoscaler has no signal to remove it.

davidmatson avatar Apr 12 '19 01:04 davidmatson

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Feb 09 '22 18:02 github-actions[bot]

/remove-stale

davidmatson avatar Feb 09 '22 18:02 davidmatson

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Apr 11 '22 00:04 github-actions[bot]

/remove-stale

davidmatson avatar Apr 11 '22 17:04 davidmatson

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Jun 11 '22 00:06 github-actions[bot]

/remove-stale

davidmatson avatar Jun 11 '22 00:06 davidmatson

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Aug 12 '22 00:08 github-actions[bot]

/remove-stale

davidmatson avatar Aug 12 '22 00:08 davidmatson

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Oct 12 '22 00:10 github-actions[bot]

/remove-stale

davidmatson avatar Oct 12 '22 00:10 davidmatson

@davidmatson I'm not sure this is something we can reasonably address from the CNI. We already return errors to the CRI when we encounter them (and the CRI should return them to kubelet, etc.). The CNI doesn't (and shouldn't) know how to talk to Kubernetes directly, so it can't update Node statuses. As far as I know, this isn't something that any other CNIs do either - do you have an example of the desired behavior you've described?

rbtr avatar Oct 12 '22 16:10 rbtr

@rbtr - I'm not familiar with the details of the implementation that would be needed here. This bug is just "when it knows it's broken, that knowledge doesn't propagate far enough to be useful." Perhaps the problem is in how the errors are returned to the CRI, or in how the CRI returns them to kubelet, etc. Someone with more knowledge of how the various pieces relate may be able to say what the right way is to ensure such errors get propagated (i.e., whether there's a problem at this layer or a gap in an upstream component).

davidmatson avatar Oct 12 '22 17:10 davidmatson

@paulgmiller wdyt about this ask?

> When the CNI plugin gets into a broken state, the node is marked NotReady so that other mechanisms, such as the cluster autoscaler, can automatically remedy the situation.

AFAIK this is not something the CNI spec accommodates, and I don't think any other CNIs handle this internally. I think we're doing what we should by returning errors to the CRI; this seems like something some other monitoring component would have to catch. I'm opposed to letting the CNI talk to k8s itself.
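To sketch what such an out-of-band monitoring component could look like (all names below are hypothetical, and this is only one possible shape - since kubelet owns the Ready condition, an external remediator would more likely taint or cordon the node than flip it to NotReady):

```go
// Hypothetical sketch of an out-of-band watchdog (e.g. run as a DaemonSet),
// not something Azure CNI or AKS ships: when the CNI on this node looks
// broken, taint the node so the scheduler stops placing pods there and the
// cluster autoscaler can replace it.
package main

import (
	"context"
	"log"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// cniLooksBroken is a stub for a real probe, e.g. checking for repeated
// sandbox-creation failures or a stale CNI lock/state file on this node.
func cniLooksBroken() bool { return false }

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	nodeName := os.Getenv("NODE_NAME") // injected via the downward API

	for range time.Tick(30 * time.Second) {
		if !cniLooksBroken() {
			continue
		}
		node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
		if err != nil {
			log.Println(err)
			continue
		}
		// A real implementation would de-duplicate the taint and clear it
		// once the CNI recovers; omitted here for brevity.
		node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
			Key:    "example.com/cni-broken", // hypothetical taint key
			Effect: corev1.TaintEffectNoSchedule,
		})
		if _, err := client.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err != nil {
			log.Println(err)
		}
	}
}
```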

rbtr avatar Oct 13 '22 02:10 rbtr

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Dec 13 '22 00:12 github-actions[bot]

/remove-stale

davidmatson avatar Dec 13 '22 17:12 davidmatson

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Feb 13 '23 00:02 github-actions[bot]

/remove-stale

davidmatson avatar Feb 13 '23 00:02 davidmatson

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Apr 16 '23 00:04 github-actions[bot]

/remove-stale

davidmatson avatar Apr 17 '23 18:04 davidmatson

Azure CNI conforms to the CNI spec and returns errors to the CRI, and the CRI propagates them to kubelet. If you want this reflected in the k8s Node object, you probably have to open a ticket with Kubernetes upstream (https://github.com/kubernetes/kubernetes) and follow up there. Let me know if you have any other asks; otherwise we can close this ticket.
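For reference, a generic sketch of that spec-level mechanism (hypothetical plugin, not Azure CNI's actual code): a plugin returns a CNI error from its ADD handler, and the skel helper prints the spec's error JSON on stdout for the runtime to pass back up the stack.

```go
// Generic CNI plugin sketch (hypothetical, not Azure CNI source) showing how
// failures are reported per the CNI spec: the returned *types.Error is
// serialized to the spec's error JSON on stdout and the plugin exits non-zero,
// which is what the CRI sees and relays to kubelet.
package main

import (
	"github.com/containernetworking/cni/pkg/skel"
	"github.com/containernetworking/cni/pkg/types"
	"github.com/containernetworking/cni/pkg/version"
)

func cmdAdd(args *skel.CmdArgs) error {
	// Hypothetical failure path: any endpoint-setup error is wrapped in a
	// spec-defined error code so the runtime gets a structured result.
	return types.NewError(types.ErrInternal, "failed to set up endpoint", "example details")
}

func cmdCheck(args *skel.CmdArgs) error { return nil }
func cmdDel(args *skel.CmdArgs) error   { return nil }

func main() {
	// Classic skel entry point; it handles ADD/CHECK/DEL dispatch and prints
	// results or errors in the wire format the spec requires.
	skel.PluginMain(cmdAdd, cmdCheck, cmdDel, version.All, "example-cni v0.0.1")
}
```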

tamilmani1989 avatar May 18 '23 20:05 tamilmani1989

kubernetes/kubernetes#118118

davidmatson avatar May 18 '23 20:05 davidmatson

Per kubernetes/kubernetes#118118, this should be handled by the plugin, if I've understood correctly.

davidmatson avatar May 30 '23 17:05 davidmatson

@davidmatson this is not functionality that we will add to AzCNI, for the previously mentioned reasons.

In AKS, it may be handled by CNS in the future, and it is already partially handled for failures to initialize the node.

rbtr avatar Jun 07 '23 17:06 rbtr

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Aug 07 '23 00:08 github-actions[bot]

Issue closed due to inactivity.

github-actions[bot] avatar Aug 22 '23 00:08 github-actions[bot]