azure-container-networking
Broken CNI state leaves node marked ready but unusable
Is this an ISSUE or FEATURE REQUEST? (choose one):
FEATURE REQUEST
Which release version?:
1.0.17
Which component (CNI/IPAM/CNM/CNS):
CNI?
Which Operating System (Linux/Windows):
Windows
For windows: provide output of "$(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion"
10.0.17763.194 (WinBuild.160101.0800)
Which Orchestrator and version (e.g. Kubernetes, Docker)
Kubernetes 1.13.3
What happened:
The CNI plugin can get into a bad state where any new pod attempting to run on the node fails. See, for example, #321, though we've hit multiple other bugs in the past (hopefully all fixed by now). When the node gets into this bad state, it still looks healthy to Kubernetes, so the scheduler keeps trying to place pods on it.
What you expected to happen:
When the CNI plugin gets into a broken state, it marks the node as NotReady so that mechanisms such as the cluster autoscaler can automatically remedy the situation.
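To illustrate the kind of remediation I mean (purely a sketch, not anything Azure CNI does today), an external per-node agent could taint its own node when some CNI health probe fails, so scheduling stops and the cluster autoscaler can replace the node. The cniHealthy() probe, the taint key, and the NODE_NAME wiring below are all made up for illustration:

```go
// Hypothetical per-node watchdog: taints its own node when a CNI health probe
// fails, so new pods stop scheduling there. cniHealthy() and the taint key are
// placeholders, not part of Azure CNI.
package main

import (
	"context"
	"log"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// cniHealthy is a placeholder probe, e.g. check the CNI state file or attempt
// a test network attach.
func cniHealthy() bool { return true }

func hasTaint(node *corev1.Node, key string) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == key {
			return true
		}
	}
	return false
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	nodeName := os.Getenv("NODE_NAME") // injected via the downward API

	taint := corev1.Taint{
		Key:    "example.com/cni-unhealthy", // made-up key
		Effect: corev1.TaintEffectNoSchedule,
	}

	for {
		if !cniHealthy() {
			node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
			if err == nil && !hasTaint(node, taint.Key) {
				node.Spec.Taints = append(node.Spec.Taints, taint)
				if _, err := client.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err != nil {
					log.Printf("failed to taint node: %v", err)
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}
```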
How to reproduce it (as minimally and precisely as possible):
Let the CNI get into a broken state due to some bug, and then try to schedule a new pod on the node (including via auto-scaling).
Anything else we need to know:
This is probably the biggest problem we currently have running a Windows AKS-Engine workload in production - nodes get stuck and don't auto-recover. If errors from the network stack propagated to node health, I think we'd have significantly better cluster availability.
This may happen if the lock file is leaked by the CNI. There were fixes in aks-engine v0.31.2 and Azure CNI v1.0.18 to address that. Can you please check whether the issue still reproduces with those versions?
This issue/feature request is specifically about the fact that if, for any reason, the CNI stops working on a node, the node isn't marked NotReady, so new pods still get scheduled onto it. We've seen this behavior with multiple bugs (the lock file is just one specific example).
@davidmatson are you seeing this issue with v1.0.18?
This issue isn't specific to v1.0.18, but yes, I am still seeing problems there. This issue is about the fact that multiple different CNI bugs can cause the node to stop working, but when that happens, the node isn't reported as NotReady, which would let the cluster autoscaler remove it.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
/remove-stale
@davidmatson I'm not sure that this is something we can reasonably address from the CNI. We already return errors to the CRI when we encounter them (and the CRI should return them to kubelet, etc.). The CNI doesn't/shouldn't know how to talk to Kubernetes directly, and so can't update Node statuses. As far as I know, this isn't something any other CNI does either - do you have an example of the desired behavior you've described?
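For context, the error path at the CNI layer looks roughly like this (a sketch using the upstream containernetworking skel/types packages, not the actual Azure CNI source): the plugin returns a typed error, skel writes it to stdout as the spec's error object, and the runtime reports the failed sandbox to kubelet. Nothing in this path can touch the Node object:

```go
// Rough sketch of how a CNI plugin reports failure to the container runtime.
// attachEndpoint is a placeholder for the plugin's real work.
package main

import (
	"errors"

	"github.com/containernetworking/cni/pkg/skel"
	"github.com/containernetworking/cni/pkg/types"
	"github.com/containernetworking/cni/pkg/version"
)

func attachEndpoint(args *skel.CmdArgs) error { return errors.New("placeholder failure") }

func cmdAdd(args *skel.CmdArgs) error {
	if err := attachEndpoint(args); err != nil {
		// skel serializes this as the CNI spec error object, e.g.
		// {"code": 999, "msg": "failed to set up network", "details": "..."}.
		return types.NewError(types.ErrInternal, "failed to set up network", err.Error())
	}
	return nil
}

func cmdCheck(args *skel.CmdArgs) error { return nil }
func cmdDel(args *skel.CmdArgs) error   { return nil }

func main() {
	skel.PluginMain(cmdAdd, cmdCheck, cmdDel, version.All, "example CNI error-path sketch")
}
```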
@rbtr - I'm not familiar with the details of the implementation that would be needed here. This bug is just "when it knows it's broken, that knowledge doesn't propagate far enough to be useful." Perhaps the problem is in how the errors are returned to the CRI, or in how the CRI returns them to kubelet, etc. - someone with more knowledge of how the various pieces relate may be able to identify the right way to ensure such errors get propagated (i.e., whether there's a problem at this layer or a gap in an upstream component).
@paulgmiller wdyt about this ask?
When the CNI plugin gets into a broken state, it marks the node as NotReady so that mechanisms such as the cluster autoscaler can automatically remedy the situation.
AFAIK this is not something that the CNI spec accommodates, and I don't think any other CNIs handle it internally. I think we are doing what we should by returning errors to the CRI; this seems like something that some other monitoring component would have to catch. I'm opposed to letting the CNI talk to k8s itself.
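To make the "other monitoring component" idea concrete, here is a rough sketch (not something we ship; the event reason, window, and NODE_NAME wiring are assumptions) of a per-node agent that counts recent FailedCreatePodSandBox events emitted by this node's kubelet and could then cordon/taint the node or publish a condition:

```go
// Sketch of an out-of-CNI "sandbox health" check that could run as a per-node
// agent: count recent FailedCreatePodSandBox events from this node's kubelet.
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	nodeName := os.Getenv("NODE_NAME") // downward API

	events, err := client.CoreV1().Events(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
		FieldSelector: "reason=FailedCreatePodSandBox",
	})
	if err != nil {
		log.Fatal(err)
	}

	cutoff := time.Now().Add(-10 * time.Minute) // arbitrary window
	count := 0
	for _, ev := range events.Items {
		if ev.Source.Host == nodeName && ev.LastTimestamp.Time.After(cutoff) {
			count++
		}
	}
	fmt.Printf("recent sandbox creation failures on %s: %d\n", nodeName, count)
	// A real agent would cordon/taint the node or set a condition above some threshold.
}
```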
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
/remove-stale
Azure CNI conforms to the CNI spec and returns errors to the CRI, and the CRI propagates them to kubelet. If you want this reflected in the k8s Node object, you probably have to open a ticket with Kubernetes upstream (https://github.com/kubernetes/kubernetes) and follow up there. Let me know if you have any other asks; otherwise we can close this ticket.
kubernetes/kubernetes#118118
Per kubernetes/kubernetes#118118, this should be handled by the plugin, if I've understood correctly.
@davidmatson this is not functionality that we will add to AzCNI, for the previously mentioned reasons.
In AKS, it may be handled by CNS in the future and is partially handled already for failure to initialize the node.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
Issue closed due to inactivity.