
Node deletion results in stale LSPs and IP leaking on layer2/localnet networks

cathy-zhou opened this issue 9 months ago · 1 comment

What happened?

When a node is deleted, the pods scheduled on it are also deleted, but the pods' associated logical switch ports (LSPs) stay in the OVN NB database and the pod IPs are not released either.

What did you expect to happen?

We expected that when any pod scheduled on the deleted node is deleted, its associated logical switch port would be deleted and the pod's IP would be released.

How can we reproduce it (as minimally and precisely as possible)?

In the non-IC (non-interconnect) case, create a pod attached to a localnet network on a node, then delete that node directly.
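A rough sketch of the reproduction, assuming a localnet NetworkAttachmentDefinition named `localnet-net` already exists in the current namespace (the NAD name, pod name, and the `nb-ovsdb` container name are placeholders, not values from the report):

```shell
# Create a pod attached to the (assumed) localnet NAD via the multus
# networks annotation.
kubectl run testpod --image=registry.k8s.io/pause:3.9 --restart=Never \
  --overrides='{"metadata":{"annotations":{"k8s.v1.cni.cncf.io/networks":"localnet-net"}}}'

# Find the node the pod landed on and delete that node directly,
# which cascades to deleting the pod.
NODE=$(kubectl get pod testpod -o jsonpath='{.spec.nodeName}')
kubectl delete node "$NODE"

# The pod's LSP should now be gone from the NB DB, but it still shows up:
kubectl exec -n ovn-kubernetes ds/ovnkube-node -c nb-ovsdb -- \
  ovn-nbctl --columns=name find logical_switch_port | grep testpod
```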

Anything else we need to know?

The problem is that when a pod is deleted, ovnkube-controller checks whether the pod is scheduled on the local node, but by that time the node-deletion handler has already removed the node from the localNode cache. The check fails, nothing is done for the pod, and its LSP (and IP allocation) is left behind.

OVN-Kubernetes version

Downstream ovn-kubernetes based on upstream commits up to a5ef4eeede2, but I believe the issue still exists in the current upstream code.

Kubernetes version

N/A

OVN version

$ oc rsh -n ovn-kubernetes ovnkube-node-xxxxx (pick any ovnkube-node pod on your cluster)
$ rpm -q ovn
# paste output here

OVS version

$ oc rsh -n ovn-kubernetes ovs-node-xxxxx (pick any ovs pod on your cluster)
$ rpm -q openvswitch
# paste output here

Platform

Is it baremetal? GCP? AWS? Azure?

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

cathy-zhou · May 23 '24 00:05