CCM failed to clean up cache when deleting old nodes, causing new nodes to be unexpectedly deleted.
What happened?
CCM failed to clean up nodeNameMap when deleting old nodes, causing new nodes to be unexpectedly deleted.
The fix might be:
Option 1: Clean up the cache using the node name directly. More specifically, change this part to:
// in case of a race condition that node with same name create happens before delete event,
// delete the node based on uuid
name := nm.getNodeNameByUUID(uuid)
if name != "" {
	delete(nm.nodeNameMap, name)
} else {
	klog.V(4).Info("node name: ", node.GetName(), " has a different uuid. Deleting this node from cache using node name directly.")
	delete(nm.nodeNameMap, node.GetName())
}
delete(nm.nodeUUIDMap, uuid)
nm.nodeInfoLock.Unlock()
Option 2: Look up the node name using node.UID instead of node.Status.NodeInfo.SystemUUID.
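A rough sketch of what Option 2 could look like. It assumes a hypothetical nodeK8sUIDMap (keyed by the Kubernetes node.UID, mapping to the node name) maintained alongside the existing caches; that map does not exist in the issue or the current code and is only shown to illustrate the idea:

// Sketch only: nodeK8sUIDMap is a hypothetical map[string]string from node.UID to node name,
// which would have to be populated in the node-add path.
// nodeInfoLock is assumed to already be held here, as in the snippets above.
if name, ok := nm.nodeK8sUIDMap[string(node.UID)]; ok {
	delete(nm.nodeNameMap, name)
	delete(nm.nodeK8sUIDMap, string(node.UID))
} else {
	delete(nm.nodeNameMap, node.GetName())
}
delete(nm.nodeUUIDMap, uuid)
nm.nodeInfoLock.Unlock()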
What did you expect to happen?
nodeNameMap cleaned up correctly after node deletion.
How can we reproduce it (as minimally and precisely as possible)?
- VM Recreate: the old Node obj is not deleted at this time.
- Kubelet Register: updates node.Status.NodeInfo.SystemUUID in the old Node obj with a new value.
- CCM Delete Old Node Obj: fails to clean up nodeNameMap because the lookup uses the new SystemUUID.
- Trigger VM Restart.
- Kubelet Register Again: a new Node obj is created at this point; this new Node obj is expected to be valid and should not be deleted.
- CCM Unexpectedly Delete New Node Obj: CCM gets an old providerID from nodeNameMap because it was not cleaned up previously. The old providerID doesn't match any underlying VM, so CCM deletes the new Node obj unexpectedly (see the sketch after this list).
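A minimal, self-contained sketch of this failure mode; the map names mirror the ones mentioned in this issue, but the value types are simplified and this is not the real CCM code:

package main

import "fmt"

func main() {
	// Simplified stand-ins for the CCM caches described above:
	// node name -> providerID, and SystemUUID -> node name.
	nodeNameMap := map[string]string{}
	nodeUUIDMap := map[string]string{}

	// Old node registered.
	nodeNameMap["node-a"] = "vsphere://old-vm-uuid"
	nodeUUIDMap["old-vm-uuid"] = "node-a"

	// Delete event for the old Node obj arrives carrying the *new* SystemUUID, so the
	// UUID lookup finds no name and the nodeNameMap entry is skipped (the bug).
	delete(nodeUUIDMap, "new-vm-uuid") // no-op: nothing is registered under the new UUID

	// The new node registers with the same name; CCM reads the stale cache entry, gets a
	// providerID that matches no existing VM, and deletes the new Node obj.
	fmt.Println("stale providerID returned for new node:", nodeNameMap["node-a"])
}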
Anything else we need to know (please consider providing level 4 or above logs of CPI)?
2023-08-23T16:05:54.087418598Z I0823 16:05:54.087346 1 nodemanager.go:115] removeNode NodeName: 81ff17e25ec6-qual-335-1500f723, UID: 420aa1b2-3dc1-24d5-93cd-ee945c337130
2023-08-23T16:05:54.087425877Z I0823 16:05:54.087353 1 nodemanager.go:120] removeNode from UUID and Name cache. NodeName: 81ff17e25ec6-qual-335-1500f723, UID: 420aa1b2-3dc1-24d5-93cd-ee945c337130
2023-08-23T16:05:54.087430218Z I0823 16:05:54.087361 1 nodemanager.go:127] node name: 81ff17e25ec6-qual-335-1500f723 has a different uuid. Skip deleting this node from cache.
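For context, a rough sketch of the removeNode cleanup path implied by the log lines above and by Option 1 in the description; it is not a copy of the actual nodemanager.go source, but it shows how the else branch only logs and skips the delete, leaving the stale nodeNameMap entry behind:

// Sketch of the current behavior inferred from the logs; nodeInfoLock is assumed to
// already be held, as in the snippets elsewhere in this issue.
name := nm.getNodeNameByUUID(uuid)
if name != "" {
	delete(nm.nodeNameMap, name)
} else {
	// Only logs and skips the delete, so nodeNameMap keeps the stale entry.
	klog.V(4).Info("node name: ", node.GetName(), " has a different uuid. Skip deleting this node from cache.")
}
delete(nm.nodeUUIDMap, uuid)
nm.nodeInfoLock.Unlock()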
Kubernetes version
1.26
Cloud provider or hardware configuration
- args:
- --cloud-provider=vsphere
- --v=4
- --authentication-kubeconfig=/etc/kubernetes/admin.conf
- --authorization-kubeconfig=/etc/kubernetes/admin.conf
- --cloud-config=/etc/kubernetes/cloud-config/vsphere.conf
- --kubeconfig=/etc/kubernetes/admin.conf
OS version
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
Kernel (e.g. uname -a)
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)
Others
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Friendly ping
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc @xing-yang Do you know anyone who could take a look at this issue? Thanks!
cc @lubronzhan
cc @XudongLiuHarold could you help take a look?
And @jingxu97, do you observe the issue with the same CPI version?
I think we're also running into this issue on RKE2 with CPI version 1.28.0, and we've had it for a while now. I wasn't able to find a cause before, but this makes sense since we are deleting and then creating a new node with the same name (changing the underlying VM template).
If you run kubectl get nodes -w, you can actually see the new node appear for a brief moment before the CPI deletes it.
Will spend some time to take a look at this issue this week.
/assign
Glad to hear @XudongLiuHarold - Did you have a chance to take a look this week?
> Glad to hear @XudongLiuHarold - Did you have a chance to take a look this week?
Got occupied by company business for most of the time, took a look at this issue today.
Hi @lubronzhan, I saw your comment here:
// in case of a race condition that node with same name create happens before delete event,
// delete the node based on uuid
name := nm.getNodeNameByUUID(uuid)
if name != "" {
delete(nm.nodeNameMap, name)
} else {
klog.V(4).Info("node name: ", node.GetName(), " has a different uuid. Delete this node from cache.")
delete(nm.nodeNameMap, node.GetName())
}
delete(nm.nodeUUIDMap, uuid)
nm.nodeInfoLock.Unlock()
Do you see any side effect if we still delete the node name from the cache map even when the node name no longer matches the UUID currently being deleted?
The case in the comment shouldn't happen so I think it's ok to remove the second case.
> The case in the comment shouldn't happen so I think it's ok to remove the second case.
Gotcha, thanks!
I submitted a fix here: https://github.com/kubernetes/cloud-provider-vsphere/pull/997