cloud-provider-vsphere

CCM failed to clean up cache when deleting old nodes, causing new nodes to be unexpectedly deleted.

chenguo-github opened this issue • 15 comments

What happened?

  • CCM failed to clean up nodeNameMap when deleting old nodes, causing new nodes to be unexpectedly deleted.

  • The fix might be:

    Option 1: Clean up the cache using the node name directly. More specifically, change this part to:

	// in case of a race condition that node with same name create happens before delete event,
	// delete the node based on uuid
	name := nm.getNodeNameByUUID(uuid)
	if name != "" {
		delete(nm.nodeNameMap, name)
	} else {
		klog.V(4).Info("node name: ", node.GetName(), " has a different uuid. Deleting this node from cache using node name directly.")
		delete(nm.nodeNameMap, node.GetName())
	}
	delete(nm.nodeUUIDMap, uuid)
	nm.nodeInfoLock.Unlock()

    Option 2: Look up the node name using node.UID instead of node.Status.NodeInfo.SystemUUID (sketched below).
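
A rough sketch of what Option 2 could look like is below. The type and field names (NodeManager, nodeNameMap, nodeUIDMap, nodeInfoLock) are simplified stand-ins inferred from the snippet above, not the actual nodemanager.go code:

	package nodemanager

	import (
		"sync"

		v1 "k8s.io/api/core/v1"
	)

	// NodeManager is a hypothetical, simplified stand-in for the real node manager.
	type NodeManager struct {
		nodeInfoLock sync.Mutex
		nodeNameMap  map[string]string // node name -> providerID (simplified)
		nodeUIDMap   map[string]string // object UID -> node name (simplified)
	}

	// removeNode sketches Option 2: key the reverse lookup on the immutable object
	// UID (node.UID) rather than node.Status.NodeInfo.SystemUUID, which kubelet
	// can rewrite after the underlying VM is recreated.
	func (nm *NodeManager) removeNode(node *v1.Node) {
		nm.nodeInfoLock.Lock()
		defer nm.nodeInfoLock.Unlock()

		// The object UID never changes for a given node object, so the entries
		// written at registration time can always be found and removed here.
		delete(nm.nodeNameMap, node.GetName())
		delete(nm.nodeUIDMap, string(node.GetUID()))
	}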

What did you expect to happen?

nodeNameMap cleaned up correctly after node deletion.

How can we reproduce it (as minimally and precisely as possible)?

  • VM Recreate

    Old Node Obj not deleted at this time

  • Kubelet Register

    Update node.Status.NodeInfo.SystemUUID in old node obj with a new value

  • CCM Deletes Old Node Obj

    CCM fails to clean up the nodeNameMap entry because the lookup uses the node's new SystemUUID, which is not in the cache.

  • Trigger VM Restart

  • Kubelet Register Again

    A new node object is created at this point; it is expected to be valid and should not be deleted.

  • CCM Unexpectedly Deletes New Node Obj

    CCM got an old providerID from nodeNameMap because the entry was never cleaned up. The old providerID no longer matches any underlying VM, so CCM deleted the new node object unexpectedly (see the sketch below).
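
To make this failure mode concrete, here is a minimal, self-contained sketch; the maps are simplified stand-ins for the CCM caches, the node name is taken from the logs further down, and the providerID value is hypothetical:

	package main

	import "fmt"

	func main() {
		// Simplified stand-in for nm.nodeNameMap (node name -> providerID).
		// This stale entry was left behind because removeNode skipped the cleanup.
		nodeNameMap := map[string]string{
			"81ff17e25ec6-qual-335-1500f723": "vsphere://420aa1b2-old-vm-uuid", // hypothetical providerID
		}

		// "Kubelet Register Again": the new node object reuses the same name, so
		// the lookup hits the stale entry and returns the old VM's providerID.
		staleProviderID := nodeNameMap["81ff17e25ec6-qual-335-1500f723"]

		// "CCM Unexpectedly Deletes New Node Obj": no VM backs the stale providerID
		// any more, so the existence check fails and the node lifecycle controller
		// deletes the perfectly valid new node object.
		vmExists := false // what a vCenter lookup for the stale providerID would report
		if !vmExists {
			fmt.Printf("VM for %s not found; deleting node\n", staleProviderID)
		}
	}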

Anything else we need to know (please consider providing level 4 or above logs of CPI)?

2023-08-23T16:05:54.087418598Z I0823 16:05:54.087346 1 nodemanager.go:115] removeNode NodeName: 81ff17e25ec6-qual-335-1500f723, UID: 420aa1b2-3dc1-24d5-93cd-ee945c337130
2023-08-23T16:05:54.087425877Z I0823 16:05:54.087353 1 nodemanager.go:120] removeNode from UUID and Name cache. NodeName: 81ff17e25ec6-qual-335-1500f723, UID: 420aa1b2-3dc1-24d5-93cd-ee945c337130
2023-08-23T16:05:54.087430218Z I0823 16:05:54.087361 1 nodemanager.go:127] node name: 81ff17e25ec6-qual-335-1500f723 has a different uuid. Skip deleting this node from cache.

Kubernetes version

1.26

Cloud provider or hardware configuration

  • args:
  • --cloud-provider=vsphere
  • --v=4
  • --authentication-kubeconfig=/etc/kubernetes/admin.conf
  • --authorization-kubeconfig=/etc/kubernetes/admin.conf
  • --cloud-config=/etc/kubernetes/cloud-config/vsphere.conf
  • --kubeconfig=/etc/kubernetes/admin.conf

OS version


Kernel (e.g. uname -a)

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

Others

chenguo-github avatar Sep 07 '23 01:09 chenguo-github

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 27 '24 20:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 26 '24 21:02 k8s-triage-robot

Friendly ping

chenguo-github avatar Feb 29 '24 17:02 chenguo-github

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 30 '24 18:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 30 '24 18:03 k8s-ci-robot

/cc @xing-yang Do you know anyone who could take a look at this issue? Thanks!

jingxu97 avatar Apr 04 '24 23:04 jingxu97

cc @lubronzhan

xing-yang avatar Apr 11 '24 15:04 xing-yang

cc @XudongLiuHarold could you help take a look?

lubronzhan avatar Apr 12 '24 00:04 lubronzhan

And @jingxu97, do you observe the issue with the same CPI version?

lubronzhan avatar Apr 12 '24 00:04 lubronzhan

I think we're also running into this issue on RKE2 with CPI version 1.28.0, and we've had it for a while now. I wasn't able to find a cause before, but this makes sense, since we are deleting and then creating a new node with the same name (changing the underlying VM template).

If you run kubectl get nodes -w, you can actually see the new node appear for a brief moment before the CPI deletes it.

braunsonm avatar Apr 15 '24 18:04 braunsonm

I will spend some time taking a look at this issue this week.

/assign

XudongLiuHarold avatar Apr 22 '24 06:04 XudongLiuHarold

Glad to hear @XudongLiuHarold - Did you have a chance to take a look this week?

braunsonm avatar Apr 25 '24 21:04 braunsonm

Glad to hear @XudongLiuHarold - Did you have a chance to take a look this week?

I got occupied by company business for most of the time, but I took a look at this issue today.

Hi @lubronzhan, I saw your comment here:

	// in case of a race condition that node with same name create happens before delete event,
	// delete the node based on uuid
	name := nm.getNodeNameByUUID(uuid)
	if name != "" {
		delete(nm.nodeNameMap, name)
	} else {
		klog.V(4).Info("node name: ", node.GetName(), " has a different uuid. Delete this node from cache.")
		delete(nm.nodeNameMap, node.GetName())
	}
	delete(nm.nodeUUIDMap, uuid)
	nm.nodeInfoLock.Unlock()

Do you see any side effect if we still delete the node name from the cache map, even if the node name no longer matches the UUID currently being deleted?
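
(For reference, a rough sketch of that unconditional cleanup, reusing the names from the snippet above; illustrative only, not the actual change that was eventually submitted.)

	// Illustrative only: always drop the entry keyed by the node's own name,
	// in addition to the entry keyed by the UUID being removed.
	nm.nodeInfoLock.Lock()
	delete(nm.nodeNameMap, node.GetName())
	delete(nm.nodeUUIDMap, uuid)
	nm.nodeInfoLock.Unlock()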

XudongLiuHarold avatar Apr 26 '24 04:04 XudongLiuHarold

The case in the comment shouldn't happen so I think it's ok to remove the second case.

lubronzhan avatar Apr 26 '24 22:04 lubronzhan

The case in the comment shouldn't happen so I think it's ok to remove the second case.

Gotcha, thanks!

I submitted a fix here: https://github.com/kubernetes/cloud-provider-vsphere/pull/997

XudongLiuHarold avatar Apr 28 '24 04:04 XudongLiuHarold