cloud-provider-vsphere

CCM failed to clean up cache when deleting old nodes, causing new nodes to be unexpectedly deleted.

chenguo-github opened this issue • 15 comments

What happened?

  • CCM failed to clean up nodeNameMap when deleting old nodes, causing new nodes to be unexpectedly deleted.

  • The fix might be:

    Option 1: Clean up the cache using the node name directly. More specifically, change this part to:

	// in case of a race condition that node with same name create happens before delete event,
	// delete the node based on uuid
	name := nm.getNodeNameByUUID(uuid)
	if name != "" {
		delete(nm.nodeNameMap, name)
	} else {
		klog.V(4).Info("node name: ", node.GetName(), " has a different uuid. Deleting this node from cache using node name directly.")
		delete(nm.nodeNameMap, node.GetName())
	}
	delete(nm.nodeUUIDMap, uuid)
	nm.nodeInfoLock.Unlock()

    Option 2: Look up the node name using node.UID instead of node.Status.NodeInfo.SystemUUID (sketched below).
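
A rough sketch of what Option 2 could look like is below. The type and field names (NodeManager, nodeNameMap, nodeUIDMap, nodeInfoLock) are simplified stand-ins inferred from the snippet above, not the actual nodemanager.go code:

	package nodemanager

	import (
		"sync"

		v1 "k8s.io/api/core/v1"
	)

	// NodeManager is a hypothetical, simplified stand-in for the real node manager.
	type NodeManager struct {
		nodeInfoLock sync.Mutex
		nodeNameMap  map[string]string // node name -> providerID (simplified)
		nodeUIDMap   map[string]string // object UID -> node name (simplified)
	}

	// removeNode sketches Option 2: key the reverse lookup on the immutable object
	// UID (node.UID) rather than node.Status.NodeInfo.SystemUUID, which kubelet
	// can rewrite after the underlying VM is recreated.
	func (nm *NodeManager) removeNode(node *v1.Node) {
		nm.nodeInfoLock.Lock()
		defer nm.nodeInfoLock.Unlock()

		// The object UID never changes for a given node object, so the entries
		// written at registration time can always be found and removed here.
		delete(nm.nodeNameMap, node.GetName())
		delete(nm.nodeUIDMap, string(node.GetUID()))
	}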

What did you expect to happen?

nodeNameMap cleaned up correctly after node deletion.

How can we reproduce it (as minimally and precisely as possible)?

  • VM Recreate

    Old Node Obj not deleted at this time

  • Kubelet Register

    Update node.Status.NodeInfo.SystemUUID in old node obj with a new value

  • CCM Deletes Old Node Obj

    CCM fails to clean up the nodeNameMap entry because the lookup uses the node's new SystemUUID, which is not in the cache.

  • Trigger VM Restart

  • Kubelet Register Again

    A new node object is created at this point; it is expected to be valid and should not be deleted.

  • CCM Unexpectedly Deletes New Node Obj

    CCM got an old providerID from nodeNameMap because the entry was never cleaned up. The old providerID no longer matches any underlying VM, so CCM deleted the new node object unexpectedly (see the sketch below).
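
To make this failure mode concrete, here is a minimal, self-contained sketch; the maps are simplified stand-ins for the CCM caches, the node name is taken from the logs further down, and the providerID value is hypothetical:

	package main

	import "fmt"

	func main() {
		// Simplified stand-in for nm.nodeNameMap (node name -> providerID).
		// This stale entry was left behind because removeNode skipped the cleanup.
		nodeNameMap := map[string]string{
			"81ff17e25ec6-qual-335-1500f723": "vsphere://420aa1b2-old-vm-uuid", // hypothetical providerID
		}

		// "Kubelet Register Again": the new node object reuses the same name, so
		// the lookup hits the stale entry and returns the old VM's providerID.
		staleProviderID := nodeNameMap["81ff17e25ec6-qual-335-1500f723"]

		// "CCM Unexpectedly Deletes New Node Obj": no VM backs the stale providerID
		// any more, so the existence check fails and the node lifecycle controller
		// deletes the perfectly valid new node object.
		vmExists := false // what a vCenter lookup for the stale providerID would report
		if !vmExists {
			fmt.Printf("VM for %s not found; deleting node\n", staleProviderID)
		}
	}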

Anything else we need to know (please consider providing level 4 or above logs of CPI)?

2023-08-23T16:05:54.087418598Z I0823 16:05:54.087346 1 nodemanager.go:115] removeNode NodeName: 81ff17e25ec6-qual-335-1500f723, UID: 420aa1b2-3dc1-24d5-93cd-ee945c337130
2023-08-23T16:05:54.087425877Z I0823 16:05:54.087353 1 nodemanager.go:120] removeNode from UUID and Name cache. NodeName: 81ff17e25ec6-qual-335-1500f723, UID: 420aa1b2-3dc1-24d5-93cd-ee945c337130
2023-08-23T16:05:54.087430218Z I0823 16:05:54.087361 1 nodemanager.go:127] node name: 81ff17e25ec6-qual-335-1500f723 has a different uuid. Skip deleting this node from cache.

Kubernetes version

1.26

Cloud provider or hardware configuration

  • args:
  • --cloud-provider=vsphere
  • --v=4
  • --authentication-kubeconfig=/etc/kubernetes/admin.conf
  • --authorization-kubeconfig=/etc/kubernetes/admin.conf
  • --cloud-config=/etc/kubernetes/cloud-config/vsphere.conf
  • --kubeconfig=/etc/kubernetes/admin.conf

OS version


Kernel (e.g. uname -a)

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

Others

chenguo-github avatar Sep 07 '23 01:09 chenguo-github

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 27 '24 20:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 26 '24 21:02 k8s-triage-robot

Friendly ping

chenguo-github avatar Feb 29 '24 17:02 chenguo-github

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 30 '24 18:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 30 '24 18:03 k8s-ci-robot

/cc @xing-yang Do you know anyone who could take a look at this issue? Thanks!

jingxu97 avatar Apr 04 '24 23:04 jingxu97

cc @lubronzhan

xing-yang avatar Apr 11 '24 15:04 xing-yang

cc @XudongLiuHarold could you help take a look?

lubronzhan avatar Apr 12 '24 00:04 lubronzhan

And @jingxu97, do you observe the issue with the same CPI version?

lubronzhan avatar Apr 12 '24 00:04 lubronzhan

I think we're also running into this issue on RKE2 with CPI version 1.28.0, and we've had it for a while now. I wasn't able to find a cause before, but this makes sense, since we are deleting and then creating a new node with the same name (changing the underlying VM template).

If you run kubectl get nodes -w, you can actually see the new node appear for a brief moment before the CPI deletes it.

braunsonm avatar Apr 15 '24 18:04 braunsonm

I will spend some time taking a look at this issue this week.

/assign

XudongLiuHarold avatar Apr 22 '24 06:04 XudongLiuHarold

Glad to hear @XudongLiuHarold - Did you have a chance to take a look this week?

braunsonm avatar Apr 25 '24 21:04 braunsonm

Glad to hear @XudongLiuHarold - Did you have a chance to take a look this week?

I got occupied by company business for most of the time, but I took a look at this issue today.

Hi @lubronzhan, I saw your comment here:

	// in case of a race condition that node with same name create happens before delete event,
	// delete the node based on uuid
	name := nm.getNodeNameByUUID(uuid)
	if name != "" {
		delete(nm.nodeNameMap, name)
	} else {
		klog.V(4).Info("node name: ", node.GetName(), " has a different uuid. Delete this node from cache.")
		delete(nm.nodeNameMap, node.GetName())
	}
	delete(nm.nodeUUIDMap, uuid)
	nm.nodeInfoLock.Unlock()

Do you see any side effect if we still delete the node name from the cache map, even if the node name no longer matches the UUID currently being deleted?
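
(For reference, a rough sketch of that unconditional cleanup, reusing the names from the snippet above; illustrative only, not the actual change that was eventually submitted.)

	// Illustrative only: always drop the entry keyed by the node's own name,
	// in addition to the entry keyed by the UUID being removed.
	nm.nodeInfoLock.Lock()
	delete(nm.nodeNameMap, node.GetName())
	delete(nm.nodeUUIDMap, uuid)
	nm.nodeInfoLock.Unlock()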

XudongLiuHarold avatar Apr 26 '24 04:04 XudongLiuHarold

The case in the comment shouldn't happen so I think it's ok to remove the second case.

lubronzhan avatar Apr 26 '24 22:04 lubronzhan

The case in the comment shouldn't happen so I think it's ok to remove the second case.

Gotcha, thanks!

I submitted a fix here: https://github.com/kubernetes/cloud-provider-vsphere/pull/997

XudongLiuHarold avatar Apr 28 '24 04:04 XudongLiuHarold