cluster-api-provider-vsphere NodeLabeling feature uses Machine name instead of node name

/kind bug

What steps did you take and what happened: With the NodeLabeling feature turned on, the CAPV controller is unable to label the managed cluster's nodes and emits errors similar to the following:

E1102 23:05:13.351244       1 node_controller.go:157] "capv-controller-manager/node-label-controller/cluster-name/cluster-name-6jtlt: unable to get node object" err="nodes \"cluster-name-6jtlt\" not found" cluster="cluster-name" machine="cluster-name-6jtlt" node="cluster-name-6jtlt"
E1102 23:05:13.351427       1 controller.go:326] "Reconciler error" err="nodes \"cluster-name-6jtlt\" not found" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" machine="cluster-name/cluster-name-6jtlt" namespace="cluster-name" name="cluster-name-6jtlt" reconcileID=a6d8a900-f72e-4a44-b388-216d8a191b6b

What did you expect to happen: Labeling of the managed cluster's nodes should work correctly even if the Machine name and the node name differ.

Anything else you would like to add: Looking at the machines, the issue seems to be that we attach a suffix to the hostname when the nodes register with DNS, so the node name and the Machine name are not equal. The controller uses the Machine name instead of the node name when trying to find and label nodes.

NAME                                 CLUSTER         NODENAME                                         PROVIDERID                                       PHASE     AGE   VERSION
cluster-name-6jtlt                   cluster-name    cluster-name-6jtlt.testnetwork                   vsphere://4216a7bd-43d1-926d-8565-cfef63a62a16   Running   89m   v1.23.5
cluster-name-md-0-b994bb558-f6k9k    cluster-name    cluster-name-md-0-b994bb558-f6k9k.testnetwork    vsphere://42164462-a671-ba3c-132c-84cd14a7acaf   Running   95m   v1.23.5

On the managed cluster:

NAME                                             STATUS   ROLES                  AGE   VERSION
cluster-name-6jtlt.testnetwork                   Ready    control-plane,master   90m   v1.23.5
cluster-name-md-0-b994bb558-f6k9k.testnetwork    Ready    <none>                 92m   v1.23.5

Environment:

Cluster-api-provider-vsphere version: 1.4.1
Kubernetes version: (use kubectl version): 1.23.5
OS (e.g. from /etc/os-release): ubuntu 20.04

Nov 02 '22 23:11 tommasopozzetti

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Feb 01 '23 00:02 k8s-triage-robot

The current implementation inherently assumes that the name of the CAPI Machine and the name of the node are the same. Instead, we should query the status of the Machine object and get the name of the node once it becomes available in the status. Thanks for raising the issue, I will prioritize this one and work on the fix.

/remove-lifecycle stale
/lifecycle active
/help

Feb 16 '23 18:02 srm09

@srm09: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

Why are we solving this issue?
To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
Does this issue have zero to low barrier of entry?
How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

The current implementation inherently assumes that the name of the CAPI Machine and the name of the node are the same. Instead, we should query the status of the Machine object and get the name of the node once it becomes available in the status. Thanks for raising the issue, I will prioritize this one and work on the fix.

/remove-lifecycle stale
/lifecycle active
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Feb 16 '23 18:02 k8s-ci-robot

Hey, do you need help with this? I solved this for our use case and was wondering if I could contribute the fix here too. We had an issue where the domain was added to the name of the node, so our custom metadata propagation from the Machine to the node no longer worked.

May 16 '23 07:05 RnkeZ

Just a note: I'm not sure, but this might be resolved now, as CAPV v1.7 just uses the core CAPI node labeling feature, which shouldn't depend on the Machine and Node names being the same.

But this needs verification.

Jul 25 '23 11:07 sbueringer
