cluster-api-provider-vsphere
NodeLabeling feature uses Machine name instead of node name
/kind bug
What steps did you take and what happened: With the NodeLabeling feature turned on, the CAPV controller is unable to label the managed cluster's nodes and emits errors similar to the following:
```
E1102 23:05:13.351244 1 node_controller.go:157] "capv-controller-manager/node-label-controller/cluster-name/cluster-name-6jtlt: unable to get node object" err="nodes \"cluster-name-6jtlt\" not found" cluster="cluster-name" machine="cluster-name-6jtlt" node="cluster-name-6jtlt"
E1102 23:05:13.351427 1 controller.go:326] "Reconciler error" err="nodes \"cluster-name-6jtlt\" not found" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" machine="cluster-name/cluster-name-6jtlt" namespace="cluster-name" name="cluster-name-6jtlt" reconcileID=a6d8a900-f72e-4a44-b388-216d8a191b6b
```
What did you expect to happen: Labeling of the managed cluster's nodes should work correctly even if the Machine name and the Node name differ.
Anything else you would like to add: Looking at the machines, the issue appears to be that we attach a suffix to the hostname when the nodes register with DNS, so the Node name and the Machine name are not equal, and the controller uses the Machine name instead of the Node name when trying to find and label nodes (a sketch of the failing lookup follows the listings below).
```
NAME                                CLUSTER        NODENAME                                        PROVIDERID                                       PHASE     AGE   VERSION
cluster-name-6jtlt                  cluster-name   cluster-name-6jtlt.testnetwork                  vsphere://4216a7bd-43d1-926d-8565-cfef63a62a16   Running   89m   v1.23.5
cluster-name-md-0-b994bb558-f6k9k   cluster-name   cluster-name-md-0-b994bb558-f6k9k.testnetwork   vsphere://42164462-a671-ba3c-132c-84cd14a7acaf   Running   95m   v1.23.5
```
On the managed cluster:
```
NAME                                            STATUS   ROLES                  AGE   VERSION
cluster-name-6jtlt.testnetwork                  Ready    control-plane,master   90m   v1.23.5
cluster-name-md-0-b994bb558-f6k9k.testnetwork   Ready    <none>                 92m   v1.23.5
```
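For illustration only, here is a minimal, hypothetical sketch of the failing lookup pattern, assuming a controller-runtime client against the workload cluster (function and variable names are made up, not the actual CAPV code):

```go
package nodelabel

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// labelNodeByMachineName illustrates the problematic assumption: it treats the
// Machine name as the Node name. When the kubelet registers the Node with a
// DNS suffix (e.g. "cluster-name-6jtlt.testnetwork"), the Get returns
// NotFound, matching the errors above.
func labelNodeByMachineName(ctx context.Context, workloadClient client.Client, machine *clusterv1.Machine, labels map[string]string) error {
	node := &corev1.Node{}
	if err := workloadClient.Get(ctx, client.ObjectKey{Name: machine.Name}, node); err != nil {
		return err // e.g. nodes "cluster-name-6jtlt" not found
	}
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	for k, v := range labels {
		node.Labels[k] = v
	}
	return workloadClient.Update(ctx, node)
}
```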
Environment:
- Cluster-api-provider-vsphere version: 1.4.1
- Kubernetes version (use kubectl version): 1.23.5
- OS (e.g. from /etc/os-release): Ubuntu 20.04
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The way this has been implemented carries an inherent assumption that the name of the CAPI Machine and the name of the Node are the same. Instead, we should query the status of the Machine object and get the name of the Node once it becomes available in the status.
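A minimal sketch of that approach, assuming controller-runtime and the CAPI v1beta1 types (names here are illustrative, not the actual CAPV implementation): wait until Machine.Status.NodeRef is set and use its name for the lookup instead of the Machine name.

```go
package nodelabel

import (
	"context"
	"errors"

	corev1 "k8s.io/api/core/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// labelNodeForMachine resolves the Node via the Machine's status instead of
// assuming the Node name equals the Machine name.
func labelNodeForMachine(ctx context.Context, workloadClient client.Client, machine *clusterv1.Machine, labels map[string]string) error {
	if machine.Status.NodeRef == nil {
		// The kubelet has not registered the Node yet; the caller should
		// requeue and retry later.
		return errors.New("node not registered yet for machine " + machine.Name)
	}
	node := &corev1.Node{}
	if err := workloadClient.Get(ctx, client.ObjectKey{Name: machine.Status.NodeRef.Name}, node); err != nil {
		return err
	}
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	for k, v := range labels {
		node.Labels[k] = v
	}
	return workloadClient.Update(ctx, node)
}
```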
Thanks for raising the issue, I will prioritize this one and work on the fix.
/remove-lifecycle stale
/lifecycle active
/help
@srm09: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hey, do you need help with this? I solved this for our use case and was wondering if I could do it here too. We had an issue where the domain was added to the name of the node, so our custom metadata propagation from the Machine to the Node no longer worked.
Just a note: I'm not sure, but this might be resolved now, as CAPI v1.7 just uses the core CAPI node labeling feature, which shouldn't depend on the Machine and Node names being the same. But this needs verification.
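If that is the case, label propagation would go through the Machine (or MachineDeployment template) metadata rather than a CAPV-specific lookup by name. A rough sketch, assuming core CAPI's label sync domains (node-role.kubernetes.io, node-restriction.kubernetes.io, node.cluster.x-k8s.io); the label key and value below are illustrative:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: cluster-name-md-0
  namespace: cluster-name
spec:
  # ... other MachineDeployment fields ...
  template:
    metadata:
      labels:
        # Labels in the synced domains are propagated from the Machine to the
        # corresponding Node by core CAPI, regardless of the Node's DNS suffix.
        node.cluster.x-k8s.io/example-label: example-value
```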