cluster-api-provider-vsphere
Bootstrap failure detection
/kind feature
Describe the solution you'd like:
If bootstrapping doesn't succeed, set VSphereMachine.Status.FailureReason and FailureMessage to indicate there was an error.
Anything else you would like to add:
More information is available in https://github.com/kubernetes-sigs/cluster-api/issues/2554.
We may need to amend the bootstrap provider contract to require bootstrap providers to write a sentinel file to a specific location on success, because not all of them will necessarily use cloud-init, so there is no consistent means of checking success or failure today. I will probably write up a separate proposal for that.
We will need a way for the CAPV controller to determine the bootstrap status of a given VSphereMachine. Does vCenter have any service the bootstrap logic in the VM could use to signal bootstrap success? Or perhaps some way to label or tag the actual VM? Either approach would need to be properly secured.
/assign
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
This may be doable using VM Guest Tools and govmomi, and could use the bootstrap sentinel feature that CAPZ implements, or maybe write something to the node status.
/help
@sbueringer: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Slightly more detail.
One way to do this is roughly:
- If bootstrap failed, write some information to guestinfo, for example:
  vmtoolsd --cmd "info-set guestinfo.capv.bootstrap failed"
- If the CAPV controller finds this information in guestinfo, it can update the VSphereVM accordingly (details TBD, but probably by setting failureReason/failureMessage)
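The controller-side half of the steps above could be sketched as the pure function below. The key name guestinfo.capv.bootstrap and the "failed" value are the example from this comment; the reason string and the mapping logic are assumptions for illustration, not an agreed API:

```go
package main

import (
	"fmt"
	"strings"
)

// bootstrapFailure interprets the hypothetical guestinfo.capv.bootstrap
// value written by the guest (see the vmtoolsd example above). It returns
// a failureReason/failureMessage pair the controller could copy onto the
// VSphereVM status, and failed=false when the value does not indicate a
// failure (empty, i.e. never set, or "success").
func bootstrapFailure(guestinfoValue string) (reason, message string, failed bool) {
	value := strings.TrimSpace(guestinfoValue)
	if value == "" || value == "success" {
		return "", "", false
	}
	// Anything else is treated as a failure report; the raw value becomes
	// part of the message (e.g. "failed" or a short error string).
	return "BootstrapFailed", fmt.Sprintf("bootstrap reported: %s", value), true
}

func main() {
	reason, message, failed := bootstrapFailure("failed")
	fmt.Println(failed, reason, message) // true BootstrapFailed bootstrap reported: failed
}
```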
The tricky part is probably: if this triggers VM re-creation via MachineHealthCheck, how do we ensure we don't re-create VMs too quickly?
Not sure who mentioned it, but IIRC the OpenShift fork of CAPV has an MHC feature for something like this (@randomvariable @rikatz, maybe one of you? I can't find the link anymore).
Not sure it was me. I think I was looking at https://github.com/openshift/enhancements/pull/673/files
That was definitely the link I got, thx!