cluster-api-provider-vsphere
Bootstrap failure detection
/kind feature
Describe the solution you'd like:
If bootstrapping doesn't succeed, set VSphereMachine.Status.FailureReason and FailureMessage to indicate there was an error.
Anything else you would like to add:
More information is available in https://github.com/kubernetes-sigs/cluster-api/issues/2554.
We may need to amend the bootstrap provider contract to require bootstrap providers to write a sentinel file to a specific location on success, because not all of them will necessarily use cloud-init, so there is no consistent means of checking success or failure today. I will probably write up a separate proposal for that.
We will need a way for the CAPV controller to determine the bootstrap status of a given VSphereMachine. Does vCenter have any service the bootstrap logic in the VM could use to signal bootstrap success? Or perhaps some way to label or tag the actual VM? Either approach would need to be properly secured.
/assign
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
This may be doable using VM Guest Tools and govmomi, and could use the bootstrap sentinel feature that CAPZ implements, or maybe write something to the node status.
/help
@sbueringer: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Slightly more detail.
One way to do this is roughly:
- If bootstrap failed, write some information to guestinfo, for example:
  vmtoolsd --cmd "info-set guestinfo.capv.bootstrap failed"
- If the CAPV controller finds this information in guestinfo, it can update the VSphereVM accordingly (details TBD, but probably by setting failureReason/failureMessage)
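The controller-side half of the steps above could be sketched as the pure function below. The key name guestinfo.capv.bootstrap and the "failed" value are the example from this comment; the reason string and the mapping logic are assumptions for illustration, not an agreed API:

```go
package main

import (
	"fmt"
	"strings"
)

// bootstrapFailure interprets the hypothetical guestinfo.capv.bootstrap
// value written by the guest (see the vmtoolsd example above). It returns
// a failureReason/failureMessage pair the controller could copy onto the
// VSphereVM status, and failed=false when the value does not indicate a
// failure (empty, i.e. never set, or "success").
func bootstrapFailure(guestinfoValue string) (reason, message string, failed bool) {
	value := strings.TrimSpace(guestinfoValue)
	if value == "" || value == "success" {
		return "", "", false
	}
	// Anything else is treated as a failure report; the raw value becomes
	// part of the message (e.g. "failed" or a short error string).
	return "BootstrapFailed", fmt.Sprintf("bootstrap reported: %s", value), true
}

func main() {
	reason, message, failed := bootstrapFailure("failed")
	fmt.Println(failed, reason, message) // true BootstrapFailed bootstrap reported: failed
}
```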
The tricky part is probably: if this triggers VM re-creation via MachineHealthCheck, how do we ensure we don't re-create VMs too quickly?
Not sure who mentioned it, but IIRC the OpenShift fork of CAPV has an MHC feature for something like this (@randomvariable @rikatz, maybe one of you? I can't find the link anymore).
Not sure it was me. I think I was looking at https://github.com/openshift/enhancements/pull/673/files
That was definitely the link I got, thx!