
✨ Shutdown VMs before Deletion

Open shaardie opened this issue 1 month ago • 17 comments

Tries to shut down the OpenStack VM before deleting it. This way, even Pods from DaemonSets are shut down more gracefully, and services such as license daemons on the VMs can be shut down properly.

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #1973

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • [x] squashed commits
  • if necessary:
    • [ ] includes documentation
    • [ ] adds unit tests

/hold

shaardie avatar Nov 14 '25 15:11 shaardie

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign vincepri for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Nov 14 '25 15:11 k8s-ci-robot

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

Name Link
Latest commit 47154eaa67acd0d99630f91dcdc57c6c3f6b6b5b
Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-cluster-api-openstack/deploys/691ed07141ecd2000816910a
Deploy Preview https://deploy-preview-2835--kubernetes-sigs-cluster-api-openstack.netlify.app

netlify[bot] avatar Nov 14 '25 15:11 netlify[bot]

Hi @shaardie. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Nov 14 '25 15:11 k8s-ci-robot

/retitle ✨ Shutdown VMs before Deletion

lentzi90 avatar Nov 19 '25 15:11 lentzi90

> /ok-to-test Thinking about unnecessary API calls, should we skip trying to shut it down altogether if the timeout is 0?

Which timeout do you mean exactly? timeoutInstanceDelete seems to be hardcoded to 5min.

shaardie avatar Nov 20 '25 08:11 shaardie

Oh right, I got the 0 from the issue description. But the question is still relevant. I think users should be able to opt out of this, especially since this adds more API calls.

lentzi90 avatar Nov 20 '25 08:11 lentzi90

> Oh right, I got the 0 from the issue description. But the question is still relevant. I think users should be able to opt out of this, especially since this adds more API calls.

So you suggest a new configuration option via CRD?

shaardie avatar Nov 20 '25 09:11 shaardie

Hmm let me gather some second opinions. I want to have more than a gut feeling before we start modifying the CRDs :smile:

lentzi90 avatar Nov 20 '25 11:11 lentzi90

I'm not sure how I feel about failing if the system doesn't shut down. I think it would be better if it tries to shut the VM down for 5 minutes and, if it doesn't shut down, moves on to termination.

Anyways, OpenStack will flip from a graceful to hard shutdown after 60s by default:

https://docs.openstack.org/nova/rocky/configuration/config.html#DEFAULT.graceful_shutdown_timeout

So the 5 minute timeout seems overkill as well, unless something is seriously wrong (or the cloud has that config changed).

mnaser avatar Nov 21 '25 20:11 mnaser

I can also change the PR to continue with deleting the VM instead of failing after the period of time.

For me personally, 60s would also be okay as a timeout, but I can think of situations where this is a little short, for example if there are custom mounts of nfs, cifs, gpfs, or similar. Those can easily take more than 60s to shut down.

Maybe you should first decide whether you want this value to be configurable via the CRD?

shaardie avatar Nov 25 '25 10:11 shaardie

I have checked with my downstream and they do not have any concerns with the feature (always enabled).

However, it sounds like there are quite a few ways to do this, and people will want different things. Some do not care about the shutdown and definitely want to force it or just delete straight away. Some want to make sure everything is properly shut down and would rather get an error than a forced shutdown. And some will want a different timeout.

So how should we do this? I can see it working with either a flag or CRD field(s).

Then we have one more thing to consider. We want to make use of ORC for managing the servers. See https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/2814 for more details. Hopefully we can get this done sooner rather than later, which means that this feature would make more sense to implement in ORC directly. Otherwise we will end up having to migrate it later.

lentzi90 avatar Nov 25 '25 10:11 lentzi90

I am not quite sure what you want me to do honestly. I would be happy to change stuff on this PR, if you tell me what you want to have.

If you want to migrate to your new setup first, I would probably use my patched version for now and see whether I rewrite the whole thing once your migration to ORC is done.

shaardie avatar Nov 25 '25 13:11 shaardie

> So how should we do this?

Human interaction analogy

Let's go through the scenarios where Alice (User) wants Bob (CAPO) to delete a VM in a "regular" talk-to-your-colleague kind of interaction:

  • If Alice tells Bob "Please shut down this VM for me, I need one less now" and gives no further details, Bob will have to go with what the best practice is, and will perform an ordinary shutdown, giving the OS some time to properly terminate processes.
    • If the normal shutdown does not proceed as planned, a human operator would typically ask Alice to confirm whether she agrees to a forced shutdown.
    • An automated process cannot do this, so it has to do what minimises the deviation from the desired state (VM off, no improperly shut down processes causing trouble) while maximising the velocity of reaching the desired state. We know that we cannot maximise both, but (at least from my perspective) many things can go wrong when we immediately shut everything off without proper cleanup, while the repercussions of a 60-second delay before we get the big stick seem far less stark in comparison. So any delay should be better than no delay in the majority of cases.
  • If Alice knows that the VM will take a long time to shut down normally, she should tell Bob, so that he is not surprised.
  • If Alice wants Bob to immediately pull the plug on the machine instead of performing the usual shutdown, she should tell him this beforehand, because she cannot expect this to be his regular modus operandi.

In all of the above cases, Alice should provide Bob with the information he needs to proceed in an ideal way. To me, this hints in the direction of Alice (User) providing this information to Bob (CAPO) beforehand, in a way Bob understands (CRD field). If Bob tries to have one solution that applies to all possible use cases (Configuration Flag) he might get some cases wrong, in which Alice has different requirements.

You might also have the case that you have one CAPO instance managing VMs that you want to be deleted immediately as well as VMs that you want to give time for an orderly shutdown. That would also make the CRD field approach more desirable.

Which value should the feature use by default

In my opinion, the Venn diagram representing the group of people for whom one minute of additional VM runtime would be more than even a minor inconvenience (which could then be fixed easily) and the group of people who would be caught unaware of such a change should have a very small intersection.

Whereas with the way things currently work, the Venn diagram representing the group of people for whom an immediate VM termination would be more than even a minor inconvenience (which may or may not be easily remedied) and the group of people who might be bitten by this in the future probably has a larger intersection (in my opinion).

So I think under those conditions, there is no need to treat the previous default (which is unusual and can definitely cause headaches) with a lot of reverence. I think 60 seconds before forcing termination (which may then be adjusted for individual VMs with special considerations) is a reasonable default. If people never want their VMs to be force terminated, set it to -1, and if they want them terminated immediately, set it to 0.

But this is just my opinion, just trying to give some input to give you a perspective on the choices you mentioned.

Atomsoldat avatar Nov 25 '25 19:11 Atomsoldat

> I am not quite sure what you want me to do honestly. I would be happy to change stuff on this PR, if you tell me what you want to have.
>
> If you want to migrate to your new setup first, I would probably use my patched version for now and see whether I rewrite the whole thing once your migration to ORC is done.

I am basically saying that I think we need an option to either turn this feature off or to allow more granular configuration of it. I do not have a strong opinion on how to do that so I am leaving it up to you to propose what to do. The issue description already suggests a waitForShutdown field. That sounds reasonable and other people seem to agree also.

If you don't need this urgently, I also suggest looking into ORC first so that we can get an implementation that will work with it. Otherwise we risk breaking this feature later.

lentzi90 avatar Nov 28 '25 06:11 lentzi90

Since we expect a graceful shutdown, should we follow these steps so that an ungraceful shutdown doesn't happen?

  1. Check whether the VM is running.
  2. If so, issue a "stop" or "poweroff" command via the OpenStack API.
  3. Wait for the VM to reach the "SHUTOFF" state.
  4. Delete the instance as usual.

I can see the stop server option, but is it syncing with the deletion of the instance?

smoshiur1237 avatar Dec 01 '25 14:12 smoshiur1237

> I can see the stop server option, but is it syncing with the deletion of the instance?

I am not quite sure what you mean. Currently the VM is deleted without any shutdown. With this PR, the controller first triggers a server stop, waits for it to finish, and then deletes the VM.
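As a rough illustration of that stop, wait, delete order (not the PR's actual code), the decision can be written as a small function that a controller evaluates on every reconcile. `NextAction`, `Action`, and the `stopRequested` flag are hypothetical names; ACTIVE and SHUTOFF are Nova server status values.

```go
package main

import "fmt"

// Action is the next step the controller should take for a server
// that is being deleted. Hypothetical type for this sketch.
type Action string

const (
	ActionStop   Action = "stop"
	ActionWait   Action = "wait"
	ActionDelete Action = "delete"
)

// NextAction picks the next step from the server status and from
// whether a stop has already been requested for this server.
func NextAction(status string, stopRequested bool) Action {
	switch {
	case status == "SHUTOFF":
		// Shutdown finished: safe to delete.
		return ActionDelete
	case status == "ACTIVE" && !stopRequested:
		// Running and not yet asked to stop: request the stop.
		return ActionStop
	case status == "ACTIVE" && stopRequested:
		// Stop is in progress: check again later.
		return ActionWait
	default:
		// Any other status (for example ERROR): there is nothing to
		// shut down gracefully, so delete directly.
		return ActionDelete
	}
}

func main() {
	fmt.Println(NextAction("ACTIVE", false))
	fmt.Println(NextAction("ACTIVE", true))
	fmt.Println(NextAction("SHUTOFF", true))
}
```

A timeout check would sit on top of this, turning a long-running "wait" into either a delete or an error depending on what the user configured.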

shaardie avatar Dec 15 '25 10:12 shaardie

> I am basically saying that I think we need an option to either turn this feature off or to allow more granular configuration of it. I do not have a strong opinion on how to do that so I am leaving it up to you to propose what to do. The issue description already suggests a waitForShutdown field. That sounds reasonable and other people seem to agree also.
>
> If you don't need this urgently, I also suggest looking into ORC first so that we can get an implementation that will work with it. Otherwise we risk breaking this feature later.

Okay, so for myself, I now have a running version with shutdown, so I do not need this ASAP. Since you are currently migrating to ORC, I will wait for that transition to happen and then update this PR with a version that includes a configuration option.

shaardie avatar Dec 15 '25 10:12 shaardie