kubitect icon indicating copy to clipboard operation
kubitect copied to clipboard

reboot nodes after cloudinit

Open bsdlp opened this issue 1 year ago • 6 comments

while debugging #159 i noticed that after the package upgrades that kubitect applies when updateOnBoot is true, ubuntu reports that it needs a reboot to apply those upgrades. this change reboots the nodes using cloudinit after everything is done.

i'd also like to get your thoughts on whether or not this should be configurable / if updateOnBoot should be the one that sets reboot powerstate

bsdlp avatar Jul 10 '23 00:07 bsdlp

I like the idea of rebooting VMs if updateOnBoot is true.

We could simply set in cloud-init:

power_state:
  mode: reboot
  condition: ${update}

However, a reboot can cause an error in remot-exec that waits for cloud-init to finish. So we need to figure out, how to prevent this.

MusicDin avatar Jul 10 '23 18:07 MusicDin

One way of doing it would be to duplicate the remote-exec block within the vm_domain. The first provisioner can be set to ignore the error (on_failure: continue) and the second one should catch it (on_failure: fail).

This way, if the reboot happens, the error (lost connection) will be ignored, and the second provisioner will wait for cloud-init to finish. If cloud-init does not finish, the second provisioner will throw an error.

In such case, the second provisioner should have the timeout set to 1 minute or something like that.


This is just a quick-fix that would probably work, but I would prefer a more reliable approach (if possible).

MusicDin avatar Jul 10 '23 18:07 MusicDin

reading cloud-init documentation, it looks like cloud-init will report that it is done before the reboot is executed. as such, we should block on cloud-init status --wait and that should be sufficient. alternatively we can keep the logfile parsing loop but set a delay on powerstate to something greater than 2s to make sure that the remote-exec doesn't get interrupted by reboot. (i would prefer moving to calling the cloud-init cli)

from power state docs:

Using this module ensures that cloud-init is entirely finished with modules that would be executed.

fwiw i've initialized my nodes probably 100 times over the past week with reboot mode and haven't run into any issues

bsdlp avatar Jul 13 '23 00:07 bsdlp

I would prefer the cloud-init --wait command over the current log-parsing approach, with the addition of redirecting the standard output to /dev/null. Otherwise, cloud-init command outputs a dot every second.

cloud-init --wait > /dev/null

Regarding the reboot, I always run into the following error:

wait: remote command exited without exit status or exit signal

To me, it seems that either the VM is terminated too quickly or cloud-init reports the exit status too late. However, status: done is outputed before the error is raised.

I suggest that we add the second remote provisioner, where the first one will be ignored on error and the second one will connect to the VM once its rebooted. This way we also ensure that VMs are properly started after the reboot. Any thoughts on that?

MusicDin avatar Jul 31 '23 18:07 MusicDin

I think that cloud-init terminates the VM just a bit too fast, which produces the above error occasionally. I've tried setting the reboot delay of 1 second, which passed all of my tests successfully. Setting it to 3 or maybe even 5 seconds should be reliable enough for now:

power_state:
  mode: reboot
  condition: ${update}
  delay: 5

Any thoughts against this approach?

MusicDin avatar Aug 02 '23 09:08 MusicDin

ah - just realized that delay is in minutes - would a delay of 1 minute make more sense?

https://cloudinit.readthedocs.io/en/latest/reference/modules.html#power-state-change

bsdlp avatar Aug 14 '23 08:08 bsdlp