freebsd-wifibox icon indicating copy to clipboard operation
freebsd-wifibox copied to clipboard

Correctly terminate bhyve VM using kill SIGTERM.

Open MegaManSec opened this issue 1 year ago • 10 comments

Previously, a kill -SIGTERM would be requested of the daemon, which would forward the signal to /usr/local/sbin/wifibox. /usr/local/sbin/wifibox would not forward the signal to bhyve, which requires the signal to shutdown cleanly.

Also explicitly unlink the tap interface from the wifibox interface, and wait 5 seconds instead of 3 for the VM to shutdown.

Also consistently use subshells.

MegaManSec avatar Aug 24 '24 14:08 MegaManSec

There is a problem remaining with the patch. vm_manager() calls destroy_vm() always after bhyvectl is started, and possibly destroy_nmdm() and destroy_bridge()

The "problem" is that vm_stop() also calls destroy_vm() unconditionally, and most likely calls destroy_nmdm() and destroy_bridge() too. That was why I had originally used exit in the _kill_child function.

When this happens, various debug errors are exhibited on a normal setup. Namely:

  1. ${BHYVECTL} --destroy --vm=${WIFIBOX_VM} 2>&1 | capture_output info bhyvectl will most likely always be printed to the log,
  2. WARNING: No linked tap inteface found for wifibox0 will also most likely always be printed when service wifibox stop is run.

But, I'm not sure how to best solve this issue. Here's wifibox.log:

2024-08-28T14:16:10+0200 DEBUG Program started as /usr/local/sbin/wifibox, with arguments: stop
2024-08-28T14:16:10+0200 INFO  Begin: wifibox stop
2024-08-28T14:16:10+0200 DEBUG stop=GN
2024-08-28T14:16:10+0200 INFO  Stopping guest wifibox, managed by PID 8121
2024-08-28T14:16:10+0200 DEBUG SIGTERM received for PID 8121, killing remaining jobs
2024-08-28T14:16:10+0200 INFO  Waiting 3 seconds for the guest to stop
2024-08-28T14:16:10+0200 DEBUG job_pid=99306
2024-08-28T14:16:13+0200 INFO  Destroying guest wifibox
2024-08-28T14:16:15+0200 INFO  Destroying bhyve PPT device: pci3:0:0
2024-08-28T14:16:15+0200 INFO  VM manager: guest was powered off, signaling exit
2024-08-28T14:16:15+0200 INFO  Removing null-modem devices
2024-08-28T14:16:15+0200 INFO  Null-modem devices are removed
2024-08-28T14:16:15+0200 INFO  Unlinking tap interface from wifibox0: tap0
2024-08-28T14:16:15+0200 INFO  Destroying linked tap interface: tap0
2024-08-28T14:16:15+0200 INFO  Destroying bridge interface: wifibox0
2024-08-28T14:16:15+0200 INFO  VM manager: quit daemonization
2024-08-28T14:16:15+0200 INFO  Forcing shutdown of guest wifibox
2024-08-28T14:16:15+0200 DEBUG [bhyvectl] vm_open: wifibox could not be opened: No such file or directory
2024-08-28T14:16:15+0200 INFO  Destroying guest wifibox
2024-08-28T14:16:15+0200 INFO  [bhyvectl] vm_open: wifibox could not be opened: No such file or directory
2024-08-28T14:16:16+0200 INFO  Destroying bhyve PPT device: pci3:0:0
2024-08-28T14:16:16+0200 INFO  Removing null-modem devices
2024-08-28T14:16:16+0200 INFO  Null-modem devices are removed
2024-08-28T14:16:16+0200 WARN  No linked tap inteface found for wifibox0
2024-08-28T14:16:16+0200 INFO  Destroying bridge interface: wifibox0
2024-08-28T14:16:16+0200 DEBUG [ifconfig] ifconfig: interface wifibox0 does not exist
2024-08-28T14:16:16+0200 INFO  End: wifibox stop

MegaManSec avatar Aug 28 '24 12:08 MegaManSec

There is a problem remaining with the patch. vm_manager() calls destroy_vm() always after bhyvectl is started, and possibly destroy_nmdm() and destroy_bridge()

Yes, I have noticed the same issue while experimenting with your patch and my suggested addition about the adaptive wait for forced shutdown. I think I already have a solution for that. First of all, destroy_nmdm() has just been removed by #105 so that is good. For destroy_bridge(), what worked for me is to remove its invocation from vm_manager(), right before calling quit_daemonization().

pgj avatar Aug 28 '24 19:08 pgj

For destroy_bridge(), what worked for me is to remove its invocation from vm_manager(), right before calling quit_daemonization().

Just to confirm: are you satisfied that there is no condition that the VM manager exits somehow without restarting, and we don't want to call destroy_bridge()? For example if the VM is powered off or is halted for some reason, destroy_bridge() may not be called unless the service is manually stopped/restarted.

Personally, I think it's fine to remove the call. The VM manager does not create the bridge, so it shouldn't destroy it either.

MegaManSec avatar Sep 01 '24 00:09 MegaManSec

Just to confirm: are you satisfied that there is no condition that the VM manager exits somehow without restarting, and we don't want to call destroy_bridge()? For example if the VM is powered off or is halted for some reason, destroy_bridge() may not be called unless the service is manually stopped/restarted.

The purpose of the VM manager is to keep the VM running until it is stopped in the expected way. This can cover unexpected crashes or even deliberate reboot issued from the console. Throughout all this time, the network interface has to be available as well. I think placing the destroy_bridge() call there was the result of the bug that this PR is fixing -- I did not notice that the ACPI power-off does not happen for the VM and everything was forcefully removed at the end.

If the VM manager is killed directly from somewhere else in the system, the VM would leave the interface behind. But wifibox can notice the existing interface upon the next start or the leftover resource can be cleaned up separately directly.

Personally, I think it's fine to remove the call. The VM manager does not create the bridge, so it shouldn't destroy it either.

I agree!

pgj avatar Sep 03 '24 04:09 pgj

I have got some warnings from ShellCheck (by running the make shellcheck command):

In sbin/wifibox line 727:
    trap _kill_child SIGTERM
                     ^-----^ SC3048 (warning): In POSIX sh, prefixing signal names with 'SIG' is undefined.


In sbin/wifibox line 738:
    trap - SIGTERM
           ^-----^ SC3048 (warning): In POSIX sh, prefixing signal names with 'SIG' is undefined.

For more information:
  https://www.shellcheck.net/wiki/SC3048 -- In POSIX sh, prefixing signal nam...

pgj avatar Sep 08 '24 07:09 pgj

I've fixed the Shellcheck warnings, and fixed three calls to kill -SIGTERM as well. There was already one kill -TERM, fyi.

MegaManSec avatar Sep 10 '24 02:09 MegaManSec

Also, today I realized there is a more elegant solution to solving this bug than creating the (local) whole function for it. This patch provides the same functionality:

--- wifibox.old	2024-09-10 04:58:25.754894000 +0200
+++ wifibox.new	2024-09-10 05:04:12.978878000 +0200
@@ -770,7 +770,7 @@
 
     log info "Stopping guest ${WIFIBOX_VM}, managed by PID ${_pid}"
 
-    if ! (${KILL} -SIGTERM "${_pid}" 2>&1 | capture_output debug kill); then
+    if ! (${KILL} -SIGTERM -"${_pid}" 2>&1 | capture_output debug kill); then
 	log warn "Guest could not be stopped gracefully"
     fi

By killing the pgid (which should? be the same as the pid of daemon, we can kill daemon, bhyve, and sh all at the same time:

$ ps ax -o pid,pgid,comm  | grep 23442
23442 23442 daemon
24815 23442 sh
87148 23442 bhyve
87640 23442 sh

Up to you how you'd like to proceed, @pgj. I'd be fine keeping this patch as-is, or reverting all changes in the vm_manager() function except for the removal of the call to destroy_bridge and the fixed indent for done.

MegaManSec avatar Sep 10 '24 03:09 MegaManSec

I've fixed the Shellcheck warnings, and fixed three calls to kill -SIGTERM as well. There was already one kill -TERM, fyi.

Thanks! Curiously, ShellCheck did not give a warning for other instances :shrug:

pgj avatar Sep 10 '24 05:09 pgj

[T]oday I realized there is a more elegant solution to solving this bug than creating the (local) whole function for it.

I like this solution! Let us do that, and I can then test it locally.

pgj avatar Sep 10 '24 05:09 pgj

That should do it. Please test.

MegaManSec avatar Sep 10 '24 15:09 MegaManSec

The approach works nicely. But there is a new ShellCheck warning:

In sbin/wifibox line 748:
    _gpid="$(${PS} -o pgid -p $_pid | ${TAIL} -1)"
                              ^---^ SC2086 (info): Double quote to prevent globbing and word splitting.

Did you mean: 
    _gpid="$(${PS} -o pgid -p "$_pid" | ${TAIL} -1)"

For more information:
  https://www.shellcheck.net/wiki/SC2086 -- Double quote to prevent globbing ...

pgj avatar Sep 11 '24 04:09 pgj

Great! With this patch and all of the others applied, Wifibox now works 100% on my system. Feel free to add it to the supported list.

CPU: Core i7-4870HQ CPU @ 2.50GHz NIC: BCM43602 802.11ac Wireless LAN SoC Model: Apple MacBookPro11,4

RECOVER_SUSPEND_VMM is required for suspend/resume to work. Tested on 13.3-release.

MegaManSec avatar Sep 15 '24 04:09 MegaManSec