freebsd-wifibox
freebsd-wifibox copied to clipboard
Correctly terminate bhyve VM using kill SIGTERM.
Previously, a kill -SIGTERM would be requested of the daemon, which would forward the signal to /usr/local/sbin/wifibox. /usr/local/sbin/wifibox would not forward the signal to bhyve, which requires the signal to shutdown cleanly.
Also explicitly unlink the tap interface from the wifibox interface, and wait 5 seconds instead of 3 for the VM to shutdown.
Also consistently use subshells.
There is a problem remaining with the patch. vm_manager() calls destroy_vm() always after bhyvectl is started, and possibly destroy_nmdm() and destroy_bridge()
The "problem" is that vm_stop() also calls destroy_vm() unconditionally, and most likely calls destroy_nmdm() and destroy_bridge() too. That was why I had originally used exit in the _kill_child function.
When this happens, various debug errors are exhibited on a normal setup. Namely:
${BHYVECTL} --destroy --vm=${WIFIBOX_VM} 2>&1 | capture_output info bhyvectlwill most likely always be printed to the log,WARNING: No linked tap inteface found for wifibox0will also most likely always be printed whenservice wifibox stopis run.
But, I'm not sure how to best solve this issue. Here's wifibox.log:
2024-08-28T14:16:10+0200 DEBUG Program started as /usr/local/sbin/wifibox, with arguments: stop
2024-08-28T14:16:10+0200 INFO Begin: wifibox stop
2024-08-28T14:16:10+0200 DEBUG stop=GN
2024-08-28T14:16:10+0200 INFO Stopping guest wifibox, managed by PID 8121
2024-08-28T14:16:10+0200 DEBUG SIGTERM received for PID 8121, killing remaining jobs
2024-08-28T14:16:10+0200 INFO Waiting 3 seconds for the guest to stop
2024-08-28T14:16:10+0200 DEBUG job_pid=99306
2024-08-28T14:16:13+0200 INFO Destroying guest wifibox
2024-08-28T14:16:15+0200 INFO Destroying bhyve PPT device: pci3:0:0
2024-08-28T14:16:15+0200 INFO VM manager: guest was powered off, signaling exit
2024-08-28T14:16:15+0200 INFO Removing null-modem devices
2024-08-28T14:16:15+0200 INFO Null-modem devices are removed
2024-08-28T14:16:15+0200 INFO Unlinking tap interface from wifibox0: tap0
2024-08-28T14:16:15+0200 INFO Destroying linked tap interface: tap0
2024-08-28T14:16:15+0200 INFO Destroying bridge interface: wifibox0
2024-08-28T14:16:15+0200 INFO VM manager: quit daemonization
2024-08-28T14:16:15+0200 INFO Forcing shutdown of guest wifibox
2024-08-28T14:16:15+0200 DEBUG [bhyvectl] vm_open: wifibox could not be opened: No such file or directory
2024-08-28T14:16:15+0200 INFO Destroying guest wifibox
2024-08-28T14:16:15+0200 INFO [bhyvectl] vm_open: wifibox could not be opened: No such file or directory
2024-08-28T14:16:16+0200 INFO Destroying bhyve PPT device: pci3:0:0
2024-08-28T14:16:16+0200 INFO Removing null-modem devices
2024-08-28T14:16:16+0200 INFO Null-modem devices are removed
2024-08-28T14:16:16+0200 WARN No linked tap inteface found for wifibox0
2024-08-28T14:16:16+0200 INFO Destroying bridge interface: wifibox0
2024-08-28T14:16:16+0200 DEBUG [ifconfig] ifconfig: interface wifibox0 does not exist
2024-08-28T14:16:16+0200 INFO End: wifibox stop
There is a problem remaining with the patch.
vm_manager()callsdestroy_vm()always after bhyvectl is started, and possiblydestroy_nmdm()anddestroy_bridge()
Yes, I have noticed the same issue while experimenting with your patch and my suggested addition about the adaptive wait for forced shutdown. I think I already have a solution for that. First of all, destroy_nmdm() has just been removed by #105 so that is good. For destroy_bridge(), what worked for me is to remove its invocation from vm_manager(), right before calling quit_daemonization().
For
destroy_bridge(), what worked for me is to remove its invocation fromvm_manager(), right before callingquit_daemonization().
Just to confirm: are you satisfied that there is no condition that the VM manager exits somehow without restarting, and we don't want to call destroy_bridge()? For example if the VM is powered off or is halted for some reason, destroy_bridge() may not be called unless the service is manually stopped/restarted.
Personally, I think it's fine to remove the call. The VM manager does not create the bridge, so it shouldn't destroy it either.
Just to confirm: are you satisfied that there is no condition that the VM manager exits somehow without restarting, and we don't want to call
destroy_bridge()? For example if the VM is powered off or is halted for some reason,destroy_bridge()may not be called unless the service is manually stopped/restarted.
The purpose of the VM manager is to keep the VM running until it is stopped in the expected way. This can cover unexpected crashes or even deliberate reboot issued from the console. Throughout all this time, the network interface has to be available as well. I think placing the destroy_bridge() call there was the result of the bug that this PR is fixing -- I did not notice that the ACPI power-off does not happen for the VM and everything was forcefully removed at the end.
If the VM manager is killed directly from somewhere else in the system, the VM would leave the interface behind. But wifibox can notice the existing interface upon the next start or the leftover resource can be cleaned up separately directly.
Personally, I think it's fine to remove the call. The VM manager does not create the bridge, so it shouldn't destroy it either.
I agree!
I have got some warnings from ShellCheck (by running the make shellcheck command):
In sbin/wifibox line 727:
trap _kill_child SIGTERM
^-----^ SC3048 (warning): In POSIX sh, prefixing signal names with 'SIG' is undefined.
In sbin/wifibox line 738:
trap - SIGTERM
^-----^ SC3048 (warning): In POSIX sh, prefixing signal names with 'SIG' is undefined.
For more information:
https://www.shellcheck.net/wiki/SC3048 -- In POSIX sh, prefixing signal nam...
I've fixed the Shellcheck warnings, and fixed three calls to kill -SIGTERM as well. There was already one kill -TERM, fyi.
Also, today I realized there is a more elegant solution to solving this bug than creating the (local) whole function for it. This patch provides the same functionality:
--- wifibox.old 2024-09-10 04:58:25.754894000 +0200
+++ wifibox.new 2024-09-10 05:04:12.978878000 +0200
@@ -770,7 +770,7 @@
log info "Stopping guest ${WIFIBOX_VM}, managed by PID ${_pid}"
- if ! (${KILL} -SIGTERM "${_pid}" 2>&1 | capture_output debug kill); then
+ if ! (${KILL} -SIGTERM -"${_pid}" 2>&1 | capture_output debug kill); then
log warn "Guest could not be stopped gracefully"
fi
By killing the pgid (which should? be the same as the pid of daemon, we can kill daemon, bhyve, and sh all at the same time:
$ ps ax -o pid,pgid,comm | grep 23442
23442 23442 daemon
24815 23442 sh
87148 23442 bhyve
87640 23442 sh
Up to you how you'd like to proceed, @pgj. I'd be fine keeping this patch as-is, or reverting all changes in the vm_manager() function except for the removal of the call to destroy_bridge and the fixed indent for done.
I've fixed the Shellcheck warnings, and fixed three calls to
kill -SIGTERMas well. There was already onekill -TERM, fyi.
Thanks! Curiously, ShellCheck did not give a warning for other instances :shrug:
[T]oday I realized there is a more elegant solution to solving this bug than creating the (local) whole function for it.
I like this solution! Let us do that, and I can then test it locally.
That should do it. Please test.
The approach works nicely. But there is a new ShellCheck warning:
In sbin/wifibox line 748:
_gpid="$(${PS} -o pgid -p $_pid | ${TAIL} -1)"
^---^ SC2086 (info): Double quote to prevent globbing and word splitting.
Did you mean:
_gpid="$(${PS} -o pgid -p "$_pid" | ${TAIL} -1)"
For more information:
https://www.shellcheck.net/wiki/SC2086 -- Double quote to prevent globbing ...
Great! With this patch and all of the others applied, Wifibox now works 100% on my system. Feel free to add it to the supported list.
CPU: Core i7-4870HQ CPU @ 2.50GHz NIC: BCM43602 802.11ac Wireless LAN SoC Model: Apple MacBookPro11,4
RECOVER_SUSPEND_VMM is required for suspend/resume to work. Tested on 13.3-release.