potential fallout when flux is updated before it is shut down
During a recent Flux upgrade, new packages were installed while Flux was running. These packages removed rc3 as part of the modprobe transition, which later prevented an orderly shutdown when Flux was stopped.
In general, upgrading Flux without first shutting it down properly seems likely to cause problems. This case was particularly bad because of the rc3 changes (and there will be a similar issue for the 0.78.0->0.79.0 transition), but other upgrades could introduce subtler issues: for instance, minor behavior or protocol changes that are expected to be consistent within a version, or assumptions established during rc1 that no longer hold by rc3.
For the rc1/rc3 consistency issues, one idea would be to store the current rc3 configuration in the KVS during rc1. This would protect the instance from rc3 changes occurring due to an upgrade.
I'm not sure how to solve the other issues in general though.
I may be forgetting, but isn't there a way to have an RPM stop a service before updating it and start it again afterward?
Ah, Fedora has convenience scriptlets for this: https://docs.fedoraproject.org/en-US/packaging-guidelines/Scriptlets/#_systemd
Good thought!
Ok, the following preun scriptlet has been added to the flux-core RPM:
%preun
# Stop the flux service on both removal (via the systemd_preun() macro)
# and upgrade (via systemctl directly). This prevents errors when stopping
# flux after files have been replaced due to an upgrade or removed due
# to uninstall.
#
%systemd_preun flux.service
if [ $1 -eq 1 ]; then
    /usr/bin/systemctl stop flux.service >/dev/null 2>&1 || :
fi
This should stop and disable the flux service on an uninstall before the package is removed, and only stop the service on upgrade.
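For anyone unfamiliar with the `$1` convention, here is a minimal sketch of how the argument drives the scriptlet above. RPM passes `%preun` the number of instances of the package that will remain after the transaction: 0 on removal, 1 on upgrade. The function name `preun_action` and the strings it prints are hypothetical, for illustration only; it reports which action would be taken rather than calling systemctl.

```shell
#!/bin/sh
# Hypothetical helper (not part of the flux-core spec file): map the
# %preun argument to the action the scriptlet above would take.
# $1 = number of package instances remaining after the transaction.
preun_action() {
    if [ "$1" -eq 0 ]; then
        # Removal: the %systemd_preun macro disables and stops the unit
        echo "disable-and-stop"
    elif [ "$1" -eq 1 ]; then
        # Upgrade: only the explicit "systemctl stop" branch runs
        echo "stop"
    else
        # Any other count: neither branch matches
        echo "none"
    fi
}

preun_action 0
preun_action 1
```

The two calls at the end print the action for a removal and for an upgrade, in that order.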
Great! My only thought: if the stop takes a long time because it is processing a dump, would it be useful to see the systemctl output?
That's a good thought. The invocation above was modeled after the systemd provided RPM macros, e.g.:
%systemd_preun() \
if [ $1 -eq 0 ] ; then \
    # Package removal, not upgrade \
    systemctl --no-reload disable --now %{?*} &>/dev/null || : \
fi \
%{nil}
which strongly implies that no output should be generated. However, in this special case perhaps a different approach is needed.
Aside: TIL the bash redirect extension &>/dev/null is preferred over >/dev/null 2>&1. From bash(1):
There are two formats for redirecting standard output and standard
error:
&>word
and
>&word
Of the two forms, the first is preferred. This is semantically
equivalent to
>word 2>&1
The main reason to suppress errors and output from systemctl stop (which normally seems to be completely silent) is to avoid emitting errors into the rpm or dnf output when the unit is already stopped or disabled. In light of that, how about this version:
%preun
# Stop the flux service on both removal and upgrade if active
if /usr/bin/systemctl is-active --quiet flux.service; then
    echo "Stopping Flux systemd unit due to upgrade/removal..."
    echo "For progress, check: systemctl status flux"
    /usr/bin/systemctl stop flux.service
fi
This will emit a message to the console only if the Flux service is active. Since systemctl stop produces no output itself, the message directs the admin to check systemctl status flux for progress. Additionally, an error in systemctl stop will now cause the %preun scriptlet to fail and hopefully abort the upgrade operation so that recovery can be attempted.
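To try the logic out without touching a real unit, the same steps can be wrapped in a function, roughly as sketched below. The `SYSTEMCTL` variable and the `stop_flux_if_active` name are assumptions added for this sketch so that a stub can be substituted; the scriptlet itself calls /usr/bin/systemctl directly.

```shell
#!/bin/sh
# Sketch only: SYSTEMCTL is a hypothetical override so the logic can be
# exercised with a stub instead of the real systemctl.
SYSTEMCTL=${SYSTEMCTL:-/usr/bin/systemctl}

# Hypothetical wrapper around the %preun logic above.
stop_flux_if_active() {
    # "is-active --quiet" exits 0 when the unit is active, non-zero otherwise
    if "$SYSTEMCTL" is-active --quiet flux.service; then
        echo "Stopping Flux systemd unit due to upgrade/removal..."
        echo "For progress, check: systemctl status flux"
        # Propagate a failed stop to the caller, as the scriptlet would
        "$SYSTEMCTL" stop flux.service || return $?
    fi
    return 0
}
```

Pointing `SYSTEMCTL` at a stub that reports the unit as inactive should produce no output at all, while a stub whose `stop` fails should make the function return non-zero, mirroring the scriptlet failure described above.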
That sounds perfect.