Allow NHC to work with recent changes to Slurm reboot
- Add handling of the `boot` node state to the node online and offline scripts.
- Properly handle `scontrol reboot asap` so that Slurm doesn't erroneously online the node on the first NHC run after a reboot.
- Make sure the node is always offlined after a reboot until NHC onlines it, regardless of whether `scontrol reboot` or `scontrol reboot asap` was run.
- Prevent onlining a node while a reboot is pending.
For context, see https://github.com/mej/nhc/issues/81 and https://bugs.schedmd.com/show_bug.cgi?id=6391
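To make the boot-state handling concrete, here is a minimal sketch (not the actual patch) of the kind of guard the online script gains; the `sinfo` usage is an assumption on my part, but the `boot*` pattern mirrors the one the offline script matches further below:

```bash
# Skip onlining while Slurm still reports the node in a boot state.
STATE="$(sinfo -h -o '%t' -n "$HOSTNAME")"
case "$STATE" in
    boot*)
        echo "$0: Not onlining $HOSTNAME: node is still rebooting."
        exit 0
        ;;
esac
```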
This was a rework of pull request #83. The main differences from #83 are:
- prevent NHC from onlining nodes that are pending reboot
- make sure that a vanilla `scontrol reboot` (i.e., no `asap` modifier) still keeps the node drained after the reboot until NHC sees fit to online it.
I've pushed our internal version to my fork: https://github.com/mej/nhc/pull/65/files. It's a completely different helper, and it's activated by setting `export OFFLINE_NODE="$HELPERDIR/node-mark-reboot"` above the tests that should trigger a reboot. You might find something useful there. I will check your version as well.
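For anyone following along, a minimal sketch of what that activation looks like in an nhc.conf; the specific checks below are made-up placeholders, not the actual contents of the fork:

```bash
# Failures above this line use the stock offline helper (drain only).
* || check_fs_mount_rw -f /scratch

# Swap the helper: failures below this line mark the node for reboot.
* || export OFFLINE_NODE="$HELPERDIR/node-mark-reboot"
* || check_fs_mount_rw -f /gpfs
```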
So I see two separate and distinct steps to address the Slurm reboot handling that's missing from NHC:
- Add handling to ensure that the `boot` state is recognized and handled intelligently in terms of onlining/offlining the nodes, but without any "integration," per se.
- Provide additional functionality that integrates with Slurm's `scontrol reboot` feature so that all pre-, post-, and during-reboot states are not only dealt with correctly but can also facilitate unique capabilities that don't yet exist.
Since the current contents of the nhc/dev branch have been thoroughly tested and are stable, I don't plan to make many changes to that code prior to release. So if there's a way to address item 1 (make NHC behave sanely with respect to the boot state(s)) in the short term for the 1.4.3 release, and then address item 2 for the subsequent 1.4.4 release, that would be ideal.
I will freely admit that, at this point, I have not ever used the scontrol reboot feature, so I'm open to input from both of you (@hintron and @martijnkruiten) regarding how each of these is best handled. At the moment, I'm leaning toward merging in @martijnkruiten's #65 changes for 1.4.3 and then possibly taking an approach closer to the changes here in #84 for the 1.4.4 release. I think the separate node-mark-reboot script, along with the minimal changes to existing code, is the best bet for getting 1.4.3 out the door in the safest way possible.
Your thoughts?
@mej I am fine with the timeline you proposed. This patch was just to get things working with minimal changes to NHC. It's possible there is a better way to architect it.
@hintron I was testing this patch on our NHC deployment and noticed some of the logic is ignored if I give a custom `reason=` to scontrol during the reboot. We tend to do this so we can have a meaningful reason why the reboot is taking place, rather than relying on the default `Reboot ASAP` message set by SLURM.
Example:
scontrol reboot ASAP nextstate=resume reason='KEY_INFO - reboot to test NHC changes' p0002
We use `KEY_INFO` as a way to tell our monitoring to ignore this drained node, since we generate alerts in Prometheus when nodes are drained for reasons other than `KEY_INFO` or `NHC:`.
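For illustration, the shape of that filter (a sketch of the idea, not our actual Prometheus rule), using sinfo's drain-reason output:

```bash
# Print "reason|nodelist" for down/drained nodes, then keep only the
# reasons that don't start with KEY_INFO or NHC: (our expected drains).
sinfo -R -h -o '%E|%N' | grep -Ev '^(KEY_INFO|NHC:)'
```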
I also noticed that if I set `nextstate=RESUME` when I do `scontrol reboot ASAP`, the node still comes up unhealthy without NHC marking the node offline. I am still testing what exactly is happening, but in my test environment I have a systemd unit file override that runs NHC as an `ExecStartPost` so that NHC runs right after slurmd starts. It seems this patch assumes `nextstate=DRAIN`, which might be fine given that NHC will then be able to resume the node.
# /etc/systemd/system/slurmd.service.d/nhc.conf
# File managed by Puppet
[Service]
ExecStartPost=-/usr/sbin/nhc
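One systemd detail for anyone copying this: a drop-in like the above only takes effect after the unit files are reloaded, e.g.:

```bash
# Pick up the new drop-in, then restart slurmd so ExecStartPost runs NHC.
systemctl daemon-reload
systemctl restart slurmd
```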
I am on SLURM 20.11.7 and still testing this patch, but so far it is a very welcome feature for NHC, as we use SLURM for rolling reboots and have been having issues with jobs starting after a reboot before GPFS is mounted.
> @hintron I was testing this patch on our NHC deployment and noticed some of the logic is ignored if I give a custom `reason=` to scontrol during the reboot. We tend to do this so we can have a meaningful reason why the reboot is taking place, rather than relying on the default `Reboot ASAP` message set by SLURM.
>
> I am on SLURM 20.11.7 and still testing this patch, but so far it is a very welcome feature for NHC, as we use SLURM for rolling reboots and are having issues with jobs starting after reboot before GPFS is mounted.
Good to hear!
This patch was originally written to target 18.08 (I believe), and we've been continually tweaking the Slurm boot logic since, so I am pleasantly surprised that it is still useful. I'm guessing that it needs to be updated a bit to solve the problem you are mentioning (though it's possible it was always flawed).
Unfortunately, I don't have the go-ahead to work on this further, so I'll have to leave further development to someone else. Note that 21.08 is also making some changes in node states and how they are printed out, which may affect this as well.
Aside from my minor tweak, this works with SLURM 20.11.7 by doing something like this:
scontrol reboot ASAP nextstate=down <nodes>
The default of `nextstate=resume` prevents this extra reboot logic from ever getting used. Also, as noted previously, you can't pass a reason to the reboot command; you have to let the default reason be used.
In case anyone else comes across this, another thing we had to change:
--- node-mark-offline.20210824 2021-07-30 10:45:41.518514000 -0400
+++ node-mark-offline 2021-08-24 16:17:17.380167000 -0400
@@ -73,6 +73,10 @@
             echo "$0: Not offlining $HOSTNAME: Already offline with no note set."
             exit 0
         fi
+        if [[ "$OLD_NOTE_LEADER" == "Reboot" && "$OLD_NOTE" == "ASAP" ]]; then
+            echo "$0: Not offlining $HOSTNAME: Pending reboot."
+            exit 0
+        fi
         ;;
     boot*)
         # Offline node after reboot if vanilla `scontrol reboot` was
This avoids NHC clearing the pending-reboot state of a node that develops issues while it's waiting for the reboot. I plan to open a new pull request based off this one once we've had a bit more time to thoroughly test things out.
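To illustrate the effect of the added check above (hypothetical invocation; I'm assuming the helper is called with the hostname followed by the note, the way NHC invokes `$OFFLINE_NODE`):

```bash
# p0002 is drained with reason "Reboot ASAP" while it waits to reboot.
# A later NHC failure tries to re-offline it with a fresh note...
$HELPERDIR/node-mark-offline p0002 "NHC: /gpfs not mounted"
# ...and the new check bails out instead of clobbering the reason:
#   Not offlining p0002: Pending reboot.
```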