Allow NHC to work with recent changes to Slurm reboot
- Add handling of the `boot` node state to the node online and offline scripts.
- Properly handle `scontrol reboot asap` so that Slurm doesn't erroneously online the node on the first NHC run after a reboot.
- Make sure the node is always offlined after a reboot until NHC onlines it, regardless of whether `scontrol reboot` or `scontrol reboot asap` was run.
- Prevent onlining a node while a reboot is pending.
For context, see https://github.com/mej/nhc/issues/81 and https://bugs.schedmd.com/show_bug.cgi?id=6391
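To make the boot-state handling concrete, here is a minimal sketch (not the actual patch) of the kind of guard the online script gains; the `sinfo` usage is an assumption on my part, but the `boot*` pattern mirrors the one the offline script matches further below:

```bash
# Skip onlining while Slurm still reports the node in a boot state.
STATE="$(sinfo -h -o '%t' -n "$HOSTNAME")"
case "$STATE" in
    boot*)
        echo "$0: Not onlining $HOSTNAME: node is still rebooting."
        exit 0
        ;;
esac
```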
This was a rework of pull request #83. The main differences from #83 are:
- prevent NHC from onlining nodes that are pending reboot
- make sure that a vanilla `scontrol reboot` (i.e., no `asap` modifier) still keeps the node drained after the reboot until NHC sees fit to online it.
I've pushed our internal version to my fork: https://github.com/mej/nhc/pull/65/files. It's a completely different helper, and it's activated by setting `export OFFLINE_NODE="$HELPERDIR/node-mark-reboot"` above the tests that should trigger a reboot. You might find something useful there. I will check your version as well.
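For anyone following along, a minimal sketch of what that activation looks like in an nhc.conf; the specific checks below are made-up placeholders, not the actual contents of the fork:

```bash
# Failures above this line use the stock offline helper (drain only).
* || check_fs_mount_rw -f /scratch

# Swap the helper: failures below this line mark the node for reboot.
* || export OFFLINE_NODE="$HELPERDIR/node-mark-reboot"
* || check_fs_mount_rw -f /gpfs
```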
So I see two separate and distinct steps to address the Slurm reboot handling that's missing from NHC:
- Add handling to ensure that the `boot` state is recognized and handled intelligently in terms of onlining/offlining the nodes, but without any "integration," per se.
- Provide additional functionality that integrates with Slurm's `scontrol reboot` feature so that all pre-, post-, and during-reboot states are not only dealt with correctly but can also facilitate unique capabilities that don't yet exist.
Since the current contents of the nhc/dev branch have been thoroughly tested and are stable, I don't plan to make many changes to that code prior to release. So if there's a way to address item 1 (make NHC behave sanely with respect to the boot state(s)) in the short term for the 1.4.3 release, and then address item 2 for the subsequent 1.4.4 release, that would be ideal.
I will freely admit that, at this point, I have not ever used the scontrol reboot feature, so I'm open to input from both of you (@hintron and @martijnkruiten) regarding how each of these is best handled. At the moment, I'm leaning toward merging in @martijnkruiten's #65 changes for 1.4.3 and then possibly taking an approach closer to the changes here in #84 for the 1.4.4 release. I think the separate node-mark-reboot script, along with the minimal changes to existing code, is the best bet for getting 1.4.3 out the door in the safest way possible.
Your thoughts?
@mej I am fine with the timeline you proposed. This patch was just to get things working with minimal changes to NHC. It's possible there is a better way to architect it.
@hintron I was testing this patch on our NHC deployment and noticed some of the logic is ignored if I give a custom `reason=` to scontrol during the reboot. We tend to do this so we can have a meaningful reason why the reboot is taking place, rather than relying on the default `Reboot ASAP` message set by SLURM.
Example:
scontrol reboot ASAP nextstate=resume reason='KEY_INFO - reboot to test NHC changes' p0002
We use `KEY_INFO` as a way to tell our monitoring to ignore this drained node, since we generate alerts in Prometheus when nodes are drained for reasons other than `KEY_INFO` or `NHC:`.
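For illustration, the shape of that filter (a sketch of the idea, not our actual Prometheus rule), using sinfo's drain-reason output:

```bash
# Print "reason|nodelist" for down/drained nodes, then keep only the
# reasons that don't start with KEY_INFO or NHC: (our expected drains).
sinfo -R -h -o '%E|%N' | grep -Ev '^(KEY_INFO|NHC:)'
```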
I also noticed that if I set `nextstate=RESUME` when I do `scontrol reboot ASAP`, the node still comes up unhealthy without NHC marking the node offline. I am still testing what exactly is happening, but in my test environment I have a systemd unit file override that runs NHC as an `ExecStartPost` so that NHC runs right after slurmd starts. It seems this patch assumes `nextstate=DRAIN`, which might be fine given that NHC will then be able to resume the node.
# /etc/systemd/system/slurmd.service.d/nhc.conf
# File managed by Puppet
[Service]
ExecStartPost=-/usr/sbin/nhc
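One systemd detail for anyone copying this: a drop-in like the above only takes effect after the unit files are reloaded, e.g.:

```bash
# Pick up the new drop-in, then restart slurmd so ExecStartPost runs NHC.
systemctl daemon-reload
systemctl restart slurmd
```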
I am on SLURM 20.11.7 and still testing this patch, but so far it is a very welcome feature for NHC, as we use SLURM for rolling reboots and have been having issues with jobs starting after a reboot before GPFS is mounted.
> @hintron I was testing this patch on our NHC deployment and noticed some of the logic is ignored if I give a custom `reason=` to scontrol during the reboot. We tend to do this so we can have a meaningful reason why the reboot is taking place, rather than relying on the default `Reboot ASAP` message set by SLURM.
>
> I am on SLURM 20.11.7 and still testing this patch, but so far it is a very welcome feature for NHC, as we use SLURM for rolling reboots and are having issues with jobs starting after reboot before GPFS is mounted.
Good to hear!
This patch was originally written to target 18.08 (I believe), and we've been continually tweaking the Slurm boot logic since, so I am pleasantly surprised that it is still useful. I'm guessing that it needs to be updated a bit to solve the problem you are mentioning (though it's possible it was always flawed).
Unfortunately, I don't have the go-ahead to work on this further, so I'll have to leave further development to someone else. Note that 21.08 is also making some changes in node states and how they are printed out, which may affect this as well.
Aside from my minor tweak, this works with SLURM 20.11.7 by doing something like this:
scontrol reboot ASAP nextstate=down <nodes>
The default of `nextstate=resume` prevents this extra reboot logic from ever getting used. Also, as noted previously, you can't pass a reason to the reboot command; you have to let the default reason be used.
In case anyone else comes across this, another thing we had to change:
--- node-mark-offline.20210824 2021-07-30 10:45:41.518514000 -0400
+++ node-mark-offline 2021-08-24 16:17:17.380167000 -0400
@@ -73,6 +73,10 @@
             echo "$0: Not offlining $HOSTNAME: Already offline with no note set."
             exit 0
         fi
+        if [[ "$OLD_NOTE_LEADER" == "Reboot" && "$OLD_NOTE" == "ASAP" ]]; then
+            echo "$0: Not offlining $HOSTNAME: Pending reboot."
+            exit 0
+        fi
         ;;
     boot*)
         # Offline node after reboot if vanilla `scontrol reboot` was
This avoids NHC clearing the pending-reboot state of a node that develops issues while it's waiting for the reboot. I plan to open a new pull request based off this one once we've had a bit more time to thoroughly test things out.
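To illustrate the effect of the added check above (hypothetical invocation; I'm assuming the helper is called with the hostname followed by the note, the way NHC invokes `$OFFLINE_NODE`):

```bash
# p0002 is drained with reason "Reboot ASAP" while it waits to reboot.
# A later NHC failure tries to re-offline it with a fresh note...
$HELPERDIR/node-mark-offline p0002 "NHC: /gpfs not mounted"
# ...and the new check bails out instead of clobbering the reason:
#   Not offlining p0002: Pending reboot.
```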