nhc
LBNL Node Health Check
Add `boot` node state to node online and offline script. Properly handle `scontrol reboot asap` so Slurm doesn't erroneously online the node after the first NHC call after a reboot....
Reduces the number of accesses to /proc/cpuinfo in the nhc_hw_gather_data function. For processors like Intel Phi KNL with 256+ threads, the function was taking over 40 seconds to return. Running on...
The README.md file says that RPMs are available also for EL8 (AlmaLinux etc.), but no EL8 RPM package is found. Can you please provide the EL8 RPM? I'm lacking instructions...
The check `* || check_ps_service -u root -S sshd` fails on Ubuntu 20.04. I know `sshd` is running on the node because I logged in to it with `ssh` and `systemctl is-active...
Starting with `bash` 5.1.0, the test will fail with: ``` *snip* bash variable sanity check...failed 4/7 TEST FAILED: BASH built-in variable $BASH_REMATCH: declare -a BASH_REMATCH=() does not match declare -ar...
We have some GPU nodes, and I would like to make the NVIDIA_HEALTHMON check in nhc.conf. Unfortunately, it seems that Nvidia no longer offers the nvidia-healthmon (at least I was...
It would be great if there were a way to specify a minimum rate for IB (we have a mixture of 40 and 56). check_hw_ib: `check_hw_ib rate [device]` check_hw_ib determines...
It seems that because check_ps_unauth_users returns an error string containing a `'` character, it causes a parsing error in nhc. Below is a snippet of my nhc.conf and the nhc log...
If NHC is used as an SGE load sensor with syslogging, it currently spams syslog with a message on each run until the problem is resolved. This change avoids sending...
When running nhc with check_hw_ib, the output when an error is found is: `ERROR: nhc: Health check failed: check_hw_ib: No IB port hfi1_0:1 is ACTIVE (LinkUp 100 Gb/sec).` I...