nhc
LBNL Node Health Check
Add `check_hw_numa` to verify the NUMA configuration of the system, _i.e._ the number of NUMA nodes and the number of NUMA nodes per socket (NPS).
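As a rough illustration of the first half of such a check, the NUMA node count can be read from sysfs. This is only a sketch and not the code from the patch: `count_numa_nodes` is a hypothetical helper name, and it assumes the standard Linux layout of one `nodeN` directory per NUMA node under `/sys/devices/system/node`. It does not attempt the nodes-per-socket calculation.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NHC): count NUMA nodes by counting the
# nodeN directories under the sysfs node directory.
# The base directory is a parameter so the function can be tried on a fake tree.
count_numa_nodes() {
    local base="${1:-/sys/devices/system/node}" count=0 dir
    for dir in "$base"/node[0-9]*; do
        # An unmatched glob stays literal, so only count real directories.
        if [[ -d "$dir" ]]; then
            count=$(( count + 1 ))
        fi
    done
    echo "$count"
}
```

Calling `count_numa_nodes` with no argument reports the count for the running system; comparing that number with an expected value is what a check would do.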
To be honest, I'm not exactly sure if this is because GPFS is doing something non-standard, or this would happen with any stale remote filesystem type. ``` [root@node001 ~]# nhc...
Using nhc 1.4.2, I have several check_fs_used checks for our grid nodes. They work great most of the time, auto-draining the nodes when the utilization gets too high. However, when a...
This check is yet another way we're trying to identify when the kernel has leaked memory and thus the node needs to be rebooted. The check we're using looks for SUnreclaim...
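For readers unfamiliar with the field, `SUnreclaim` is reported in kB in `/proc/meminfo`. A minimal sketch of a threshold test on it might look like the following. The function names `sunreclaim_kb` and `check_sunreclaim_max` are hypothetical and are not the check proposed in this issue, and the limit is an arbitrary parameter.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NHC): print the SUnreclaim value in kB
# from a meminfo-format file. Defaults to /proc/meminfo.
sunreclaim_kb() {
    local meminfo="${1:-/proc/meminfo}"
    # Exit non-zero if the field is missing so callers can detect it.
    awk '$1 == "SUnreclaim:" { print $2; found = 1 } END { exit !found }' "$meminfo"
}

# Hypothetical check: return 1 when SUnreclaim exceeds the given limit (kB).
check_sunreclaim_max() {
    local limit_kb="$1" meminfo="${2:-/proc/meminfo}" value_kb
    if ! value_kb=$(sunreclaim_kb "$meminfo"); then
        echo "SUnreclaim not found in $meminfo"
        return 1
    fi
    if (( value_kb > limit_kb )); then
        echo "SUnreclaim is ${value_kb} kB, above the limit of ${limit_kb} kB"
        return 1
    fi
    return 0
}
```

For example, `check_sunreclaim_max 4194304` would flag a node whose unreclaimable slab memory exceeds 4 GiB; a sensible limit depends on the node's memory size and workload.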
An unmodified install dumps the following error when calling nhc-genconf: ```bash [root@yslogin6 ~]# /usr/local/sbin/nhc-genconf -H '*' -c - # NHC Configuration File # # Lines are in the form "||" # Hostmask...
Hi Michael, We have a number of backup servers and storage servers with up to 5-10-20 mounted logical volumes, and for each server and each file system I configure in...
When checking the root file system on CentOS 7 with lbnl-nhc 1.4.2, df is invoked with the '-a' flag, causing output similar to the following: ``` Filesystem Type 1K-blocks Used...
When a check in NHC interacts with a failing filesystem, the NHC process can stay in the D (uninterruptible sleep) state forever. There should be an option to allow or...
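One general way to keep a caller from blocking on such a probe is to run it in the background and stop waiting after a deadline. The sketch below illustrates that idea only; `run_with_deadline` is a hypothetical name, not an existing NHC option or the mechanism requested in this issue. It deliberately does not try to kill the child, since a process in D state may not respond to signals; the child is simply left behind.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NHC): run a command in the background and
# give up waiting after a deadline in whole seconds.
# Returns the command's exit status, or 124 if the deadline passed first.
run_with_deadline() {
    local deadline_s="$1"; shift
    local pid waited=0
    "$@" &
    pid=$!
    # Poll once per second while the child still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if (( waited >= deadline_s )); then
            # Stop waiting; the child may be stuck and is not reaped here.
            return 124
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
    # The child has finished; collect its exit status.
    wait "$pid"
}
```

A probe such as `run_with_deadline 5 stat -t /some/mount` would then return 124 instead of hanging the caller, where `/some/mount` stands in for whatever filesystem is being checked.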
I'm building a small test cluster using PBSpro and wasn't able to get NHC working out-of-the-box with PBSpro due to the differences in the pbsnodes command. The differences required more...
This patch will add a `check_hw_edac` check to verify correctable and uncorrectable ECC errors in memory, as reported by [`edac-utils`](https://github.com/grondo/edac-utils). EDAC is an alternative to MCE checks, with support for...