nhc
LBNL Node Health Check
This new feature in 1.4.3: ``` check_nvsmi_healthmon(): New check from CSC for GPU health monitoring via nvidia-smi ``` doesn't seem to be present in the release RPM file lbnl_nv.nhc. How...
I apologize for not having much experience with Pull Requests. Here are my new functions: check_all_fs_used, check_all_fs_inodes, check_all_fs_ifree, check_all_fs_iused which are used to check all filesystems of a particular "fstype"....
I added a helper script to mark nodes for reboot. It's based on `node-mark-offline`, but executes `scontrol reboot ASAP ` instead. This helper script can be used by setting `OFFLINE_NODE`...
We're running the NHC 1.4.3 RC1 RPM lbnl-nhc-1.4.3-1.el8.noarch on ~100 AlmaLinux 8.5 systems. These servers have Cornelis (Intel) Omni-Path 100 Gbit adapters, and I check them with this rule in...
Hello! It would be nice if you could make a new release with the fixes from the last few years. /Sven
Consider node names with a domain such as "pi.sjtu.edu.cn", for example "node838.example.edu.cn": the current version of nhc always uses the long hostname ``` function nhcmain_init_env() { ... if [[ -r /proc/sys/kernel/hostname ]]; then read HOSTNAME...
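For illustration, here is a minimal sketch of deriving a short hostname from a fully qualified one with POSIX parameter expansion. It assumes the short name is simply everything before the first dot. The function name `short_hostname` is hypothetical and is not part of nhc.

```shell
# Hypothetical helper, not part of nhc: strip the domain from an FQDN.
# "${1%%.*}" removes the longest suffix that starts with a dot,
# leaving only the first label of the name.
short_hostname() {
    printf '%s\n' "${1%%.*}"
}

short_hostname "node838.example.edu.cn"
```

A name without any dot is returned unchanged, so the same call works for nodes that already report a short hostname.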
Hi, there seems to be an error in the script nhc/helpers/node-mark-offline. There is a missing ";;" between lines 69 and 70 to properly pass from one "case" statement to the...
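To show why the terminator matters, here is a small sketch of a `case` statement in which every branch ends with ";;". The function name `classify_state`, the patterns, and the messages are hypothetical and are not the actual contents of node-mark-offline.

```shell
# Hypothetical example, not the real node-mark-offline logic.
# Each branch of a case statement is closed with ";;". If the terminator
# is left out before the next pattern, the shell reports a syntax error
# at that pattern instead of moving on to the next branch.
classify_state() {
    case "$1" in
        down*|drain*)
            echo "already-offline"
            ;;
        idle|alloc*)
            echo "mark-offline"
            ;;
        *)
            echo "ignore"
            ;;
    esac
}

classify_state "drained"
```

Only the last branch before `esac` may omit ";;", so a missing terminator in the middle of the statement is always an error.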
If nhc was configured with options like `--prefix=/opt/nhc`, the default CONFDIR, INCDIR and so on would still point to /etc/nhc. This is acceptable in some cases, but often `/etc` is...
Please add this check. Currently I check this via dmidecode, but I have to specify the id, and it could differ between nodes. It would be good to change this...
While the code is perfectly functional in its current state, the function **_nhc_hw_gather_data_** can take upwards of 40-60 seconds on multithreaded KNL nodes. It would be nice to optimize this...