Add GPFS health check
I have only deployed this onto one system, and it was one where I knew there were GPFS network issues: nodes were not using the RDMA that had been configured:
[root@p0001 ~]# nhc
ERROR: nhc: Health check failed: check_gpfs_health NETWORK: GPFS health for "NETWORK" is FAILED
Configured check:
* || check_gpfs_health NETWORK
One thing I am not sure about is what the behavior should be when the configured component isn't found in the output. Right now, if you run check_gpfs_health FOO, there is no warning and no failure.
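One way the missing-component case could be handled is to treat an empty lookup as an error. A minimal sketch, with fabricated sample output and a hypothetical component_state helper (neither is from the actual check):

```shell
# Hypothetical sketch of guarding against an unrecognized component name.
# The sample lines below are illustrative, not verbatim mmhealth output.
sample_output="NODE       HEALTHY
NETWORK    FAILED
FILESYSTEM HEALTHY"

component_state() {
    # Print the state column for the named component, if present.
    echo "$sample_output" | awk -v c="$1" '$1 == c { print $2 }'
}

state=$(component_state NETWORK)
echo "NETWORK state: $state"

state=$(component_state FOO)
if [ -z "$state" ]; then
    # Raising an explicit error here would surface typos like "FOO".
    echo "ERROR: component FOO not found in mmhealth output"
fi
```

With a guard like this, a mistyped component name would fail loudly instead of silently passing.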
One thing that could probably be improved is allowing the path to mmhealth to be changed, to avoid hardcoding the value.
Made path to mmhealth configurable and updated README.
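For reference, a common shell pattern for this is to default the path but let the config override it. A minimal sketch; the variable name MMHEALTH is an assumption for illustration, not necessarily what the actual commit uses:

```shell
# Sketch: default the mmhealth path, but allow nhc.conf to override it.
# The variable name MMHEALTH is an assumption for illustration.
MMHEALTH="${MMHEALTH:-/usr/lpp/mmfs/bin/mmhealth}"
echo "mmhealth path: $MMHEALTH"
```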
We noticed that something in GPFS can cause mmhealth to be unreliable, but mmfsadm test verbs status is another way to verify that GPFS is actually using RDMA rather than falling back to Ethernet. Added a check_gpfs_verbs_status check.
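A minimal sketch of the matching logic such a check might use; the function name and exact matched string are assumptions based on the "VERBS RDMA status: started" output discussed in this thread:

```shell
# Sketch of a verbs-status check: pass only if the command output
# reports VERBS RDMA as started. Name and match string are assumptions.
check_verbs_sketch() {
    case "$1" in
        *"VERBS RDMA status: started"*) echo "OK" ;;
        *) echo "FAIL: VERBS RDMA not started" ;;
    esac
}

check_verbs_sketch "VERBS RDMA status: started"   # prints OK
```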
We currently run mmfsadm test verbs status as a test for the same reason. Our test looks like this:
# Make sure GPFS RDMA VERBS started
* || check_cmd_output -m "VERBS RDMA status: started" /usr/lpp/mmfs/bin/mmfsadm test verbs status
So very similar. Worth noting is that this was broken fairly recently (it must have been in some 4.2.3.x version), and in the interim we had to do this instead:
* || check_cmd_output -m '/^\ +VerbsRdmaStarted\ +:\ yes$/' /usr/lpp/mmfs/bin/mmfsadm test verbs config
IBM helped us figure that one out (which I guess is only fair as they broke mmfsadm).
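To make that workaround's regex concrete, its behavior can be reproduced with grep against a sample line (the sample line and its spacing are illustrative, not verbatim mmfsadm output):

```shell
# Mimic the "mmfsadm test verbs config" field the workaround matches on;
# the exact leading/internal spacing here is illustrative.
sample="   VerbsRdmaStarted   : yes"
if echo "$sample" | grep -Eq '^ +VerbsRdmaStarted +: yes$'; then
    echo "RDMA started"
fi
```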
For mmhealth, though, we don't run the monitoring service on our compute nodes, so mmhealth node show isn't of any use to us. We've considered it, but I'm not sure it's that commonplace on compute nodes:
[root@hal0003 ~]# /usr/lpp/mmfs/bin/mmhealth node show
The monitoring service is down and does not respond, please restart it with 'mmsysmoncontrol restart'
This looks awesome, Trey! This will go into nhc/dev as soon as 1.4.3 is out the door. Thanks much!