
Add GPFS health check

Open treydock opened this issue 7 years ago • 5 comments

I have only deployed this on one system so far, one where I knew there were GPFS network issues: some nodes were not using RDMA even though it was configured:

[root@p0001 ~]# nhc
ERROR:  nhc:  Health check failed:  check_gpfs_health NETWORK: GPFS health for "NETWORK" is FAILED

Configured check:

* || check_gpfs_health NETWORK

One thing I am not sure about is what to do if the configured component isn't found in the output. Right now, if you run check_gpfs_health FOO, there is no warning or failure.
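
One possible way to surface an unknown component, as a rough shell sketch. The helper name `gpfs_component_status` and its return codes are made up for illustration and are not the code in this PR; it assumes the component name and its status are the first two whitespace-separated columns of the table read on stdin.

```shell
# Hypothetical helper (not part of this PR): read a "Component  Status" table
# on stdin and print the status of the requested component.
# Returns 0 if a row for the component was found, 2 if it was not.
gpfs_component_status() {
    local component="$1"
    awk -v c="$component" '
        $1 == c { print $2; found = 1; exit }   # first matching row wins
        END     { exit(found ? 0 : 2) }         # 2 signals "component not listed"
    '
}
```

A check built on this could treat return code 2 as its own failure message (for example "component FOO not found") instead of passing silently.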

treydock avatar Nov 01 '18 18:11 treydock

One thing that could probably be improved is making the path to mmhealth configurable instead of hardcoding it.

treydock avatar Nov 01 '18 18:11 treydock

Made path to mmhealth configurable and updated README.
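
For readers who want the general idea, an override like this is commonly done in shell with a default-value expansion. The variable name `MMHEALTH_CMD` and the helper `resolve_mmhealth` below are placeholders; the name actually used in the PR is not shown in this thread.

```shell
# Placeholder names, for illustration only.
# Print the caller-supplied path if MMHEALTH_CMD is set and non-empty,
# otherwise fall back to the usual GPFS install location.
resolve_mmhealth() {
    printf '%s\n' "${MMHEALTH_CMD:-/usr/lpp/mmfs/bin/mmhealth}"
}
```

A site with GPFS installed elsewhere would then set the variable once in its configuration rather than editing the check itself.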

treydock avatar Nov 05 '18 18:11 treydock

We noticed that something in GPFS can make mmhealth unreliable, but mmfsadm test verbs status is another way to test that GPFS is actually using RDMA and not falling back to Ethernet. Added a check_gpfs_verbs_status check.

treydock avatar Dec 12 '18 17:12 treydock

We currently run mmfsadm test verbs status as a test for the same reason. Our test looks like this:

# Make sure GPFS RDMA VERBS started
* || check_cmd_output -m "VERBS RDMA status: started" /usr/lpp/mmfs/bin/mmfsadm test verbs status

So very similar. Worth noting is that this was broken for a little while pretty recently (it must have been some version of 4.2.3.x), and in the interim we had to do this instead:

* || check_cmd_output -m '/^\ +VerbsRdmaStarted\ +:\ yes$/' /usr/lpp/mmfs/bin/mmfsadm test verbs config

IBM helped us figure that one out (which I guess is only fair as they broke mmfsadm).
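
To show what that interim pattern accepts, here is a small sketch that applies what should be an equivalent expression with grep. The helper name `verbs_started` is made up, and any input lines fed to it here are invented to fit the pattern; they are not captured mmfsadm output.

```shell
# Hypothetical helper: succeed only if stdin contains a line that matches
# the interim pattern. This is the same expression as in the check above,
# without the /.../ delimiters and the backslash-escaped spaces.
verbs_started() {
    grep -Eq '^ +VerbsRdmaStarted +: yes$'
}
```

Because the pattern is anchored at both ends, it needs at least one leading space, exactly one space after the colon, and nothing after "yes", so a line ending in ": no" or carrying trailing text does not match.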

As for mmhealth, though, we don't run the monitoring stuff on our compute nodes, so mmhealth node show isn't of any use to us. We've considered it, but I'm not sure it's that commonplace on compute nodes.

[root@hal0003 ~]# /usr/lpp/mmfs/bin/mmhealth node show
The monitoring service is down and does not respond, please restart it with 'mmsysmoncontrol restart'

novosirj avatar Apr 18 '21 04:04 novosirj

This looks awesome, Trey! This will go into nhc/dev as soon as 1.4.3 is out the door. Thanks much!

mej avatar Apr 18 '21 16:04 mej