
check lustre filesystem health

Open flybirdkh opened this issue 5 years ago • 3 comments

How can I use nhc to check the health of my Lustre filesystem?

When I configure it with "* || check_cmd_output -t 5 -m '135T' -e '/usr/bin/lfs df -h|grep filesystem|grep T'", it reports: ERROR: nhc: Health check failed: check_cmd_output: 4 returned by "/usr/bin/lfs df -h|grep filesystem|grep T".

flybirdkh avatar Jan 08 '20 06:01 flybirdkh

When executing a pipeline, the overall return code is the exit status of the last process. I'm not sure what would cause grep to return a 4; the documentation for GNU grep's exit codes is here: https://www.gnu.org/software/grep/manual/grep.html#Exit-Status
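To make the point about pipeline return codes concrete, here is a minimal sketch, assuming a POSIX-style shell with the `pipefail` option not enabled. The variable names are made up for illustration. GNU grep documents exit statuses 0 (a line matched), 1 (no line matched) and 2 (error), which is why a 4 is unexpected if grep were really the last process.

```shell
# Pipeline exit status, assuming the pipefail option is not enabled.

# The status of a pipeline is the status of its last command.
status_last_ok=0
false | true || status_last_ok=$?      # stays 0: `true` ran last

status_last_fail=0
true | false || status_last_fail=$?    # becomes 1: `false` ran last

# grep exits 1 when no line matches, so "no match" surfaces as status 1.
status_no_match=0
printf 'alpha\n' | grep -q 'beta' || status_no_match=$?

echo "last ok: $status_last_ok, last fail: $status_last_fail, no match: $status_no_match"
```

In other words, a failure in an early stage of the pipeline is hidden unless the last command also fails.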

mej avatar Apr 19 '21 02:04 mej

In case it helps, this is what we have in our Lustre health check script on the clients:

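# Fail the NHC check if any Lustre server (MDT or OST) is unreachable.
function check_lfs_servers() {
  # Capture stderr only; stdout is discarded.
  lfs_check=$(/usr/bin/lfs check servers 2>&1 >/dev/null)

  # Empty stderr means every server responded.
  if [[ -z $lfs_check ]] ; then
    return 0
  else
    die 1 "Could not reach at least one MDT or OST"
    return 1
  fi
}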

guilbaults avatar Aug 06 '21 19:08 guilbaults

I can actually reproduce this behavior: it's not grep that returns 4, it's /usr/bin/lfs, because the pipe and following commands are interpreted as arguments to the lfs command.

This can be reproduced with something more verbose, like ls:

$ nhc -e "check_cmd_output -m '/foo/' -e 'ls -al /tmp/a | grep bar'"
ls: cannot access |: No such file or directory
ls: cannot access grep: No such file or directory
ls: cannot access bar: No such file or directory
ERROR:  nhc:  Health check failed:  check_cmd_output:  2 returned by "ls -al /tmp/a | grep bar"

It shows that ls tries to list files named |, grep and bar.
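The same effect can be shown outside of nhc. The sketch below is a standalone illustration of the mechanism, not NHC's actual implementation and not a verified NHC workaround: `echo` stands in for `ls` or `lfs`, and the variable names are hypothetical. When a command string is word-split and executed directly, `|` is just another argument; only when a shell parses the string does it become a pipe.

```shell
# Hypothetical command string; echo stands in for ls or lfs.
cmd='echo hello | grep bar'

# Word-split execution: the unquoted expansion is split into words and run
# directly, so "|", "grep" and "bar" are passed to echo as plain arguments.
split_out=$($cmd)

# Shell-parsed execution: sh -c parses the string, so "|" is a real pipe.
# grep finds no "bar", so the output is empty and the status is 1.
parsed_status=0
parsed_out=$(sh -c "$cmd") || parsed_status=$?

echo "word-split output:   '$split_out'"
echo "shell-parsed output: '$parsed_out' (status $parsed_status)"
```

The first form is consistent with the `ls: cannot access |` messages above; the second is what the original check line presumably intended.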

So piping doesn't seem to work easily with check_cmd_output. Is there a workaround, other than defining a whole separate check in a script file?

kcgthb avatar Jun 15 '22 19:06 kcgthb