nhc Die instead of exit when filesystem/device is read-only

A filesystem can become read-only (e.g. when many I/O errors occur or a disk is detected as faulty). In such cases, NHC tries and fails to write its log file but doesn't report the error. The patch below fixes this behavior and gives NHC a chance to drain the affected nodes.

https://github.com/edf-hpc/warewulf-nhc/blob/master/debian/patches/0002-Die-instead-of-exit-when-filesystem-device-is-read-o.patch

Jun 28 '17 07:06 mehdid

I'm not sure I understand. It does actually report the error (via syslog). The problem is that die() goes through a number of extra steps, including possibly not exiting (on SGE, et al.), and so it's specifically the wrong action to take in that case. That's why I chose to only pull out a few specific lines of code from the die() function and just execute those.

It sounds like the real issue you're getting at is that the node doesn't get offlined. If that's the case, I think the correct fix would be to add something like this:

[[ -n "$NHC_RM" && "$MARK_OFFLINE" -eq 1 ]] \
  && eval $OFFLINE_NODE "'$HOSTNAME'" "'Cannot write $LOGFILE as $USER (uid $EUID) -- Read-only filesystem/device failure?'"

Would that achieve what you're looking for?
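
To make the idea easier to try out, here is a rough, self-contained sketch of how that line could sit inside a log-writability check. The function name `check_logfile_writable`, the stub `fake_offline_node`, and the default values are made up for illustration; they are assumptions, not NHC's actual code or settings. Only the condition and the `eval $OFFLINE_NODE` call come from the suggestion above.

```shell
#!/bin/bash
# Hypothetical stand-in for the real $OFFLINE_NODE helper; it only prints.
fake_offline_node() {
    echo "OFFLINE $1: $2"
}

# Assumed defaults for this sketch; in NHC these come from its own setup.
NHC_RM="${NHC_RM:-slurm}"
MARK_OFFLINE="${MARK_OFFLINE:-1}"
OFFLINE_NODE="${OFFLINE_NODE:-fake_offline_node}"
HOSTNAME="${HOSTNAME:-$(uname -n)}"

# Hypothetical wrapper: try to append to the log file. On failure, offline
# the node directly instead of going through die(), then return non-zero.
check_logfile_writable() {
    local LOGFILE="$1"
    # The subshell keeps a failed redirection from aborting the caller.
    if ! ( : >> "$LOGFILE" ) 2>/dev/null; then
        [[ -n "$NHC_RM" && "$MARK_OFFLINE" -eq 1 ]] \
            && eval $OFFLINE_NODE "'$HOSTNAME'" "'Cannot write $LOGFILE as $USER (uid $EUID) -- Read-only filesystem/device failure?'"
        return 1
    fi
    return 0
}
```

Calling `check_logfile_writable "$LOGFILE"` with a path that cannot be written should print one `OFFLINE ...` line from the stub and return 1; with `MARK_OFFLINE=0` it should return 1 without printing anything.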

Jun 28 '17 20:06 mej

It does indeed report the error via syslog, but it doesn't offline the node. I can test your suggestion on our systems and report back.

Jun 30 '17 07:06 mehdid

Hi @mehdid! Have you had the opportunity to test the suggested fix? Does it correctly address the issue?

Thanks!

Oct 31 '18 19:10 mej