Die instead of exit when filesystem/device is read-only
It may happen that the filesystem becomes read-only (e.g. when many I/O errors occur or a disk is detected as faulty). In such cases, NHC tries to write its log file without success but doesn't report the error. The patch below fixes this behavior and gives NHC a chance to drain the affected nodes.
https://github.com/edf-hpc/warewulf-nhc/blob/master/debian/patches/0002-Die-instead-of-exit-when-filesystem-device-is-read-o.patch
I'm not sure I understand. It does actually report the error (via syslog). The problem is that die() goes through a number of extra steps, including possibly not exiting (on SGE, et al.), and so it's specifically the wrong action to take in that case. That's why I chose to only pull out a few specific lines of code from the die() function and just execute those.
It sounds like the real issue you're getting at is that the node doesn't get offlined. If that's the case, I think the correct fix would be to add something like this:
[[ -n "$NHC_RM" && "$MARK_OFFLINE" -eq 1 ]] \
&& eval $OFFLINE_NODE "'$HOSTNAME'" "'Cannot write $LOGFILE as $USER (uid $EUID) -- Read-only filesystem/device failure?'"
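For context, here is a rough, self-contained sketch of how that guard might sit next to a log-file write check. It is not NHC's actual code: `check_logfile_writable` and `fake_offline_node` are hypothetical names made up for illustration, and the write test (appending nothing to the file in a subshell) is only an assumption about how such a check could be done. Only `NHC_RM`, `MARK_OFFLINE`, `OFFLINE_NODE`, `HOSTNAME` and the message text come from the snippet above.

```shell
# Hypothetical stand-in for the resource manager's offline helper; a real
# deployment would point OFFLINE_NODE at its own command instead.
fake_offline_node() {
    echo "OFFLINE $1: $2"
}

# Hypothetical helper: returns 0 if the log file can be opened for appending,
# otherwise reports the problem, optionally offlines the node, and returns 1.
check_logfile_writable() {
    local logfile="$1"
    local host="${HOSTNAME:-$(uname -n)}"
    local uid="${EUID:-$(id -u)}"
    local msg

    # Open the file for appending inside a subshell, so a failed redirection
    # cannot abort the calling shell. Note: this creates the file if missing.
    if ( : >> "$logfile" ) 2>/dev/null; then
        return 0
    fi

    msg="Cannot write $logfile as ${USER:-unknown} (uid $uid) -- Read-only filesystem/device failure?"
    echo "ERROR: $msg" >&2

    # Same guard as the snippet above, written with POSIX test syntax.
    # Offline the node only when a resource manager is set and offlining is enabled.
    if [ -n "${NHC_RM:-}" ] && [ "${MARK_OFFLINE:-0}" -eq 1 ]; then
        eval $OFFLINE_NODE "'$host'" "'$msg'"
    fi
    return 1
}
```

To try it, set `OFFLINE_NODE=fake_offline_node`, any non-empty `NHC_RM`, and `MARK_OFFLINE=1`, then call `check_logfile_writable` on a path that cannot be written. The helper deliberately returns instead of calling `die()`, so none of the extra steps in `die()` are involved.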
Would that achieve what you're looking for?
It does indeed report the error via syslog, but it doesn't offline the node. I can test your suggestion on our systems and report back.
Hi @mehdid! Have you had the opportunity to test the suggested fix? Does it correctly address the issue?
Thanks!