Avoid repeated syslog messages when used as an SGE load sensor
If NHC is used as an SGE load sensor with syslog logging enabled, it currently sends the same failure message to syslog on every run until the problem is resolved. This change skips the message when the state hasn't changed.
diff --git a/nhc b/nhc
index 1705e79..706c07d 100755
--- a/nhc
+++ b/nhc
@@ -40,6 +40,10 @@
### Library functions
+# Cache for the last message to avoid spamming syslog in the SGE loop
+# until the state changes.
+last_died_msg=
+
# Declare a print-error-and-exit function.
function die() {
IFS=$' \t\n'
@@ -48,8 +52,11 @@ function die() {
CHECK_DIED=1
log "ERROR: $NAME: Health check failed: $*"
- syslog "Health check failed: $*"
- syslog_flush
+ if [[ "$NHC_RM" != "sge" || "$*" != "$last_died_msg" ]]; then
+ last_died_msg="$*"
+ syslog "Health check failed: $*"
+ syslog_flush
+ fi
if [[ -n "$NHC_RM" && "$MARK_OFFLINE" -eq 1 && "$FAIL_CNT" -eq 0 ]]; then
eval $OFFLINE_NODE "'$HOSTNAME'" "'$*'"
fi
@@ -628,6 +635,10 @@ function nhcmain_mark_online() {
function nhcmain_finish() {
local ELAPSED
+ if [[ -n "$last_died_msg" ]]; then
+ syslog "Health check recovered"
+ last_died_msg=
+ fi
syslog_flush
ELAPSED=$((SECONDS-NHC_START_TS))
vlog "Node Health Check completed successfully (${ELAPSED}s)."
I'd like to figure out a more robust way of addressing this as it's not unique to SGE/looping-mode at all. Have you looked at nhc-wrapper? It basically addresses this by suppressing duplicate messages but allowing the cache of the previous message to expire every so often, thus allowing the warning to be refreshed at a set time interval (rather than forcing it to be either never, or at every execution). I'd like to figure out a way to offer that same functionality for this.
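To make the suppress-with-expiry idea concrete, here is a rough sketch of what I have in mind. It is only an illustration under my own assumptions, not how nhc-wrapper actually implements it: `should_report`, `CACHE_FILE`, and `CACHE_TTL` are hypothetical names, and it assumes single-line messages.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: none of these names exist in nhc or nhc-wrapper.
# A message is reported if it differs from the cached one, or if the
# cached copy is at least CACHE_TTL seconds old.

CACHE_FILE="${CACHE_FILE:-${TMPDIR:-/tmp}/nhc-last-msg}"   # assumed location
CACHE_TTL="${CACHE_TTL:-3600}"                             # assumed default, in seconds

# should_report MESSAGE [NOW]
# Returns 0 if MESSAGE should be sent (and refreshes the cache),
# or 1 if it is an unexpired duplicate. NOW defaults to the current epoch time.
function should_report() {
    local msg="$1" now="${2:-$(date +%s)}"
    local cached_ts=0 cached_msg=""

    if [[ -r "$CACHE_FILE" ]]; then
        # Line 1 is the timestamp, line 2 the message; an empty file is fine.
        { IFS= read -r cached_ts && IFS= read -r cached_msg; } < "$CACHE_FILE" || true
    fi

    if [[ "$msg" == "$cached_msg" ]] && (( now - ${cached_ts:-0} < CACHE_TTL )); then
        return 1    # same message, cache still fresh: stay quiet
    fi

    printf '%s\n%s\n' "$now" "$msg" > "$CACHE_FILE"
    return 0
}

# Possible use inside die(), again only as a sketch:
#   if should_report "$*"; then
#       syslog "Health check failed: $*"
#       syslog_flush
#   fi
```

Keeping the cache in a file rather than a shell variable means the same logic would work both in looping mode and when NHC is started fresh on each run, which is the case the in-memory `last_died_msg` does not cover.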
Maybe the right answer is to extract the syslog functionality into nhc-wrapper and allow for more configurable definitions of "reporting" in that location.
Any thoughts?
You wrote:
I'd like to figure out a more robust way of addressing this as it's not unique to SGE/looping-mode at all. Have you looked at nhc-wrapper?
No; it didn't seem to be doing the same job, but maybe it would be straightforward to adapt it. I was going to resort to a wrapper script for the loop until I found I could use an alternative bash.
Here again, I want to put in something resembling the above for 1.4.4 and rethink for 1.5, similar to #11.