avoiding repeated messages when used as SGE load sensor

Open loveshack opened this issue 9 years ago • 3 comments

If NHC is used as an SGE load sensor with syslogging, it currently spams syslog with a message on each run until the problem is resolved. This change avoids sending messages when the state hasn't changed.

diff --git a/nhc b/nhc
index 1705e79..706c07d 100755
--- a/nhc
+++ b/nhc
@@ -40,6 +40,10 @@

 ### Library functions

+# Cache for the last message to avoid spamming syslog in the SGE loop
+# until the state changes.
+last_died_msg=
+
 # Declare a print-error-and-exit function.
 function die() {
     IFS=$' \t\n'
@@ -48,8 +52,11 @@ function die() {

     CHECK_DIED=1
     log "ERROR:  $NAME:  Health check failed:  $*"
-    syslog "Health check failed:  $*"
-    syslog_flush
+    if [[ "$NHC_RM" != "sge" || "$*" != "$last_died_msg" ]]; then
+       last_died_msg="$*"
+       syslog "Health check failed:  $*"
+       syslog_flush
+    fi
     if [[ -n "$NHC_RM" && "$MARK_OFFLINE" -eq 1 && "$FAIL_CNT" -eq 0 ]]; then
         eval $OFFLINE_NODE "'$HOSTNAME'" "'$*'"
     fi
@@ -628,6 +635,10 @@ function nhcmain_mark_online() {
 function nhcmain_finish() {
     local ELAPSED

+    if [[ -n "$last_died_msg" ]]; then
+       syslog "Health check recovered"
+       last_died_msg=
+    fi
     syslog_flush
     ELAPSED=$((SECONDS-NHC_START_TS))
     vlog "Node Health Check completed successfully (${ELAPSED}s)."

loveshack avatar Mar 10 '16 15:03 loveshack

I'd like to figure out a more robust way of addressing this as it's not unique to SGE/looping-mode at all. Have you looked at nhc-wrapper? It basically addresses this by suppressing duplicate messages but allowing the cache of the previous message to expire every so often, thus allowing the warning to be refreshed at a set time interval (rather than forcing it to be either never, or at every execution). I'd like to figure out a way to offer that same functionality for this.
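
The suppress-but-expire behavior described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not nhc-wrapper's actual code: `report_failure`, `CACHE_FILE`, and `CACHE_TTL` are hypothetical names, and `echo` stands in for the real `syslog` call. It only handles single-line messages.

```shell
#!/usr/bin/env bash
# Sketch only: duplicate suppression with an expiring cache.
# report_failure, CACHE_FILE and CACHE_TTL are hypothetical names;
# none of them exist in nhc or nhc-wrapper.

CACHE_FILE="${CACHE_FILE:-/tmp/nhc-last-msg.$$}"
CACHE_TTL="${CACHE_TTL:-3600}"   # seconds before a repeated message is sent again

# Emit the message (echo stands in for syslog) unless the same message
# was already emitted less than CACHE_TTL seconds ago.
# Returns 0 if the message was emitted, 1 if it was suppressed.
function report_failure() {
    local msg="$1" now cached_ts=0 cached_msg=""

    now=$(date +%s)
    if [[ -r "$CACHE_FILE" ]]; then
        # Cache layout: line 1 = timestamp, line 2 = last message.
        { read -r cached_ts; read -r cached_msg; } < "$CACHE_FILE"
    fi
    # Treat a missing or corrupt timestamp as "expired long ago".
    [[ "$cached_ts" =~ ^[0-9]+$ ]] || cached_ts=0

    if [[ "$msg" == "$cached_msg" ]] && (( now - cached_ts < CACHE_TTL )); then
        return 1
    fi

    printf '%s\n%s\n' "$now" "$msg" > "$CACHE_FILE"
    echo "Health check failed:  $msg"
    return 0
}
```

With this shape, a changed message is always reported immediately, while an unchanged one is repeated only after `CACHE_TTL` seconds, which gives a periodic reminder rather than a choice between "never" and "on every run".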

Maybe the right answer is to extract the syslog functionality into nhc-wrapper and allow for more configurable definitions of "reporting" in that location.

Any thoughts?

mej avatar Mar 10 '16 23:03 mej

You wrote:

I'd like to figure out a more robust way of addressing this as it's not unique to SGE/looping-mode at all. Have you looked at nhc-wrapper?

No; it didn't seem to be doing the same job, but maybe it would be straightforward to adapt it. I was going to resort to a wrapper script for the loop until I found I could use an alternative bash.

loveshack avatar Mar 14 '16 17:03 loveshack

Here again, I want to put in something resembling the above for 1.4.4 and rethink for 1.5, similar to #11.

mej avatar Mar 04 '23 08:03 mej