potential space leak in SGE load sensor loop
With the version of bash in RHEL6 (and presumably others), an unbounded space leak appears if NHC is used as an SGE load sensor. Running under (a rebuild of) Fedora's bash 4.3 is OK; I haven't tried other versions.
This patch makes it bail out of the loop once the process's RSS has more than doubled since the first iteration, at which point execd will restart it.
diff --git a/nhc b/nhc
index 1705e79..706c07d 100755
--- a/nhc
+++ b/nhc
@@ -681,6 +692,29 @@ if [[ "$NHC_RM" == "sge" ]]; then
if nhcmain_run_checks ; then
nhcmain_finish
fi
+ # This loop leaks space with some versions of bash,
+ # e.g. RHEL6's version of 4.1; Fedora's version of 4.3 is OK.
+ # We'll bail out if memory use has ballooned too much, and
+ # execd will re-start us. Arbitrarily decide on RSS more than
+ # doubling after the first run (which doesn't take that long).
+ # For what it's worth, after triggering a core dump with nhc
+ # in a hard loop from "yes ''|nhc":
+ # strings core.5137|sort |uniq -c|sort -r -n |head
+ # 157052 e_size
+ # 25489 :1}"
+ # 7002 [*]}
+ # 6730 ARG"
+ # 6688 TARG"
+ # ...
+ # (The number of "e_size" entries is a bit less than the
+ # number of iterations.)
+ if [[ -z "$INIT_RSS" ]]; then
+ INIT_RSS=$(ps -p $$ -o rss | tail -n1)
+ elif (($(ps -p $$ -o rss | tail -n1) > 2*$INIT_RSS)); then
+ syslog "nhc bailing out with bash leak -- try a recent version of bash"
+ syslog_flush
+ exit 1
+ fi
done
else
nhcmain_load_scripts
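For anyone who wants to try the RSS check outside NHC, here is a minimal standalone sketch of the same idea. The function names (`current_rss`, `rss_has_doubled`, `run_with_rss_guard`) are made up for illustration and are not part of NHC. It assumes a `ps` that accepts the POSIX-style `-o rss=` form, which prints the value without a header line so no `tail` is needed.

```bash
#!/bin/bash
# Hypothetical helpers for illustration only; none of these exist in NHC.

# Print the resident set size (KiB) of PID $1 with no header line.
current_rss() {
    local rss
    rss=$(ps -o rss= -p "$1") || return 1
    echo $(( rss ))    # arithmetic expansion drops the padding whitespace
}

# Return 0 if $2 (current RSS) is more than twice $1 (baseline RSS).
rss_has_doubled() {
    local baseline=$1 current=$2
    (( current > 2 * baseline ))
}

# Run the given command once per input line, as a load-sensor-style loop.
# The RSS after the first pass becomes the baseline; the loop stops with
# status 1 once RSS has more than doubled, so a supervisor can restart us.
run_with_rss_guard() {
    local baseline="" now
    while read -r _; do
        "$@" </dev/null                      # one iteration of real work
        now=$(current_rss $$) || continue    # skip the check if ps fails
        if [[ -z "$baseline" ]]; then
            baseline=$now
        elif rss_has_doubled "$baseline" "$now"; then
            echo "RSS grew from ${baseline} to ${now} KiB; exiting" >&2
            return 1
        fi
    done
    return 0
}
```

Something like `yes '' | run_with_rss_guard some_check` would then exercise the loop in the same way as the `yes ''|nhc` test mentioned in the patch comment, with `some_check` standing in for the real work.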
I've been toying with the idea of dropping NHC's self-looping--something I never originally designed/intended for it to do--and instead having it exec itself at the point where it previously started the next loop iteration. I think this would address the above issue as well as some other potential issues (like data cache expiration). What are your thoughts?
You wrote:
I've been toying with the idea of dropping NHC's self-looping--something I never originally designed/intended for it to do--and instead having it exec itself at the point where it previously started the next loop iteration. I think this would address the above issue as well as some other potential issues (like data cache expiration). What are your thoughts?
I tried recursing in the load sensor loop but couldn't get it to work in the attempts I made. I can't remember exactly what happened, but it was probably confusion over the I/O redirection.
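As a point of reference for the exec idea, here is a small sketch of a script that handles one request line and then replaces itself with a fresh bash instead of looping. It is a toy stand-in, not NHC code: the temp script, the `PASS` counter variable and the `quit` keyword are all assumptions made for the demo. It relies on two behaviours of bash: `read` takes only one line from a pipe, and `exec` keeps the PID and the inherited stdin/stdout/stderr. Whether this matches what was tried above is unknown.

```bash
#!/bin/bash
# Write a tiny stand-in "sensor" to a temp file (hypothetical demo script).
demo=$(mktemp)
cat > "$demo" <<'EOF'
#!/bin/bash
# Consume exactly one request line from the inherited stdin.
IFS= read -r line || exit 0            # EOF: the caller closed the pipe
[[ "$line" == "quit" ]] && exit 0
# Report our PID, the pass number, and the line we were given.
echo "$$ ${PASS:-1} ${line}"
# Replace this process with a fresh bash running the same script.
# The PID and file descriptors 0/1/2 carry over; shell memory does not.
PASS=$(( ${PASS:-1} + 1 )) exec "$BASH" "$0" "$@"
EOF

# Feed it two requests and then "quit"; each line is handled by a new bash.
output=$(printf 'a\nb\nquit\n' | bash "$demo")
rm -f "$demo"
echo "$output"
```

Each output line should carry the same PID with an increasing pass number, which is the property that would let execd keep talking to what it sees as a single load sensor process.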
So here's my new plan: Since what you have above is specific to the SGE behavior, I will do essentially what you have done there. In fact, please feel free to submit a PR against master for this if you'd like! That's what will go into 1.4.4.
For 1.5, I'd like to do something longer-term. I already have an item on my TODO list for a way to load NHC into a running Bash session and run checks interactively, and I believe the changes required to make that viable will also help its use as a load sensor.