percona-clustercheck
Add flock to prevent concurrent clustercheck runs using up connections
When one of our nodes got a bit tied up due to a disk space issue, clustercheck started filling up the ps list, with each process waiting on mysql queries.
To prevent that, I've wrapped the whole routine in this advice from flock(1):
(
flock -n 9 || exit 1
# ... commands executed under lock ...
) 9>/var/lock/mylockfile
As a result of this extra nesting, the majority of the file has been indented.
I've also pulled the HTTP responses out into functions to avoid repetition. The Content-Length calculations might be slightly off, as I'm not sure whether all the \r\ns are counted, so it just uses a string length check.
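For reference, Content-Length covers only the bytes of the body, including any \r\n inside it; the status line, the headers, and the blank line that ends them are not counted. A minimal sketch of a response helper along those lines follows. The function name, the header set, and the example body are illustrative assumptions, not taken from the script.

```shell
#!/bin/bash
# Hypothetical helper: print a complete HTTP response for a status and a body.
# The body may contain escapes such as \r\n, which printf '%b' expands.
http_response() {
    local status_line="$1" body="$2" length
    # Count the bytes of the expanded body only. wc -c counts bytes, so each
    # \r\n in the body contributes 2. $(( )) strips any padding from wc.
    length=$(( $(printf '%b' "$body" | wc -c) ))
    printf 'HTTP/1.1 %s\r\n' "$status_line"
    printf 'Content-Type: text/plain\r\n'
    printf 'Connection: close\r\n'
    printf 'Content-Length: %d\r\n' "$length"
    printf '\r\n'                      # blank line ends the headers
    printf '%b' "$body"
}

# Example body, made up for illustration.
http_response "200 OK" 'Node is synced.\r\n'
```

Counting bytes with wc -c rather than characters keeps the header correct if a message ever contains multi-byte characters.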
Thanks for the pull request. I have a question though. Did you see many clustercheck-processes? Because there already is a timeout of 10 seconds in the execution of the mysql command, after which it exits.
If the problem is the ps list filling up, this won't solve it:
flock -w $TIMEOUT 9 || report_fail "clustercheck is blocked up."
With or without this change, there should never be any clustercheck-process running for more than 10 seconds. Instead of waiting for the mysql command, each process will now wait for a file lock, so won't the ps-list still increase?
As this was a production cluster, I was in a bit of a rush and didn't stop to investigate this incidental flaw, but there were ten to twenty clustercheck processes in ps that were apparently blocking on mysql commands, and that was the case for a lot longer than ten seconds. At the time, it was possible to connect to mysqld, but any query -- even a simple SHOW STATUS LIKE ... -- would block. Now, the fact that the node was so messed up that it blocked on such a straightforward query is a different matter entirely ;)
To be clear, the mysql commands were connecting to mysqld successfully (and instantly) so --connect-timeout was not relevant. And there was no query timeout set on those calls by default... which, admittedly, is another problem!
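One way to bound the whole call rather than just the connect phase is to run the client under timeout(1) from GNU coreutils. This is only a sketch under that assumption: the wrapper name and the QUERY variable are hypothetical, and killing the client does not by itself guarantee that the server stops working on the query.

```shell
#!/bin/bash
TIMEOUT=10   # seconds; assumed to match the existing connect timeout

# Hypothetical wrapper: run any command, terminating it after $TIMEOUT seconds.
# GNU timeout exits with status 124 when the deadline is reached.
run_with_deadline() {
    timeout "$TIMEOUT" "$@"
}

# Intended use (QUERY is a placeholder for the status query; credentials omitted):
#   run_with_deadline mysql --connect-timeout="$TIMEOUT" -e "$QUERY"
```

A status of 124 can then be treated the same as any other failed check.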
Suffice to say, it was a fairly screwed-up situation that shouldn't have happened, but the numerous clustercheck-launched mysqls all blocking was the problem here.
Anyway, the flock -w $TIMEOUT with a TIMEOUT equal to 10 should mean a clustercheck process waits on the flock call for up to ten seconds, and then exits with status 1 if it fails to acquire the lock. In this scenario, it'd mean I'd still have one clustercheck blocking on the query, but at least I wouldn't be getting "Too many connections" just from clusterchecks alone.
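The behaviour described above can be sketched roughly as follows. The run_locked name and the lock file path are assumptions for illustration; the real script wraps its whole routine rather than a single command.

```shell
#!/bin/bash
TIMEOUT=10
# Hypothetical lock path; the flock(1) example uses /var/lock/mylockfile.
LOCKFILE="${LOCKFILE:-${TMPDIR:-/tmp}/clustercheck.lock}"

# Run a command while holding an exclusive lock on $LOCKFILE.
# If the lock cannot be acquired within $TIMEOUT seconds, give up with status 1
# without running the command (so no extra mysql connection is opened).
run_locked() {
    (
        flock -w "$TIMEOUT" 9 || {
            echo "clustercheck is blocked up." >&2
            exit 1                      # leaves the subshell only
        }
        "$@"                            # executed under the lock
    ) 9>"$LOCKFILE"
}
```

With this shape, at most one check talks to mysqld at a time, and every other invocation either waits its turn or fails within $TIMEOUT seconds.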