sensu-plugins-mysql
sensu-plugins-mysql copied to clipboard
check-mysql-replication-status: Lag flapping protection
Pull Request Checklist
General
-
[x] Update Changelog following the conventions laid out here
-
[x] Update README with any necessary configuration snippets
-
[x] RuboCop passes
-
[x] Existing tests pass
-
[x] First test written 😄
Purpose
MariaDB/MySQL sometimes wrongly reports a very high replication lag for a short moment. Flapping protection helps mitigating this issue better than setting occurrences
in sensu's checks
definition because you don't lose any alerting granularity.
I had to do major refactoring on check-mysql-replication-status
to allow testing.
While doing so, I discovered two minor flaws. I've fixed one of them in dedicated commit upfront so that they could be cherry-picked independently if needed.
The other flaw affects the warning/critical thresholds which are exclusive, contrary to the common practice in other checks. Fixing this would be a breaking change, though, so I left a comment in code for a future release.
Known Compatibility Issues
@majormoses The codeclimate-test-reporter
in the (existing) test/spec_helper.rb
prevents the tests from running. Can I ditch it?
The codeclimate-test-reporter in the (existing) test/spec_helper.rb prevents the tests from running. Can I ditch it?
ya we kinda abandoned that as we had pretty low unit test coverage and I have decided to focus more on integration testing which can't really be measured in the same way. We can drop it and bring it back later if we get enough coverage to see the value in it.
Thanks for your contribution to Sensu plugins! Without people like you submitting PRs we couldn't run the project. I will review it shortly.
I will review the code in a minute but I did want to point out that sensu natively has its own flap detection algorithm borrowed from nagios. The Documentation is here look for low_flap_threshold
and high_flap_threshold
I will review the code in a minute but I did want to point out that sensu natively has its own flap detection algorithm borrowed from nagios. The Documentation is here look for
low_flap_threshold
andhigh_flap_threshold
Sensu's flapping protection is meant to prevent an unnecessary storm of alerting events in case a check is flapping. This PR does a completely different thing: it works around a condition of MySQL/MariaDB during which a replication slave incorrectly reports very high lag times (in my case it's >10 days!) for a short moment.
Of course I want any real lag to trigger a prompt alert, but I want the check to requery SHOW SLAVE STATUS
under this specific condition.
Maybe "flapping protection" isn't the best name; maybe you have a different suggestion?
@majormoses I'd like to go on with this one. With your experience in the plugin ecosystem, can you guide me to a different way of testing without including the check script? And maybe you have a better proposal for naming the feature because "flapping" is already a standard term in monitoring solutions? As I wrote earlier, we're dealing with an bug/side effect of MySQL/MariaDB replication which rarely report outliers in the replication lag.
I am having a hard time coming up with a good name as its pretty much trying to work around a reporting bug. Maybe something like --discard-invalid-results-threshold
, --discard-invalid-results-retries
, --discard-invalid-results-sleep
? I can't say I like that either but nothing obvious is coming to mind.
@majormoses I've picked up work on this PR again, and I renamed the feature to lag outlier
; I think that wording expresses that it's an exceptional condition, yet unrelated to the flapping protection sensu already offers.
Tests are now running properly. I removed the CodeClimate from spec_helper.rb
, and added what you pointed out in comment https://github.com/sensu-plugins/sensu-plugins-mysql/pull/92#discussion_r258777728