sensu-plugins-mysql icon indicating copy to clipboard operation
sensu-plugins-mysql copied to clipboard

check-mysql-replication-status: Lag flapping protection

Open DrMurx opened this issue 5 years ago • 8 comments

Pull Request Checklist

General

  • [x] Update Changelog following the conventions laid out here

  • [x] Update README with any necessary configuration snippets

  • [x] RuboCop passes

  • [x] Existing tests pass

  • [x] First test written 😄

Purpose

MariaDB/MySQL sometimes wrongly reports a very high replication lag for a short moment. Flapping protection helps mitigating this issue better than setting occurrences in sensu's checks definition because you don't lose any alerting granularity.

I had to do major refactoring on check-mysql-replication-status to allow testing.

While doing so, I discovered two minor flaws. I've fixed one of them in dedicated commit upfront so that they could be cherry-picked independently if needed.

The other flaw affects the warning/critical thresholds which are exclusive, contrary to the common practice in other checks. Fixing this would be a breaking change, though, so I left a comment in code for a future release.

Known Compatibility Issues

DrMurx avatar Dec 18 '18 19:12 DrMurx

@majormoses The codeclimate-test-reporter in the (existing) test/spec_helper.rb prevents the tests from running. Can I ditch it?

DrMurx avatar Dec 19 '18 14:12 DrMurx

The codeclimate-test-reporter in the (existing) test/spec_helper.rb prevents the tests from running. Can I ditch it?

ya we kinda abandoned that as we had pretty low unit test coverage and I have decided to focus more on integration testing which can't really be measured in the same way. We can drop it and bring it back later if we get enough coverage to see the value in it.

majormoses avatar Dec 19 '18 18:12 majormoses

Thanks for your contribution to Sensu plugins! Without people like you submitting PRs we couldn't run the project. I will review it shortly.

majormoses avatar Dec 20 '18 05:12 majormoses

I will review the code in a minute but I did want to point out that sensu natively has its own flap detection algorithm borrowed from nagios. The Documentation is here look for low_flap_threshold and high_flap_threshold

majormoses avatar Dec 20 '18 05:12 majormoses

I will review the code in a minute but I did want to point out that sensu natively has its own flap detection algorithm borrowed from nagios. The Documentation is here look for low_flap_threshold and high_flap_threshold

Sensu's flapping protection is meant to prevent an unnecessary storm of alerting events in case a check is flapping. This PR does a completely different thing: it works around a condition of MySQL/MariaDB during which a replication slave incorrectly reports very high lag times (in my case it's >10 days!) for a short moment.

Of course I want any real lag to trigger a prompt alert, but I want the check to requery SHOW SLAVE STATUS under this specific condition.

Maybe "flapping protection" isn't the best name; maybe you have a different suggestion?

DrMurx avatar Dec 21 '18 10:12 DrMurx

@majormoses I'd like to go on with this one. With your experience in the plugin ecosystem, can you guide me to a different way of testing without including the check script? And maybe you have a better proposal for naming the feature because "flapping" is already a standard term in monitoring solutions? As I wrote earlier, we're dealing with an bug/side effect of MySQL/MariaDB replication which rarely report outliers in the replication lag.

DrMurx avatar Jan 22 '19 15:01 DrMurx

I am having a hard time coming up with a good name as its pretty much trying to work around a reporting bug. Maybe something like --discard-invalid-results-threshold, --discard-invalid-results-retries, --discard-invalid-results-sleep? I can't say I like that either but nothing obvious is coming to mind.

majormoses avatar Feb 21 '19 04:02 majormoses

@majormoses I've picked up work on this PR again, and I renamed the feature to lag outlier; I think that wording expresses that it's an exceptional condition, yet unrelated to the flapping protection sensu already offers.

Tests are now running properly. I removed the CodeClimate from spec_helper.rb, and added what you pointed out in comment https://github.com/sensu-plugins/sensu-plugins-mysql/pull/92#discussion_r258777728

DrMurx avatar Jun 17 '19 13:06 DrMurx