univention-corporate-server
Rework check_univention_replication nagios check script
Thank you for providing a pull request!
Please make sure you considered the following things
- [x] I read the contribution guidelines.
- [x] I read the code of conduct.
- [x] I created a bug report in the Univention Bugzilla.
- [x] I will add a bugzilla comment about this pull request.
Link to the issue in Bugzilla
https://forge.univention.org/bugzilla/show_bug.cgi?id=53730
Description of the changes
We reworked the check_univention_replication Nagios plugin to better fit our requirements in a large customer's environment. The primary motivation was that we process a lot of LDAP changes each night, and our on-call team was being woken up in the middle of the night even though everything was all right and replication was simply taking some time. We've been using this reworked check in production for 2 months now. As discussed with Dirk Ahrnke, we're now contributing this back to Univention.
The most significant change is the alerting behavior:
- This check will report CRITICAL if:
  - Replication has failed (failed.ldif exists) OR
  - The listener id is behind that of the notifier AND has not changed since the last invocation falling inside the considered timeframe (between --min-age and --max-age)
- This check will report WARNING if:
  - The notifier id couldn't be fetched (a stopped notifier should trigger an alert on the affected host, but not on every single replica node) OR
  - No invocation history is present OR
  - The listener id is FAR (greater than the warning threshold) behind the primary's notifier id, but is progressing
- The following cases will report OK:
  - The listener id is in sync with the notifier id OR
  - The listener id is behind the primary's notifier id (but by less than the warning threshold), and is progressing
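The rules above can be sketched as a single decision function. This is only an illustration under assumptions, not the code from this pull request: the function names, the `(timestamp, listener_id)` history layout, and the order in which the conditions are tested are hypothetical, and the threshold comparison uses `>=` as stated in the `--warning` help text.

```python
from typing import List, Optional, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Hypothetical history layout: (unix timestamp, listener id at that time).
Entry = Tuple[float, int]


def reference_entry(history: List[Entry], now: float,
                    min_age: float, max_age: float) -> Optional[Entry]:
    """Return the newest entry whose age lies between min_age and max_age."""
    in_window = [e for e in history if min_age <= now - e[0] <= max_age]
    return max(in_window, key=lambda e: e[0]) if in_window else None


def evaluate(failed_ldif_exists: bool, listener_id: int,
             notifier_id: Optional[int], history: List[Entry], now: float,
             warning: int, min_age: float, max_age: float) -> Tuple[int, str]:
    """Map the observed replication state to a status code and message."""
    if failed_ldif_exists:
        return CRITICAL, "replication failed (failed.ldif exists)"
    if notifier_id is None:
        return WARNING, "notifier id could not be fetched"
    if listener_id >= notifier_id:
        return OK, "listener is in sync with the notifier"
    previous = reference_entry(history, now, min_age, max_age)
    if previous is None:
        return WARNING, "no invocation history in the considered timeframe"
    if previous[1] == listener_id:
        return CRITICAL, "listener is behind and not progressing"
    if notifier_id - listener_id >= warning:
        return WARNING, "listener is far behind but progressing"
    return OK, "listener is behind but progressing"
```

Testing `failed.ldif` first and the "in sync" case before the history lookup is a design choice of this sketch; the description does not state which rule wins when several apply.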
In addition, we introduced the following changes:
- Use Python 3, it's 2021 after all
- Use Python's argparse module for somewhat human-readable argument parsing, rather than getopt
- Add perfdata output containing the listener id, notifier id and their difference
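As a rough illustration of the perfdata item, a Nagios plugin prints its status text, then a `|`, then space-separated `label=value[;warn]` pairs. The sketch below assumes hypothetical label names (`listener_id`, `notifier_id`, `difference`) and a hypothetical helper `format_output`; the labels actually emitted by the reworked check may differ.

```python
from typing import Optional


def format_output(status: str, message: str, listener_id: int,
                  notifier_id: int, warning: Optional[int] = None) -> str:
    """Build one line of plugin output with perfdata after the pipe."""
    difference = notifier_id - listener_id
    diff_field = f"difference={difference}"
    if warning is not None:
        # The optional second perfdata field carries the warning threshold.
        diff_field += f";{warning}"
    perfdata = f"listener_id={listener_id} notifier_id={notifier_id} {diff_field}"
    return f"{status}: {message} | {perfdata}"


print(format_output("OK", "listener is behind but progressing", 996, 1000, 100))
```

Monitoring frontends that parse perfdata can then graph the difference over time, which makes slow nightly replication visible without raising an alert.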
Note that this check is NOT A DROP-IN REPLACEMENT for the existing check_univention_replication. It uses different command line arguments, the history file uses a different format in a different place, and there are probably some other breaking changes:
usage: check_univention_replication [-h] [--version] [-v] [-r] [-w cnt] [-M seconds] [-m seconds] [-f file]

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -v, --verbose         Verbose debug output
  -r, --readonly        Do not modify the history file
  -w cnt, --warning cnt
                        WARNING if difference of transaction IDs is >= <cnt>
  -M seconds, --max-age seconds
                        Disregard and remove all history entries older than <seconds>
  -m seconds, --min-age seconds
                        Disregard all history entries younger than <seconds>
  -f file, --hist-file file, --history-file file
                        Path to the history file
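For readers unfamiliar with argparse, a parser that yields a usage message of this shape could look roughly like the following. This is a reconstruction from the help text only: the argument types, the `dest` name `history_file`, and the version string are assumptions, and since the description does not state any default values, none are set here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options listed in the usage message."""
    parser = argparse.ArgumentParser(prog="check_univention_replication")
    # Placeholder version string; the real one is not given in the description.
    parser.add_argument("--version", action="version",
                        version="%(prog)s (version placeholder)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Verbose debug output")
    parser.add_argument("-r", "--readonly", action="store_true",
                        help="Do not modify the history file")
    parser.add_argument("-w", "--warning", type=int, metavar="cnt",
                        help="WARNING if difference of transaction IDs is >= <cnt>")
    parser.add_argument("-M", "--max-age", type=int, metavar="seconds",
                        help="Disregard and remove all history entries older than <seconds>")
    parser.add_argument("-m", "--min-age", type=int, metavar="seconds",
                        help="Disregard all history entries younger than <seconds>")
    parser.add_argument("-f", "--hist-file", "--history-file",
                        dest="history_file", metavar="file",
                        help="Path to the history file")
    return parser


# Example invocation with hypothetical values.
args = build_parser().parse_args(
    ["-w", "100", "-M", "3600", "-m", "60", "--hist-file", "/tmp/history", "-r"])
print(args.warning, args.max_age, args.min_age, args.history_file, args.readonly)
```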
Thanks, I created a bugzilla entry: https://forge.univention.org/bugzilla/show_bug.cgi?id=53730