univention-corporate-server

Rework check_univention_replication nagios check script

Open s3lph opened this issue 3 years ago • 2 comments

Thank you for providing a pull request!

Please make sure you considered the following things

  • [x] I read the contribution guidelines.
  • [x] I read the code of conduct.
  • [x] I created a bug report in the Univention Bugzilla.
  • [x] I will add a bugzilla comment about this pull request.

Link to the issue in Bugzilla

https://forge.univention.org/bugzilla/show_bug.cgi?id=53730

Description of the changes

We reworked the check_univention_replication Nagios plugin to better fit our requirements in a large customer's environment. The primary motivation was that we process a lot of LDAP changes each night, and our on-call team was being woken up in the middle of the night even though everything was all right and replication was just taking some time. We've been using this reworked check in production for 2 months now. As discussed with Dirk Ahrnke, we're now contributing this back to Univention.

The most significant change is the new alerting behavior:

  • This check will report CRITICAL if:
    • Replication has failed (failed.ldif exists) OR
    • The listener id is behind that of the notifier AND has not changed since the last invocation falling inside the considered timeframe (between --min-age and --max-age)
  • This check will report WARNING if:
    • The notifier id couldn't be fetched (a stopped notifier should trigger an alert on the affected host, but not on every single replica node) OR
    • No invocation history is present OR
    • The listener id is FAR (greater than the warning threshold) behind the primary's notifier id, but is progressing
  • This check will report OK if:
    • The listener is in sync with the notifier OR
    • The listener id is behind the primary's notifier id (by less than the warning threshold), but is progressing
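The rules above can be sketched as a single decision function. This is a minimal illustration, not the code from the pull request: the names `HistoryEntry` and `evaluate`, the order in which the conditions are tested, and the inclusive window boundaries are all assumptions. The threshold comparison uses `>=`, following the `--warning` help text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


@dataclass
class HistoryEntry:
    """Hypothetical history record; the real file format is not shown in the PR."""
    timestamp: float  # when an earlier check ran (seconds since epoch)
    listener_id: int  # listener transaction id seen at that time


def evaluate(
    listener_id: int,
    notifier_id: Optional[int],
    history: List[HistoryEntry],
    now: float,
    failed_ldif_exists: bool,
    warning: int,
    min_age: float,
    max_age: float,
) -> Tuple[int, str]:
    """Map the current replication state to a Nagios state (assumed rule order)."""
    if failed_ldif_exists:
        return CRITICAL, "replication failed (failed.ldif exists)"
    if notifier_id is None:
        return WARNING, "notifier id could not be fetched"
    if listener_id >= notifier_id:
        return OK, "listener is in sync with the notifier"

    # Only entries aged between min_age and max_age are considered.
    window = [e for e in history if min_age <= now - e.timestamp <= max_age]
    if not window:
        return WARNING, "no invocation history in the considered timeframe"

    # Compare against the most recent invocation inside the window.
    last = max(window, key=lambda e: e.timestamp)
    if last.listener_id == listener_id:
        return CRITICAL, "listener is behind and has not progressed"

    diff = notifier_id - listener_id
    if diff >= warning:
        return WARNING, f"listener is {diff} transactions behind, but progressing"
    return OK, f"listener is {diff} transactions behind, but progressing"
```

With a history entry recorded 300 seconds ago, `--min-age 60` and `--max-age 600`, a listener that still reports the same id while the notifier is ahead yields CRITICAL, whereas a listener whose id has moved yields OK or WARNING depending on the remaining gap.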

In addition, we introduced the following changes:

  • Use Python 3, it's 2021 after all
  • Use Python's argparse module for somewhat human-readable argument parsing, rather than getopt
  • Add perfdata output containing the listener id, notifier id and their difference
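As an illustration of the perfdata item, the sketch below builds a status line in the usual Nagios plugin format (`text | label=value;warn`). The labels `listener`, `notifier` and `diff`, and the helper names, are assumptions; the actual labels used by the reworked check are not stated here.

```python
from typing import Optional


def format_perfdata(listener_id: int, notifier_id: int,
                    warning: Optional[int] = None) -> str:
    """Build a perfdata string with hypothetical labels."""
    diff = notifier_id - listener_id
    # The warning threshold is attached to the diff metric only when given.
    diff_field = f"diff={diff}" if warning is None else f"diff={diff};{warning}"
    return f"listener={listener_id} notifier={notifier_id} {diff_field}"


def status_line(state: str, message: str, perfdata: str) -> str:
    """Join human-readable text and perfdata with the '|' separator."""
    return f"{state}: {message} | {perfdata}"


if __name__ == "__main__":
    print(status_line("OK", "listener is progressing",
                      format_perfdata(95, 100, warning=100)))
```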

Note that this check is NOT A DROP-IN REPLACEMENT for the existing check_univention_replication. It uses different command line arguments, the history file uses a different format in a different place, and there are probably some other breaking changes. The new usage is:

usage: check_univention_replication [-h] [--version] [-v] [-r] [-w cnt] [-M seconds] [-m seconds] [-f file]

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -v, --verbose         Verbose debug output
  -r, --readonly        Do not modify the history file
  -w cnt, --warning cnt
                        WARNING if difference of transaction IDs is >= <cnt>
  -M seconds, --max-age seconds
                        Disregard and remove all history entries older than <seconds>
  -m seconds, --min-age seconds
                        Disregard all history entries younger than <seconds>
  -f file, --hist-file file, --history-file file
                        Path to the history file
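For readers unfamiliar with argparse, a parser producing roughly the usage text above could look like the following. This is a reconstruction from the help output only, not the code from the pull request: the function name `build_parser`, the `dest` name `history_file`, the version string, and the absence of default values are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options listed in the usage text."""
    parser = argparse.ArgumentParser(prog="check_univention_replication")
    # The real version string is not given in the PR; this is a placeholder.
    parser.add_argument("--version", action="version",
                        version="%(prog)s (version not stated)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Verbose debug output")
    parser.add_argument("-r", "--readonly", action="store_true",
                        help="Do not modify the history file")
    parser.add_argument("-w", "--warning", type=int, metavar="cnt",
                        help="WARNING if difference of transaction IDs is >= <cnt>")
    parser.add_argument("-M", "--max-age", type=int, metavar="seconds",
                        help="Disregard and remove all history entries older than <seconds>")
    parser.add_argument("-m", "--min-age", type=int, metavar="seconds",
                        help="Disregard all history entries younger than <seconds>")
    parser.add_argument("-f", "--hist-file", "--history-file", metavar="file",
                        dest="history_file", help="Path to the history file")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Both long spellings of the history file option write to the same attribute, so `--hist-file` and `--history-file` are interchangeable on the command line.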

s3lph · Aug 31 '21 15:08

CLA assistant check
All committers have signed the CLA.

CLAassistant · Aug 31 '21 15:08

Thanks, I created a bugzilla entry: https://forge.univention.org/bugzilla/show_bug.cgi?id=53730

spaceone · Aug 31 '21 15:08