openITCOCKPIT
OK (HARD) state changes missing
Describe the bug We have some services with an event handler that disables hosts with faulty services in a cluster service. The event handler is written to react only to HARD states. Some of these services occasionally go into an UNKNOWN (HARD) state (e.g. no agent data for some time due to heavy load). Unfortunately, when the services recover, there is sometimes no proper state change to OK (HARD), so the event handler that re-enables the hosts is never called. We are not sure whether this also happens after CRITICAL (HARD) states.
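For readers unfamiliar with this kind of setup, here is a minimal sketch of a handler that reacts only to HARD states. It is not the handler from this report, which is not shown. The function name `decide_action` and the action strings are hypothetical, and treating every non-OK HARD state as "faulty" is an assumption. The two arguments correspond to what the standard `$SERVICESTATE$` and `$SERVICESTATETYPE$` macros would pass on the command line.

```python
import sys
from typing import Optional


def decide_action(state: str, state_type: str) -> Optional[str]:
    """Map a service state change to a cluster action (hypothetical names).

    Assumption: any non-OK HARD state counts as "faulty".
    """
    if state_type != "HARD":
        return None  # SOFT states are ignored entirely
    if state == "OK":
        return "enable"  # recovery: put the host back into the cluster
    return "disable"  # WARNING, CRITICAL or UNKNOWN in HARD state


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Example call: handler.py UNKNOWN HARD
    print(decide_action(sys.argv[1], sys.argv[2]))
```

With this logic a host stays disabled until an OK (HARD) event arrives, which is why a missing OK (HARD) state change leaves it out of the cluster.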
root@openitc [core]: /opt/openitc/logs/nagios # zless nagios.log-2021071[0-9].gz nagios.log | perl -p -e 's/^\[([0-9]*)\]/"[".localtime($1)."]"/e' |grep d2d45d3f-61d9-4232-b05d-0be096e928e6 | grep HARD
[Fri Jul 9 06:03:20 2021] SERVICE ALERT: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;CRITICAL;HARD;1;CRITICAL: [...]
[Fri Jul 9 06:03:20 2021] SERVICE EVENT HANDLER: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;CRITICAL;HARD;1;024210cb-94a3-4e4f-bc52-8f6b063db1f4
[Fri Jul 9 06:04:20 2021] SERVICE ALERT: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;OK;HARD;1;OK: [...]
[Fri Jul 9 06:04:20 2021] SERVICE EVENT HANDLER: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;OK;HARD;1;024210cb-94a3-4e4f-bc52-8f6b063db1f4
[Thu Jul 15 04:45:35 2021] SERVICE ALERT: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;UNKNOWN;HARD;1;UNKNOWN: No data received from agent
[Thu Jul 15 04:45:35 2021] SERVICE EVENT HANDLER: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;UNKNOWN;HARD;1;024210cb-94a3-4e4f-bc52-8f6b063db1f4
[Thu Jul 15 04:46:20 2021] SERVICE ALERT: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;OK;HARD;1;OK: [...]
[Thu Jul 15 04:46:20 2021] SERVICE EVENT HANDLER: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;OK;HARD;1;024210cb-94a3-4e4f-bc52-8f6b063db1f4
[Sat Jul 17 05:57:23 2021] SERVICE ALERT: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;UNKNOWN;HARD;3;UNKNOWN: Custom check [...] timed out after 10s seconds
[Sat Jul 17 05:57:23 2021] SERVICE EVENT HANDLER: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;UNKNOWN;HARD;3;024210cb-94a3-4e4f-bc52-8f6b063db1f4
After this, the service never becomes OK (HARD) again in the logs, which is also visible in the "State History" of the service (I could provide a screenshot, but it would be very long). In the "History" (see screenshot below), however, the service becomes OK (HARD) again just a few minutes later.
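To find other affected services without grepping for each UUID, a small script can scan the log for services whose most recent HARD alert is not OK. This is only a sketch: it assumes the raw log format with an epoch timestamp in brackets (the form the perl substitution above converts), assumes lines are in chronological order, and uses the hypothetical function name `stuck_services`. The sample lines use made-up host and service names.

```python
import re

# [epoch] SERVICE ALERT: host;service;state;state_type;attempt;output
ALERT_RE = re.compile(
    r"^\[(\d+)\] SERVICE ALERT: ([^;]+);([^;]+);([^;]+);([^;]+);(\d+);(.*)$"
)


def stuck_services(lines):
    """Return {(host, service): (epoch, state)} for every service whose
    most recent HARD alert is a non-OK state."""
    last_hard = {}
    for line in lines:
        match = ALERT_RE.match(line.strip())
        if match is None:
            continue  # not a SERVICE ALERT line
        ts, host, service, state, state_type, _attempt, _output = match.groups()
        if state_type == "HARD":
            last_hard[(host, service)] = (int(ts), state)
    return {key: val for key, val in last_hard.items() if val[1] != "OK"}


# Made-up sample data, not taken from the log above.
sample = [
    "[1000] SERVICE ALERT: host-a;svc-1;UNKNOWN;HARD;1;UNKNOWN: no data",
    "[1060] SERVICE ALERT: host-a;svc-1;OK;HARD;1;OK: fine",
    "[2000] SERVICE ALERT: host-a;svc-2;UNKNOWN;HARD;3;UNKNOWN: timed out",
    "[2050] SERVICE ALERT: host-a;svc-2;OK;SOFT;1;OK: fine",
]
result = stuck_services(sample)
```

A service reported here is not necessarily still broken. It only means the log contains no OK (HARD) alert after its last HARD problem state, which is the symptom described above.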
To Reproduce No idea; sometimes it works, sometimes it does not (see log).
Expected behavior A proper state change to OK (HARD) and a call to the corresponding event handler.
Screenshots
Versions
- openITCOCKPIT Server Version: 4.2.1
- Operating system: Ubuntu 20.04 LTS
Additional context n/a
Maybe this relates to https://github.com/naemon/naemon-core/issues/368 ?
As a possible workaround, I would trigger the event handler on all OK states, not just HARD states.
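A rough sketch of that workaround, under the assumption that the handler can look up whether the host is currently enabled in the cluster: any OK state (SOFT or HARD) re-enables the host, while problem states still have to be HARD before the host is disabled. The function name `handle_event`, the `host_enabled` flag and the action strings are hypothetical. Because OK events may now arrive more than once, the sketch makes repeated calls a no-op.

```python
from typing import Optional


def handle_event(state: str, state_type: str, host_enabled: bool) -> Optional[str]:
    """Decide what to do for one event (hypothetical names).

    Any OK state re-enables the host; only HARD problem states disable it.
    Returns None when nothing needs to change.
    """
    if state == "OK":
        # React to OK regardless of SOFT/HARD, but only if needed.
        return None if host_enabled else "enable"
    if state_type == "HARD" and host_enabled:
        return "disable"
    return None
```

Whether this covers the case reported here depends on the handler actually being invoked with an OK state, which the log excerpt above does not confirm.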
Is this still an issue?