OK (HARD) state changes missing

Open exa-mk opened this issue 2 years ago • 1 comments

Describe the bug We have some services with an event handler that disables hosts with faulty services in a cluster service. The event handler is written to react only to HARD states. Some of these services occasionally go into an UNKNOWN (HARD) state (e.g. no agent data for some time due to heavy load). Unfortunately, when the services recover, there is sometimes no proper state change to OK (HARD), so the event handler that re-enables the hosts never gets called. I am not sure whether this also happens after CRITICAL (HARD) states.

root@openitc [core]: /opt/openitc/logs/nagios # zless nagios.log-2021071[0-9].gz nagios.log | perl -p -e 's/^\[([0-9]*)\]/"[".localtime($1)."]"/e' |grep d2d45d3f-61d9-4232-b05d-0be096e928e6 | grep HARD
[Fri Jul  9 06:03:20 2021] SERVICE ALERT: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;CRITICAL;HARD;1;CRITICAL: [...]
[Fri Jul  9 06:03:20 2021] SERVICE EVENT HANDLER: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;CRITICAL;HARD;1;024210cb-94a3-4e4f-bc52-8f6b063db1f4
[Fri Jul  9 06:04:20 2021] SERVICE ALERT: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;OK;HARD;1;OK: [...]
[Fri Jul  9 06:04:20 2021] SERVICE EVENT HANDLER: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;OK;HARD;1;024210cb-94a3-4e4f-bc52-8f6b063db1f4
[Thu Jul 15 04:45:35 2021] SERVICE ALERT: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;UNKNOWN;HARD;1;UNKNOWN: No data received from agent
[Thu Jul 15 04:45:35 2021] SERVICE EVENT HANDLER: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;UNKNOWN;HARD;1;024210cb-94a3-4e4f-bc52-8f6b063db1f4
[Thu Jul 15 04:46:20 2021] SERVICE ALERT: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;OK;HARD;1;OK: [...]
[Thu Jul 15 04:46:20 2021] SERVICE EVENT HANDLER: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;OK;HARD;1;024210cb-94a3-4e4f-bc52-8f6b063db1f4
[Sat Jul 17 05:57:23 2021] SERVICE ALERT: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;UNKNOWN;HARD;3;UNKNOWN: Custom check [...] timed out after 10s seconds
[Sat Jul 17 05:57:23 2021] SERVICE EVENT HANDLER: be38e06a-b6ec-49dd-b191-c5a3f75c2f23;d2d45d3f-61d9-4232-b05d-0be096e928e6;UNKNOWN;HARD;3;024210cb-94a3-4e4f-bc52-8f6b063db1f4

After this, the service never becomes OK (HARD) again in the logs, which is also visible in the "State History" of the service (I could provide a screenshot, but it would be very long). In the "History" (see screenshot below), however, you can see the service becoming OK (HARD) again just a few minutes later.
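To spot services that are stuck like this without reading the whole log, the filter below may help. It is only a sketch: it assumes the `SERVICE ALERT` line layout shown in the excerpt above (`host;service;state;state type;attempt;output`), and `last_hard_states` is a hypothetical helper name, not part of openITCOCKPIT or Naemon.

```shell
#!/bin/sh
# Print the most recent HARD state logged for each host/service pair.
# Reads log lines on stdin; works with raw or converted timestamps.
last_hard_states() {
    awk -F';' '
        /SERVICE ALERT: / && $4 == "HARD" {
            # $1 looks like "[timestamp] SERVICE ALERT: <host uuid>"
            host = $1
            sub(/^.*SERVICE ALERT: /, "", host)
            # Later lines overwrite earlier ones, so the last HARD state wins
            last[host ";" $2] = $3
        }
        END {
            # Output order is unspecified in awk; pipe through sort if needed
            for (key in last) print key ";" last[key]
        }
    '
}
```

Fed with the same files in the same order as the command above (for example `zcat -f nagios.log-2021071[0-9].gz nagios.log | last_hard_states | grep -v ';OK$'`), it should list only the services whose last logged HARD state is not OK.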

To Reproduce No idea; sometimes it works, sometimes it does not (see log).

Expected behavior A proper change to state OK (HARD) and a call to the corresponding event handler.

Screenshots image

Versions

  • openITCOCKPIT Server Version: 4.2.1
  • Operating system: Ubuntu 20.04 LTS

Additional context n/a

exa-mk, Jul 19 '21 11:07

Maybe this is related to https://github.com/naemon/naemon-core/issues/368?

As a possible workaround, I would trigger the event handler on all OK states, not just HARD states.
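A minimal sketch of what that could look like inside the handler script. It assumes the handler command passes the `$SERVICESTATE$` and `$SERVICESTATETYPE$` macros as the first two arguments; `decide_action` and the `enable`/`disable`/`ignore` results are hypothetical names, and which non-OK states should disable a host is a guess based on the description above.

```shell
#!/bin/sh
# Decide what the event handler should do for a state and state type.
# $1: service state (assumed to come from $SERVICESTATE$)
# $2: state type, SOFT or HARD (assumed to come from $SERVICESTATETYPE$)
decide_action() {
    state=$1
    state_type=$2
    case "$state" in
        OK)
            # Workaround: react to every OK, whether SOFT or HARD
            echo enable
            ;;
        CRITICAL|UNKNOWN)
            # Problem states still act on HARD only, as before
            if [ "$state_type" = "HARD" ]; then
                echo disable
            else
                echo ignore
            fi
            ;;
        *)
            echo ignore
            ;;
    esac
}
```

Because the enable action can then run more than once per recovery, it should be safe to repeat. This only helps if the handler is actually invoked on recovery; it does not address why the OK (HARD) entry is missing from the log.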

nook24, Aug 09 '21 06:08

Is this still an issue?

nook24, Feb 09 '23 17:02