
Checks executed twice and no recovery notifications are sent

Open danpicpic opened this issue 1 year ago • 6 comments

Describe the bug

Over the last three weeks we have observed a weird behaviour a couple of times: checks are executed twice and notifications are sent twice (at least the Problem ones), while at the same time no Recovery notifications are ever sent. Each occurrence was confined to a small time frame (e.g. between 8am and 9am) and affected a different number of servers/services, with no common pattern between them.

The checker and notification features are enabled in HA mode on both masters. In icinga2.log on both of them (is it normal that they log the same lines? are they performing the same action in parallel?) I see the following entries, where a Problem notification is sent but no Recovery one:

[2024-02-03 08:20:18 +0100] information/Checkable: Checkable 'hostxxx!servicexxxx' has 1 notification(s). Checking filters for type 'Problem', sends will be logged.
[2024-02-03 08:20:18 +0100] information/Notification: Sending 'Problem' notification 'hostxxx!servicexxxx!state-notification-to-service' for user 'dummy_user'
[2024-02-03 08:20:18 +0100] information/Notification: Completed sending 'Problem' notification 'hostxxx!servicexxxx!state-notification-to-service' for checkable 'hostxxx!servicexxxx' and user 'dummy_user' using command 'state-notification'.
[2024-02-03 08:20:18 +0100] information/Checkable: Checkable 'hostxxx!servicexxxx' has 1 notification(s). Checking filters for type 'Problem', sends will be logged.
[2024-02-03 08:43:18 +0100] information/Checkable: Checkable 'hostxxx!servicexxxx' has 1 notification(s). Checking filters for type 'Recovery', sends will be logged.
[2024-02-03 08:43:18 +0100] information/Checkable: Checkable 'hostxxx!servicexxxx' has 1 notification(s). Checking filters for type 'Recovery', sends will be logged.
  1. The first screenshot below corresponds to the log above. All the messages saying the notification was not sent are strange, because the Problem notification was sent anyway, while the Recovery one was not.
  2. The two screenshots show how every check/action is performed twice or even multiple times (soft state, hard state, OK, notifications).
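
If it helps with narrowing this down: one way to see which endpoint actually produced a given check result is to query the object via the API and look at last_check_result.check_source; if both masters really execute the checks, the duplicated results should show different sources. A rough sketch, where the API credentials and the host/service names are placeholders:

#!/usr/bin/env bash
# Query the Icinga 2 API for a service and print which endpoint produced
# the latest check result. Credentials and object names are placeholders.
API_USER="apiuser"
API_PASS="secret"
HOST="hostxxx"
SERVICE="servicexxxx"

curl -k -s -u "${API_USER}:${API_PASS}" \
  -H 'Accept: application/json' \
  "https://localhost:5665/v1/objects/services/${HOST}!${SERVICE}" \
| python3 -m json.tool \
| grep -E '"check_source"|"execution_end"'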

Screenshots

(screenshots icinga_ss1 and icinga_ss2, showing the duplicated history entries described above)

Your Environment


  • Version used (icinga2 --version): r2.14.1-1
  • Operating System and version: RHEL 9.2
  • Enabled features (icinga2 feature list): api-users api checker command graphite ido-mysql mainlog notification
  • Icinga Web 2 version and modules (System - About): 2.11.4
  • Config validation (icinga2 daemon -C): OK
  • If you run multiple Icinga 2 instances, the zones.conf file:
object Endpoint "master1" {
}
object Endpoint "master2" {
  host = "master2"
}
object Endpoint "satellite1" {
  host = "satellite1"
}
object Endpoint "satellite2" {
  host = "satellite2"
}
object Zone "director-global" {
  global = true
}
object Zone "global-templates" {
  global = true
}
object Zone "master" {
  endpoints = [ "master1", "master2", ]
}
object Zone "satellite" {
  endpoints = [ "satellite1", "satellite2", ]
  parent = "master"
}

Additional context

  • We migrated our infrastructure from SLES 12.5 (Icinga 2.10.3) to RHEL 9 (Icinga 2.14.0) around two months ago
  • We have also installed jemalloc-5.2.1-2.el9.x86_64
  • At the beginning we only monitored test servers (around 1000 of them with active notifications) to validate the new Icinga 2 setup
  • Three weeks ago we started monitoring the remaining ~2000 production servers and upgraded Icinga 2 to v2.14.1

We started seeing the error in the last three weeks, but we don't know whether it was introduced by the last minor update to 2.14.1 or whether it has been present since the initial migration; since we were monitoring fewer and less important servers back then, it might simply have gone unnoticed.

danpicpic avatar Feb 08 '24 12:02 danpicpic

Every time it happened in a small time frame (for e.g. between 8am and 9am)

Do those frames correlate with your Director deployments?

Al2Klimov avatar Apr 25 '24 11:04 Al2Klimov

Every time it happened in a small time frame (for e.g. between 8am and 9am)

Do those frames correlate with your Director deployments?

We don't use Director. But in any case no, they don't correlate, and it has also happened a few other times with fewer servers involved (sometimes even a single event at a time).

danpicpic avatar Apr 25 '24 11:04 danpicpic

We have experienced this a few additional times. We have no clue how to debug it (a DB query?) or how to reproduce it, so any suggestions are appreciated.

Just to sum up, and to add two new pieces of information:

  • In the Icinga Web history the check attempts are displayed twice
  • two notifications do end up being sent
  • no OK notification is ever sent; the history shows the message "This notification was not sent out to any contact". What is weird is that it is always systematically like this (the critical/warning notification is sent twice, but no OK is ever sent)
  • we recently realised that both satellites were configured to connect to each other. We have now set up the connection only from satelliteA to satelliteB, as suggested by the documentation (even though the two-way setup should only cost performance, not functionality). As the problem is intermittent, we have no way to know whether this is the fix (see the sketch below for a quick way to verify the connection direction)
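
A quick way to verify which side actually opens the cluster connection after such a change (assuming the default API port 5665) is to look at the established sockets on each satellite; this is plain iproute2 usage, nothing Icinga-specific:

# Show established connections on the Icinga 2 API port (default 5665);
# the local/peer addresses reveal which endpoint initiated the connection.
ss -tnp state established '( sport = :5665 or dport = :5665 )'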

danpicpic avatar May 29 '24 14:05 danpicpic

Just for info, it has happened again a few times, which means the two-way communication between the satellites was indeed not the issue. We have no clue how to reproduce it, therefore we can't keep the debug log enabled for such an indeterminate period. In icinga2.log we see exactly what the Icinga Web history shows, and we don't know where else we could look for helpful information.

EDIT: It's becoming a real problem, as we forward the alerts to Alerta and the control room keeps restarting services that are actually working (since no OK events are ever received from Icinga), which creates business issues.

danpicpic avatar Jul 12 '24 13:07 danpicpic

  • We've implemented a daily query to catch any duplicated service history entries in the DB, as in the screenshot provided in the first post (a sketch of such a query follows this list).
  • We've noticed (at least this time) that the problem occurred twice, each time right after a reload of the Icinga 2 primary master node.
  • It looks like the secondary master takes over (as confirmed by Icinga Web) while the primary is unresponsive during the reload, but apparently both still execute checks and send notifications.
  • After roughly 10 minutes the primary master raises an ido-mysql deadlock, at which point the situation goes back to normal and only one of the two masters returns to being the sole writer to the DB.
  • The weird part is that they always conflict when executing checks and sending the Problem notification, but neither of them ever sends the Recovery one; both report 'This notification was not sent out to any contact'.
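
For reference, a rough sketch of such a duplicate-detection query. The table and column names follow the IDO schema (icinga_statehistory, icinga_objects) as far as I recall it and may need adjusting; the database name and credentials are placeholders:

# Find state-history rows recorded more than once for the same object,
# state and timestamp within the last day. Schema names may need adjusting.
mysql --user=icinga -p icinga -e "
  SELECT o.name1 AS host, o.name2 AS service,
         h.state, h.state_type, h.state_time, COUNT(*) AS entries
  FROM icinga_statehistory h
  JOIN icinga_objects o ON o.object_id = h.object_id
  WHERE h.state_time >= NOW() - INTERVAL 1 DAY
  GROUP BY o.name1, o.name2, h.state, h.state_type, h.state_time
  HAVING COUNT(*) > 1;
"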

What info should I provide? The logs don't really differ from the ones I provided at the beginning. We don't use Director; we have a daily rsync scheduled into the primary master's zones.d folder, followed by a reload.
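
For completeness, that deployment step written out as a sketch, with a config validation before the reload; the source host and paths are placeholders:

# Daily sync of the config into the primary master's zones.d, then validate
# and reload. Source host/path are placeholders.
set -euo pipefail

rsync -a --delete config-source:/srv/icinga2-config/zones.d/ /etc/icinga2/zones.d/

# Only reload if the new configuration validates cleanly.
if icinga2 daemon -C; then
  systemctl reload icinga2
else
  echo "config validation failed, keeping the running configuration" >&2
  exit 1
fi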

danpicpic avatar Aug 13 '24 06:08 danpicpic