
Replay-log not writing performance data

Open schuettloeffel-elsa opened this issue 5 years ago • 9 comments

Describe the bug

In a single-master setup, when the master is down, replay logs are written on all agents to /var/lib/icinga2/api/log/current; but once the master is up again, no performance data is replayed to our InfluxDB. The state history is also empty for that period.

It seems replaying to the master is working:

Agent's log

[2019-12-09 12:21:12 +0100] information/ApiListener: Sending replay log for endpoint 'xxx' in zone 'master'.
[2019-12-09 12:21:12 +0100] notice/ApiListener: Replaying log: /var/lib/icinga2/api/log/current
[2019-12-09 12:21:12 +0100] information/ApiListener: Replayed 129 messages.

Master's log

[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'yyy'.
[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'log::SetLogPosition' message from identity 'yyy'.
[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'yyy'.
[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'event::CheckResult' message from identity 'yyy'.
[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'yyy'.
...

But for some reason the master does not process the replay logs after receiving them.

To Reproduce

  1. On the master, stop the icinga2 service
  2. On the agents, wait for the replay log to fill up
  3. Start the icinga2 service on the master again
  4. Check for history data covering the downtime

Expected behavior

  1. Performance data is replayed to the data backends (e.g. InfluxDB)
  2. State histories are updated

Your Environment

  • Version used (icinga2 --version): r2.11.2-1
  • Operating System and version: Ubuntu 18.04.3 LTS (Bionic Beaver)
  • Enabled features (icinga2 feature list): api checker command compatlog debuglog ido-mysql influxdb mainlog notification statusdata
  • Config validation (icinga2 daemon -C): all fine
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
Object 'yyy' of type 'Endpoint':
  % declared in '/etc/icinga2/zones.conf', lines 60:1-60:42
  * __name = "yyy"
  * host = "yyy"
    % = modified in '/etc/icinga2/zones.conf', lines 60:45-60:75
  * log_duration = 86400
  * name = "yyy"
  * package = "_etc"
  * port = "5665"
  * source_location
    * first_column = 1
    * first_line = 60
    * last_column = 42
    * last_line = 60
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "yyy" ]
    % = modified in '/etc/icinga2/zones.conf', lines 60:1-60:42
  * type = "Endpoint"
  * zone = ""

schuettloeffel-elsa avatar Dec 09 '19 11:12 schuettloeffel-elsa

Hi,

do you see InfluxDB metrics being collected for these check results? This can be seen and correlated in the debug log as well.

@lippserd Maybe there's something weird going on with the JSON messages here as well.

dnsmichi avatar Dec 10 '19 09:12 dnsmichi

@dnsmichi: no, the next log entry for that agent writing to InfluxDB is:

[2019-12-09 12:21:29 +0100] debug/InfluxdbWriter: Checkable 'yyy!workers' adds to metric list:'procs,hostname=yyy,service=workers,metric=procs crit=2500,value=6,warn=1850 1575890489'.

And that is already new data from after the master came back up.

schuettloeffel-elsa avatar Dec 12 '19 08:12 schuettloeffel-elsa

Does the agent schedule the checks itself, or is it only used as a command endpoint? In the first case replay should work; in the second, no checks are executed while disconnected, because the master schedules the checks but cannot tell the endpoint to execute them. I cannot tell from the issue how this is set up, so can you provide at least one service configuration?

dgoetz avatar Dec 12 '19 08:12 dgoetz

@dgoetz: Since the agent's replay log is being filled during the master downtime, I assumed the agent still runs checks.

Here is an excerpt from the replay log where you can see performance data being written:

897:{"message":"{\"jsonrpc\":\"2.0\",\"method\":\"event::CheckResult\",\"params\":{\"cr\":{\"active\":true,\"check_source\":\"yyy\",\"command\":[\"/usr/lib/nagios/plugins/check_procs\",\"-c\",\"400\",\"-w\",\"250\"],\"execution_end\":1576141528.909151,\"execution_start\":1576141528.880641,\"exit_status\":0.0,\"output\":\"PROCS OK: 151 processes \",\"performance_data\":[\"procs=151;250;400;0;\"],\"schedule_end\":1576141528.909239,\"schedule_start\":1576141528.8799999,\"state\":0.0,\"ttl\":0.0,\"type\":\"CheckResult\",\"vars_after\":{\"attempt\":1.0,\"reachable\":true,\"state\":0.0,\"state_type\":1.0},\"vars_before\":{\"attempt\":1.0,\"reachable\":true,\"state\":0.0,\"state_type\":1.0}},\"host\":\"yyy\",\"service\":\"procs\"},\"ts\":1576141528.909676}","secobj":{"name":"yyy!procs","type":"Service"},"timestamp":1576141528.909676}
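Each line in the excerpt above is a numeric prefix followed by a JSON envelope whose "message" field is itself a JSON-RPC message encoded as a string, so it has to be decoded twice. A quick sketch for unpacking such a line (the sample used here is a heavily shortened, hypothetical version of the real line above):

```python
import json

# Shortened, hypothetical replay-log line: "<numeric prefix>:<JSON envelope>"
line = '897:{"message":"{\\"jsonrpc\\":\\"2.0\\",\\"method\\":\\"event::CheckResult\\"}","timestamp":1576141528.909676}'

prefix, payload = line.split(":", 1)      # strip the numeric prefix
envelope = json.loads(payload)            # outer envelope
inner = json.loads(envelope["message"])   # the JSON-RPC message is double-encoded
print(inner["method"])                    # -> event::CheckResult
```

This makes it easy to grep a replay log for the methods and timestamps that should have been replayed.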

The service definition:

apply Service "processes" {
  import "generic-service"

  check_command = "procs"
  command_endpoint = host.vars.remote_client

  /* Set W and C Values for all Hosts. */
  vars.procs_warning = 1500
  vars.procs_critical = 2000

  /* assign where a remote client is set */
  assign where host.vars.remote_client
}
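Note that this rule sets command_endpoint, i.e. the second case dgoetz describes: the master schedules the check and the agent only executes it. For comparison, the first (agent-scheduled) case would roughly correspond to dropping command_endpoint and deploying the rule into the agent's own zone. A sketch of that variant follows; the zones.d path and zone name are assumptions for illustration, not taken from this issue:

```
/* Hypothetical: placed in /etc/icinga2/zones.d/<agent-zone>/services.conf
 * on the master and synced to the agent via config sync. */
apply Service "processes" {
  import "generic-service"

  check_command = "procs"
  /* no command_endpoint: the agent owning this zone schedules the check
   * itself and keeps running it even while the master is unreachable */

  vars.procs_warning = 1500
  vars.procs_critical = 2000

  assign where host.vars.remote_client
}
```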

schuettloeffel-elsa avatar Dec 12 '19 09:12 schuettloeffel-elsa

@lippserd this could correlate with the observation you've told me about.

dnsmichi avatar Dec 18 '19 15:12 dnsmichi

@dnsmichi @lippserd Which observation?

Al2Klimov avatar Mar 18 '20 13:03 Al2Klimov

@dnsmichi @lippserd Which observation?

Performance data is missing in cluster setups, and this issue could be one of the reasons. Since @htriem is working on a setup to reproduce this, I'll hand this over.

lippserd avatar Mar 19 '20 08:03 lippserd

Any progress here?

schuettloeffel-elsa avatar Aug 28 '20 07:08 schuettloeffel-elsa

@htriem Give us a status update.

Al2Klimov avatar Sep 28 '21 16:09 Al2Klimov