
Replay-log not writing performance data

Open schuettloeffel-elsa opened this issue 5 years ago • 9 comments

Describe the bug

In a single-master setup, when the master is down, replay logs are written on all agents to /var/lib/icinga2/api/log/current; but once the master is up again, no performance data is replayed to our InfluxDB. The state history is also empty for that period.

It seems replaying to the master is working:

Agent's log

[2019-12-09 12:21:12 +0100] information/ApiListener: Sending replay log for endpoint 'xxx' in zone 'master'.
[2019-12-09 12:21:12 +0100] notice/ApiListener: Replaying log: /var/lib/icinga2/api/log/current
[2019-12-09 12:21:12 +0100] information/ApiListener: Replayed 129 messages.

Master's log

[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'yyy'.
[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'log::SetLogPosition' message from identity 'yyy'.
[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'yyy'.
[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'event::CheckResult' message from identity 'yyy'.
[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'yyy'.
...

But for some reason the master does not process the replay logs after receiving them.

To Reproduce

  1. On the master, stop the icinga2 service
  2. On the agents, wait for the replay log to fill up
  3. Start the icinga2 service on the master again
  4. Check for history data covering the downtime

Expected behavior

  1. Performance data is replayed to the data backends (e.g. InfluxDB)
  2. State histories are updated

Your Environment

  • Version used (icinga2 --version): r2.11.2-1
  • Operating System and version: Ubuntu 18.04.3 LTS (Bionic Beaver)
  • Enabled features (icinga2 feature list): api checker command compatlog debuglog ido-mysql influxdb mainlog notification statusdata
  • Config validation (icinga2 daemon -C): all fine
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
Object 'yyy' of type 'Endpoint':
  % declared in '/etc/icinga2/zones.conf', lines 60:1-60:42
  * __name = "yyy"
  * host = "yyy"
    % = modified in '/etc/icinga2/zones.conf', lines 60:45-60:75
  * log_duration = 86400
  * name = "yyy"
  * package = "_etc"
  * port = "5665"
  * source_location
    * first_column = 1
    * first_line = 60
    * last_column = 42
    * last_line = 60
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "yyy" ]
    % = modified in '/etc/icinga2/zones.conf', lines 60:1-60:42
  * type = "Endpoint"
  * zone = ""

schuettloeffel-elsa avatar Dec 09 '19 11:12 schuettloeffel-elsa

Hi,

do you see InfluxDB metrics being collected for these check results? This can be seen and correlated in the debug log as well.

@lippserd Maybe there's something weird going on with the JSON messages here as well.

dnsmichi avatar Dec 10 '19 09:12 dnsmichi

@dnsmichi: no, the next log entry for that agent writing to InfluxDB is:

[2019-12-09 12:21:29 +0100] debug/InfluxdbWriter: Checkable 'yyy!workers' adds to metric list:'procs,hostname=yyy,service=workers,metric=procs crit=2500,value=6,warn=1850 1575890489'.

And that is already new data from after the master came back up.

schuettloeffel-elsa avatar Dec 12 '19 08:12 schuettloeffel-elsa

Does the agent schedule the checks itself, or is it only used as a command endpoint? In the first case replay should work; in the second, no checks are executed while disconnected, because the master schedules the checks but cannot tell the endpoint to execute them. I cannot tell from the issue how this is set up, so can you provide at least one service configuration?

dgoetz avatar Dec 12 '19 08:12 dgoetz

@dgoetz: Since the agent's replay log is being filled during the master downtime, I assumed the agent still runs checks.

Here is an excerpt from the replay log where you can see performance data being written:

897:{"message":"{\"jsonrpc\":\"2.0\",\"method\":\"event::CheckResult\",\"params\":{\"cr\":{\"active\":true,\"check_source\":\"yyy\",\"command\":[\"/usr/lib/nagios/plugins/check_procs\",\"-c\",\"400\",\"-w\",\"250\"],\"execution_end\":1576141528.909151,\"execution_start\":1576141528.880641,\"exit_status\":0.0,\"output\":\"PROCS OK: 151 processes \",\"performance_data\":[\"procs=151;250;400;0;\"],\"schedule_end\":1576141528.909239,\"schedule_start\":1576141528.8799999,\"state\":0.0,\"ttl\":0.0,\"type\":\"CheckResult\",\"vars_after\":{\"attempt\":1.0,\"reachable\":true,\"state\":0.0,\"state_type\":1.0},\"vars_before\":{\"attempt\":1.0,\"reachable\":true,\"state\":0.0,\"state_type\":1.0}},\"host\":\"yyy\",\"service\":\"procs\"},\"ts\":1576141528.909676}","secobj":{"name":"yyy!procs","type":"Service"},"timestamp":1576141528.909676}
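Each line in the excerpt above is a numeric prefix followed by a JSON envelope whose "message" field is itself a JSON-RPC message encoded as a string, so it has to be decoded twice. A quick sketch for unpacking such a line (the sample used here is a heavily shortened, hypothetical version of the real line above):

```python
import json

# Shortened, hypothetical replay-log line: "<numeric prefix>:<JSON envelope>"
line = '897:{"message":"{\\"jsonrpc\\":\\"2.0\\",\\"method\\":\\"event::CheckResult\\"}","timestamp":1576141528.909676}'

prefix, payload = line.split(":", 1)      # strip the numeric prefix
envelope = json.loads(payload)            # outer envelope
inner = json.loads(envelope["message"])   # the JSON-RPC message is double-encoded
print(inner["method"])                    # -> event::CheckResult
```

This makes it easy to grep a replay log for the methods and timestamps that should have been replayed.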

The service definition:

apply Service "processes" {
  import "generic-service"

  check_command = "procs"
  command_endpoint = host.vars.remote_client

  /* Set W and C Values for all Hosts. */
  vars.procs_warning = 1500
  vars.procs_critical = 2000

  /* assign where a remote client is set */
  assign where host.vars.remote_client
}
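Note that this rule sets command_endpoint, i.e. the second case dgoetz describes: the master schedules the check and the agent only executes it. For comparison, the first (agent-scheduled) case would roughly correspond to dropping command_endpoint and deploying the rule into the agent's own zone. A sketch of that variant follows; the zones.d path and zone name are assumptions for illustration, not taken from this issue:

```
/* Hypothetical: placed in /etc/icinga2/zones.d/<agent-zone>/services.conf
 * on the master and synced to the agent via config sync. */
apply Service "processes" {
  import "generic-service"

  check_command = "procs"
  /* no command_endpoint: the agent owning this zone schedules the check
   * itself and keeps running it even while the master is unreachable */

  vars.procs_warning = 1500
  vars.procs_critical = 2000

  assign where host.vars.remote_client
}
```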

schuettloeffel-elsa avatar Dec 12 '19 09:12 schuettloeffel-elsa

@lippserd this could correlate with the observation you've told me about.

dnsmichi avatar Dec 18 '19 15:12 dnsmichi

@dnsmichi @lippserd Which observation?

Al2Klimov avatar Mar 18 '20 13:03 Al2Klimov

@dnsmichi @lippserd Which observation?

Performance data is missing in cluster setups, and this issue could be one of the reasons. Since @htriem is working on a setup to reproduce this, I'll hand this over.

lippserd avatar Mar 19 '20 08:03 lippserd

Any progress here?

schuettloeffel-elsa avatar Aug 28 '20 07:08 schuettloeffel-elsa

@htriem Give us a status update.

Al2Klimov avatar Sep 28 '21 16:09 Al2Klimov