Replay-log not writing performance data
Describe the bug
In a single-master setup, while the master is down, replay logs are written on all agents to /var/lib/icinga2/api/log/current; but once the master is back up, no performance data for that period shows up in our InfluxDB. The state history is also empty for that time.
It seems replaying to the master is working:
Agent log:
[2019-12-09 12:21:12 +0100] information/ApiListener: Sending replay log for endpoint 'xxx' in zone 'master'.
[2019-12-09 12:21:12 +0100] notice/ApiListener: Replaying log: /var/lib/icinga2/api/log/current
[2019-12-09 12:21:12 +0100] information/ApiListener: Replayed 129 messages.
Master log:
[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'yyy'.
[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'log::SetLogPosition' message from identity 'yyy'.
[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'yyy'.
[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'event::CheckResult' message from identity 'yyy'.
[2019-12-09 12:21:12 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'yyy'.
...
But for some reason the master is not processing the replay logs after receiving them.
To Reproduce
- On the master, stop the icinga2 service
- On the agents, wait for the replay log to fill up
- Start the icinga2 service on the master again
- Check for history data covering the downtime (see the sketch below)
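A minimal shell sketch of the steps above, assuming systemd-managed icinga2 services and the replay-log path from the description:

# On the master: stop Icinga 2 to simulate the outage
# (assumes the systemd unit is simply called icinga2, as on a default Ubuntu install)
systemctl stop icinga2

# On an agent: watch the replay log grow while the master is unreachable
watch -n 10 'wc -c /var/lib/icinga2/api/log/current'

# On the master: bring Icinga 2 back up
systemctl start icinga2

# Afterwards: check InfluxDB and the state history for data covering the downtime window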
Expected behavior
- Performance data is replayed to the data backends (e.g. InfluxDB)
- State histories are updated
Your Environment
- Version used (`icinga2 --version`): r2.11.2-1
- Operating System and version: Ubuntu 18.04.3 LTS (Bionic Beaver)
- Enabled features (`icinga2 feature list`): api checker command compatlog debuglog ido-mysql influxdb mainlog notification statusdata
- Config validation (`icinga2 daemon -C`): all fine
- If you run multiple Icinga 2 instances, the `zones.conf` file (or `icinga2 object list --type Endpoint` and `icinga2 object list --type Zone`) from all affected nodes:
Object 'yyy' of type 'Endpoint':
  % declared in '/etc/icinga2/zones.conf', lines 60:1-60:42
  * __name = "yyy"
  * host = "yyy"
    % = modified in '/etc/icinga2/zones.conf', lines 60:45-60:75
  * log_duration = 86400
  * name = "yyy"
  * package = "_etc"
  * port = "5665"
  * source_location
    * first_column = 1
    * first_line = 60
    * last_column = 42
    * last_line = 60
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "yyy" ]
    % = modified in '/etc/icinga2/zones.conf', lines 60:1-60:42
  * type = "Endpoint"
  * zone = ""
Hi,
do you see the InfluxDB metrics being collected from these check results? This can be seen/correlated in the debug log as well.
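One way to correlate this on the master, as a sketch (assuming the debuglog feature writes to the default /var/log/icinga2/debug.log and that 'yyy' stands for the agent in question):

# Did the InfluxdbWriter pick up check results for that agent?
# (default debug log path assumed; adjust if your DebugLog object is configured differently)
grep 'InfluxdbWriter' /var/log/icinga2/debug.log | grep 'hostname=yyy'

# Did the replayed event::CheckResult messages arrive from that agent?
grep "event::CheckResult' message from identity 'yyy'" /var/log/icinga2/debug.log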
@lippserd Maybe there's something weird going on with the JSON messages here as well.
@dnsmichi: no, the next debug log entry for that agent writing to InfluxDB is:
[2019-12-09 12:21:29 +0100] debug/InfluxdbWriter: Checkable 'yyy!workers' adds to metric list:'procs,hostname=yyy,service=workers,metric=procs crit=2500,value=6,warn=1850 1575890489'.
And that is already new data from after the master came back up.
Does the agent schedule the checks itself, or is it only used as a command endpoint? In the first case it should work; in the second case no checks are executed while disconnected, because the master schedules the checks but cannot tell the endpoint to execute them. I cannot tell from the issue how this is set up, so can you provide at least one service configuration?
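A quick way to tell the two setups apart on an agent, as a sketch (relying on the fact that with pure command-endpoint execution the agent holds no Service objects of its own):

# On the agent: list the locally known Service objects
icinga2 object list --type Service

# Empty output suggests a pure command-endpoint setup: the master schedules the checks,
# so nothing is executed (or logged) on the agent while it is disconnected.
# Non-empty output suggests the agent schedules and executes the checks itself.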
@dgoetz: Since the agent's replay log is being filled during the master's downtime, I assumed the agent still runs the checks itself.
Here is an excerpt of the replay log where you can see performance data being recorded:
897:{"message":"{\"jsonrpc\":\"2.0\",\"method\":\"event::CheckResult\",\"params\":{\"cr\":{\"active\":true,\"check_source\":\"yyy\",\"command\":[\"/usr/lib/nagios/plugins/check_procs\",\"-c\",\"400\",\"-w\",\"250\"],\"execution_end\":1576141528.909151,\"execution_start\":1576141528.880641,\"exit_status\":0.0,\"output\":\"PROCS OK: 151 processes \",\"performance_data\":[\"procs=151;250;400;0;\"],\"schedule_end\":1576141528.909239,\"schedule_start\":1576141528.8799999,\"state\":0.0,\"ttl\":0.0,\"type\":\"CheckResult\",\"vars_after\":{\"attempt\":1.0,\"reachable\":true,\"state\":0.0,\"state_type\":1.0},\"vars_before\":{\"attempt\":1.0,\"reachable\":true,\"state\":0.0,\"state_type\":1.0}},\"host\":\"yyy\",\"service\":\"procs\"},\"ts\":1576141528.909676}","secobj":{"name":"yyy!procs","type":"Service"},"timestamp":1576141528.909676}
The service definition:
apply Service "processes" {
import "generic-service"
check_command = "procs"
command_endpoint = host.vars.remote_client
/* Set W and C Values for all Hosts. */
vars.procs_warning = 1500
vars.procs_critical = 2000
/* assign where a remote client is set */
assign where host.vars.remote_client
}
@lippserd this could correlate with the observation you've told me about.
@dnsmichi @lippserd Which observation?
Performance data is missing in a cluster setup, and this could be one of the reasons. Since @htriem is working on a setup to reproduce this, I'll hand it over.
Any progress here?
@htriem Give us a status update.