cluster-zone check reports mysterious zone lag values (at least for command_endpoint agents)

Open julianbrost opened this issue 2 years ago • 2 comments

With an agent that only executes checks via command_endpoint, and the cluster-zone check for that agent's zone running in the parent zone (check_command = "cluster-zone", vars.cluster_zone = "agent-1", zone = "master"), the check can report bogus zone lag values once that agent goes offline:

Zone 'agent-1' is not connected. Log lag: 6 days, 19 hours, 48 minutes and 4 seconds

The zone lag value is determined using the ApiListener::CalculateZoneLag() function: https://github.com/Icinga/icinga2/blob/1c0a13c82bca59a3a3e414eec2cad9c379bbf5bf/lib/methods/clusterzonechecktask.cpp#L148

That function in turn uses Endpoint::GetRemoteLogPosition(), but only returns a lag if the endpoint is not connected or currently syncing (see the if, which explains why you have to stop the agent to see strange values):

https://github.com/Icinga/icinga2/blob/1c0a13c82bca59a3a3e414eec2cad9c379bbf5bf/lib/remote/apilistener.cpp#L1697-L1706
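To make the rule concrete, here is a minimal Python model of the behaviour described above. It is a sketch and not the actual C++ code: the function name, the parameters and the timestamps are made up for illustration, and treating a position of 0 as "never set" is an assumption.

```python
def calculate_zone_lag(now, remote_log_position, connected, syncing):
    """Simplified model of the lag rule described above (not the real C++ code)."""
    # Assumption: a remote log position of 0 means "never set" and yields no lag.
    if remote_log_position == 0:
        return 0.0
    # A lag is only reported while the endpoint is disconnected or syncing.
    if syncing or not connected:
        return now - remote_log_position
    return 0.0


now = 1_701_935_120.0
stale = now - 6 * 86400  # hypothetical: last "ts" was seen six days ago

online_lag = calculate_zone_lag(now, stale, connected=True, syncing=False)
offline_lag = calculate_zone_lag(now, stale, connected=False, syncing=False)
print(online_lag, offline_lag)
```

In this model a stale remote log position stays invisible while the agent is connected and only surfaces as a large lag once it disconnects, which matches the symptom.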

The remote log position is only set when receiving a JSON-RPC message from the endpoint with a "ts" value set:

https://github.com/Icinga/icinga2/blob/1c0a13c82bca59a3a3e414eec2cad9c379bbf5bf/lib/remote/jsonrpcconnection.cpp#L250-L262

However, most messages sent by the agent in that scenario don't have "ts" set, as one can observe when running the agent with icinga2 daemon -DInternal.DebugJsonRpc=1. The following are sent messages (marked by >> in the full output, omitted here as it completely messes up the JSON syntax highlighting):

{"jsonrpc":"2.0","method":"event::Heartbeat","params":{}}
{"jsonrpc":"2.0","method":"log::SetLogPosition","params":{"log_position":1701935120.892426}}
{"jsonrpc":"2.0","method":"event::CheckResult","params":{"cr":{...},"host":"agent-1","service":"icinga-cluster"}}
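The effect can be replayed offline with a small Python sketch. It assumes that "ts" is a top-level key of the message and that the receiver simply stores it when present; whether the real code also compares it against the previous value is not modelled. The third message, including its method name and "ts" value, is hypothetical and only there to show the contrast. Note that the log_position inside params is not the same thing as "ts" in this model.

```python
import json

# Two of the messages quoted above, verbatim, plus one hypothetical message
# carrying "ts" (its method name and value are made up for illustration).
raw_messages = [
    '{"jsonrpc":"2.0","method":"event::Heartbeat","params":{}}',
    '{"jsonrpc":"2.0","method":"log::SetLogPosition","params":{"log_position":1701935120.892426}}',
    '{"jsonrpc":"2.0","method":"example::Hypothetical","params":{},"ts":1701346236.0}',
]


def replay(raw, remote_log_position=0.0):
    """Return the position after applying the 'only update on ts' rule."""
    for line in raw:
        message = json.loads(line)
        # Only messages with a top-level "ts" move the remote log position.
        if "ts" in message:
            remote_log_position = message["ts"]
    return remote_log_position


without_ts = replay(raw_messages[:2])  # only the messages observed from the agent
with_ts = replay(raw_messages)         # includes the hypothetical "ts" message
print(without_ts, with_ts)
```

With only the observed messages the position never leaves its initial value; a single message with "ts" is enough to pin it to that timestamp until another such message arrives.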

One puzzle piece is still missing: there must be some (rare) messages that do have "ts" set, otherwise the remote log position would never become non-zero. I don't yet know what these are in this scenario.

All tested with the current master branch (1c0a13c82bca59a3a3e414eec2cad9c379bbf5bf at the time of writing)

ref/IP/48662

julianbrost avatar Dec 07 '23 10:12 julianbrost

ref/NC/796479

tbauriedel avatar Dec 07 '23 11:12 tbauriedel

In case someone set Endpoint#log_duration = 0 only after a while, or even converted the node in question from a satellite to an agent, I'd not report such a lag at all when Endpoint#log_duration = 0. OK?

Al2Klimov avatar Dec 11 '23 12:12 Al2Klimov