icinga2 icon indicating copy to clipboard operation
icinga2 copied to clipboard

warning/JsonRpcConnection: Error while sending JSON-RPC message for identity (satellite) [2.13.7]

Open petr-fischer opened this issue 1 year ago • 8 comments

Describe the bug

Our master-satellite sync is breaking with this error/stacktrace (from the master icinga2.log):

[2023-06-12 16:58:32 +0200] information/ApiListener: Reconnecting to endpoint 'xxx-satellite.domain.cz' via host 'xxx.yyy.zzz' and port '5665'
[2023-06-12 16:58:32 +0200] information/ApiListener: New client connection for identity 'xxx-satellite.domain.cz' to [xxx.yyy.zzz]:5665
[2023-06-12 16:58:32 +0200] information/ApiListener: Finished reconnecting to endpoint 'xxx-satellite.domain.cz' via host 'xxx.yyy.zzz' and port '5665'
[2023-06-12 16:58:32 +0200] information/ApiListener: Sending config updates for endpoint 'xxx-satellite.domain.cz' in zone 'xxx'.
[2023-06-12 16:58:32 +0200] information/ApiListener: Syncing configuration files for zone 'xxx' to endpoint 'xxx-satellite.domain.cz'.
[2023-06-12 16:58:32 +0200] information/ApiListener: Syncing configuration files for global zone 'director-global' to endpoint 'xxx-satellite.domain.cz'.
[2023-06-12 16:58:32 +0200] information/ApiListener: Finished sending config file updates for endpoint 'xxx-satellite.domain.cz' in zone 'xxx'.
[2023-06-12 16:58:32 +0200] information/ApiListener: Syncing runtime objects to endpoint 'xxx-satellite.domain.cz'.
[2023-06-12 16:58:32 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'xxx-satellite.domain.cz'
Error: Connection reset by peer

Stacktrace:
 0# __cxa_throw in /usr/lib64/icinga2/sbin/icinga2
 1# 0x000000000086B7BA in /usr/lib64/icinga2/sbin/icinga2
 2# icinga::JsonRpcConnection::WriteOutgoingMessages(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) in /usr/lib64/icinga2/sbin/icinga2
 3# 0x0000000000AE4456 in /usr/lib64/icinga2/sbin/icinga2
 4# 0x0000000000AE5910 in /usr/lib64/icinga2/sbin/icinga2
 5# make_fcontext in /lib64/libboost_context.so.1.69.0
[2023-06-12 16:58:32 +0200] warning/JsonRpcConnection: API client disconnected for identity 'xxx-satellite.domain.cz'
[2023-06-12 16:58:32 +0200] warning/ApiListener: Removing API client for endpoint 'xxx-satellite.domain.cz'. 0 API clients left.
[2023-06-12 16:58:32 +0200] information/ApiListener: Finished syncing runtime objects to endpoint 'xxx-satellite.domain.cz'.
[2023-06-12 16:58:32 +0200] information/ApiListener: Finished sending runtime config updates for endpoint 'xxx-satellite.domain.cz' in zone 'xxx'.
[2023-06-12 16:58:32 +0200] information/ApiListener: Sending replay log for endpoint 'xxx-satellite.domain.cz' in zone 'xxx'.
[2023-06-12 16:58:32 +0200] information/ApiListener: Finished sending replay log for endpoint 'xxx-satellite.domain.cz' in zone 'xxx'.
[2023-06-12 16:58:32 +0200] information/ApiListener: Finished syncing endpoint 'xxx-satellite.domain.cz' in zone 'xxx'.

Facts:

  • error in the log occurs on the master
  • 'xxx-satellite.domain.cz' is the satellite
  • 'xxx' zone is the zone for hosts and checks, that are running on the xxx-satellite.domain.cz satellite
  • both master and satellite are on the latest icinga 2.13.7 version
  • after satellite restart (systemctl reload icinga2), error disappears from the master icinga2.log for a moment, the next several synchronisations are OK, but within about an hour the error starts to reappear in the log
  • when this error occurs in the master log, our services from zone 'xxx' are starting to switch to the UNKNOWN (purple) state in the icinga web or they seems like stopped executing
  • we have also another, not yet updated satellite still on version 2.10.7, and there is no error messages for this old satellite in the master

To Reproduce

I don't know how to reproduce it. If you need our exact configuration of master, satellites and agents, we can anonymize it and send, but it's probably not necessary.

icinga2 --version

icinga2 - The Icinga 2 network monitoring daemon (version: r2.13.7-1)

Copyright (c) 2012-2023 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <https://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-1160.83.1.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 11.2.1
  Build host: runner-hh8q3bz2-project-575-concurrent-0
  OpenSSL version: OpenSSL 1.0.2k-fips  26 Jan 2017

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

cat /etc/os-release

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

icinga2 feature list

Disabled features: compatlog debuglog elasticsearch gelf graphite icingadb influxdb2 livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker command ido-pgsql influxdb mainlog notification

Icinga Web 2 version and modules

Icinga Web 2 Version 	2.11.4
Git commit 	11453bfa92a70a44efbf7f966f5e7f27e9300a28
PHP Version 	7.3.33
Git commit date 	2023-01-26

Loaded Libraries
icinga/icinga-php-library 	0.11.0
icinga/icinga-php-thirdparty 	0.11.0

Loaded Modules
director 		1.10.2 	Configure
fileshipper 		1.1.0 	Configure
grafana 		1.3.6 	Configure
incubator 		0.20.0 	Configure
monitoring 		2.11.4 	Configure

icinga2 object list --type Endpoint

Output is over 6000 lines, too long.

icinga2 object list --type Zone

Output is over 6000 lines, too long.

yum list installed | grep boost (boost libs)

boost-program-options.x86_64
boost-regex.x86_64          1.53.0-28.el7     @CentOS7_os_x86_64     
boost-system.x86_64         1.53.0-28.el7     @CentOS7_os_x86_64     
boost-thread.x86_64         1.53.0-28.el7     @CentOS7_os_x86_64     
boost169-chrono.x86_64      1.69.0-2.el7      @epel                             
boost169-context.x86_64     1.69.0-2.el7      @epel                             
boost169-coroutine.x86_64   1.69.0-2.el7      @epel                             
boost169-date-time.x86_64   1.69.0-2.el7      @epel                             
boost169-filesystem.x86_64  1.69.0-2.el7      @epel                             
boost169-iostreams.x86_64   1.69.0-2.el7      @DataSpring_EPEL_7Server_x86_64   
boost169-program-options.x86_64
boost169-regex.x86_64       1.69.0-2.el7      @epel                             
boost169-system.x86_64      1.69.0-2.el7      @epel                             
boost169-thread.x86_64      1.69.0-2.el7      @epel

Additional context

There are similar bugs here, like #9153, but the error is different (broken pipe).

petr-fischer avatar Jun 12 '23 23:06 petr-fischer

What does xxx-satellite.domain.cz log when this happens?

With the stacktrace, the error probably looks more severe than it actually is. In itself, it just says that a connection got closed. Question is why that happened and why it doesn't get reestablished properly.

julianbrost avatar Jun 13 '23 07:06 julianbrost

We are planning to upgrade to 2.14 - if the error persists on 2.14, I will send the logs.

petr-fischer avatar Aug 02 '23 15:08 petr-fischer

ref/NC/792281

carraroj avatar Aug 14 '23 17:08 carraroj

Are there any updates regarding this issue? We also experience this issue from week to week and after icinga2 logs the error, the amount of overdue checks on the satellite grows immediately. Only a restart of the instance helps.

Tqnsls avatar Oct 09 '23 08:10 Tqnsls

ref/NC/804054

widhalmt avatar Dec 21 '23 10:12 widhalmt

Since Friday, April 26., we have the same problem. We have 2 Master and 16 Checker (8 Checker Zones + Master). One master and two checker of one zone now didn't syncronise with the config-master. The other seven checker-zones are working fine.

[2024-04-30 14:10:00 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'checker1' Error: Connection reset by peer Stacktrace: 0# __cxa_throw in /usr/lib64/icinga2/sbin/icinga2 1# 0x00000000008C3B8C in /usr/lib64/icinga2/sbin/icinga2 2# icinga::JsonRpcConnection::WriteOutgoingMessages(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) in /usr/lib64/icinga2/sbin/icinga2 3# 0x0000000000B3DC27 in /usr/lib64/icinga2/sbin/icinga2 4# 0x0000000000B3EBCF in /usr/lib64/icinga2/sbin/icinga2 5# make_fcontext in /usr/lib64/icinga-boost/libboost_context.so.1.69.0 [2024-04-30 14:10:01 +0200] warning/JsonRpcConnection: API client disconnected for identity 'checker1' [2024-04-30 14:10:01 +0200] warning/ApiListener: Removing API client for endpoint 'checker1'. 0 API clients left.

[2024-04-30 12:11:59 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity '2nd-master' Error: Broken pipe Stacktrace: 0# __cxa_throw in /usr/lib64/icinga2/sbin/icinga2 1# 0x00000000008C3B8C in /usr/lib64/icinga2/sbin/icinga2 2# icinga::JsonRpcConnection::WriteOutgoingMessages(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) in /usr/lib64/icinga2/sbin/icinga2 3# 0x0000000000B3DC27 in /usr/lib64/icinga2/sbin/icinga2 4# 0x0000000000B3EBCF in /usr/lib64/icinga2/sbin/icinga2 5# make_fcontext in /usr/lib64/icinga-boost/libboost_context.so.1.69.0 [2024-04-30 12:11:59 +0200] warning/JsonRpcConnection: API client disconnected for identity '2nd-master' [2024-04-30 12:11:59 +0200] warning/ApiListener: Removing API client for endpoint '2nd-master'. 0 API clients left.`

This happens every minute:

[2024-04-30 14:25:00 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'checker1' [2024-04-30 14:25:17 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'checker2' [2024-04-30 14:25:18 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity '2nd-master' [2024-04-30 14:25:27 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'checker1' [2024-04-30 14:25:56 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'checker1' [2024-04-30 14:26:02 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity '2nd-master' [2024-04-30 14:26:27 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'checker1' [2024-04-30 14:26:48 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity '2nd-master' [2024-04-30 14:26:57 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'checker1' [2024-04-30 14:27:27 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'checker1' [2024-04-30 14:27:33 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity '2nd-master' [2024-04-30 14:27:47 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'checker2' [2024-04-30 14:27:57 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'checker1' [2024-04-30 14:28:19 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity '2nd-master' [2024-04-30 14:28:27 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'checker1' [2024-04-30 14:28:46 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'checker2' [2024-04-30 14:28:57 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'checker1' [2024-04-30 14:29:03 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity '2nd-master' [2024-04-30 14:29:27 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'checker1' [2024-04-30 14:29:48 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity '2nd-master' [2024-04-30 14:29:57 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'checker1' [2024-04-30 14:30:28 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'checker1' [2024-04-30 14:30:34 +0200] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity '2nd-master'

I updated both master an the affected zone to version: r2.14.2-1. The other zones are running at r2.14.0-1, 2.13.2-1 and r2.10.5-1

Skap81 avatar Apr 30 '24 12:04 Skap81

I found a error in the debug.log:

[2024-05-07 12:43:48 +0200] notice/JsonRpcConnection: Received 'log::SetLogPosition' message from identity 'master'.
[2024-05-07 12:43:48 +0200] notice/JsonRpcConnection: Error while reading JSON-RPC message for identity 'master': Error: Length specifier must not exceed 9 characters
[2024-05-07 12:43:48 +0200] warning/JsonRpcConnection: API client disconnected for identity 'master'
[2024-05-07 12:43:48 +0200] warning/ApiListener: Removing API client for endpoint 'master'. 0 API clients left.

What´s that: "Length specifier must not exceed 9 characters"

Skap81 avatar May 07 '24 10:05 Skap81

@Skap81 Afaik, the message stream between two Icinga2 instances is Netstring encoded and the message would suggest, that message is rejected if the length specifier exceeds 9 characters, meaning more than 1000000000 Bytes (1GiB). Not sure, how it gets that big though.

RincewindsHat avatar Jun 20 '24 13:06 RincewindsHat