icingadb

retry deadline exceeded

Open mburring opened this issue 9 months ago • 2 comments

Describe the bug

We have two separate Icinga instances running identical configurations, and on both of them icingadb randomly crashes with a 'retry deadline exceeded' error.

Both of these installations are single master.

To Reproduce

Appears random

Expected behavior

That it doesn't happen

Your Environment


  • Icinga DB version: 1.2.1-1+ubuntu20.04
  • Icinga 2 version: 2.14.5-1+ubuntu20.04
  • Operating System and version: Ubuntu 20.04

Additional context

● icingadb.service - Icinga DB
     Loaded: loaded (/lib/systemd/system/icingadb.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Tue 2025-03-11 01:50:01 AEDT; 10h ago
    Process: 1112676 ExecStart=/usr/sbin/icingadb --config /etc/icingadb/config.yml (code=exited, status=1/FAILURE)
   Main PID: 1112676 (code=exited, status=1/FAILURE)

Mar 11 01:49:01 master1 icingadb[1112676]: heartbeat: Waiting for Icinga heartbeat
Mar 11 01:49:20 master1 icingadb[1112676]: history-sync: Synced 5 notification history items
Mar 11 01:49:20 master1 icingadb[1112676]: history-sync: Synced 36 state history items
Mar 11 01:49:40 master1 icingadb[1112676]: history-sync: Synced 33 state history items
Mar 11 01:49:40 master1 icingadb[1112676]: history-sync: Synced 4 notification history items
Mar 11 01:50:00 master1 icingadb[1112676]: history-sync: Synced 4 notification history items
Mar 11 01:50:00 master1 icingadb[1112676]: history-sync: Synced 32 state history items
Mar 11 01:50:01 master1 icingadb[1112676]: retry deadline exceeded
                                                                           github.com/icinga/icingadb/pkg/icingadb.(*HA).controller
                                                                                   github.com/icinga/icingadb/pkg/icingadb/ha.go:166
                                                                           runtime.goexit
                                                                                   runtime/asm_amd64.s:1700
                                                                           HA aborted
                                                                           github.com/icinga/icingadb/pkg/icingadb.(*HA).abort.func1
                                                                                   github.com/icinga/icingadb/pkg/icingadb/ha.go:134
                                                                           sync.(*Once).doSlow
                                                                                   sync/once.go:76
                                                                           sync.(*Once).Do
                                                                                   sync/once.go:67
                                                                           github.com/icinga/icingadb/pkg/icingadb.(*HA).abort
                                                                                   github.com/icinga/icingadb/pkg/icingadb/ha.go:132
                                                                           github.com/icinga/icingadb/pkg/icingadb.(*HA).controller
                                                                                   github.com/icinga/icingadb/pkg/icingadb/ha.go:166
                                                                           runtime.goexit
                                                                                   runtime/asm_amd64.s:1700
                                                                           HA exited with an error
                                                                           main.run
                                                                                   github.com/icinga/icingadb/cmd/icingadb/main.go:336
                                                                           main.main
                                                                                   github.com/icinga/icingadb/cmd/icingadb/main.go:37
                                                                           runtime.main
                                                                                   runtime/proc.go:272
                                                                           runtime.goexit
                                                                                   runtime/asm_amd64.s:1700
Mar 11 01:50:01 master1 systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Mar 11 01:50:01 master1 systemd[1]: icingadb.service: Failed with result 'exit-code'.

mburring, Mar 11 '25 01:03

Thanks for posting this issue.

Could you please provide the complete Icinga DB log from program start to crash with extended systemd journald fields? Please use either --output verbose or --output json as described here, https://icinga.com/docs/icinga-db/latest/doc/03-Configuration/#systemd-journald-fields.
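For anyone collecting these logs, here is a minimal Python sketch of how the extended fields could be pulled out of `journalctl --output json` output. The function name `icingadb_fields` and the sample entry are hypothetical. The only assumptions taken from this thread are that each journal entry is one JSON object per line and that Icinga DB's extra fields carry an `ICINGADB_` prefix, as seen in the verbose output further down.

```python
import json


def icingadb_fields(json_lines):
    """Yield (message, extra_fields) for each journal entry given as a JSON line.

    extra_fields contains only the keys starting with "ICINGADB_".
    """
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        entry = json.loads(line)
        extra = {k: v for k, v in entry.items() if k.startswith("ICINGADB_")}
        yield entry.get("MESSAGE", ""), extra


# Hypothetical sample entry, shaped like one line of `journalctl --output json`.
sample = [
    json.dumps(
        {
            "MESSAGE": "high-availability: Can't update or insert instance. Retrying",
            "ICINGADB_ERROR": "Error 1205 (HY000): Lock wait timeout exceeded",
            "PRIORITY": "4",
        }
    )
]

rows = list(icingadb_fields(sample))
for message, extra in rows:
    print(message, extra)
```

In practice the lines would come from something like `journalctl -u icingadb.service --output json`, piped into the script or read from a saved file.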

Furthermore, could you please post a redacted version of your Icinga DB configuration and tell us which SQL database server you are using, version included.

The logs start with the following line:

Mar 11 01:49:01 master1 icingadb[1112676]: heartbeat: Waiting for Icinga heartbeat

Is your Icinga 2 healthy? And how about your Redis?

oxzi, Mar 11 '25 08:03

Tue 2025-03-11 01:49:01.583741 AEDT [s=c37c97a5b0b04c05a06ab5e1eabff2a7;i=103ff6ad;b=f52e39aa9f7542ed859a9e8f612e52c2;m=10197455a31;t=62f>
    _SELINUX_CONTEXT=unconfined
    _BOOT_ID=f52e39aa9f7542ed859a9e8f612e52c2
    _MACHINE_ID=ed883a1210f14cbbae74c1b3fde55dc9
    _HOSTNAME=master1
    _TRANSPORT=journal
    _SYSTEMD_SLICE=system.slice
    _CAP_EFFECTIVE=0
    SYSLOG_IDENTIFIER=icingadb
    _PID=1112676
    _UID=116
    _GID=120
    _COMM=icingadb
    _EXE=/usr/sbin/icingadb
    _CMDLINE=/usr/sbin/icingadb --config /etc/icingadb/config.yml
    _SYSTEMD_CGROUP=/system.slice/icingadb.service
    _SYSTEMD_UNIT=icingadb.service
    _SYSTEMD_INVOCATION_ID=507ea6ed389d4e6d9f92ccce7ee098da
    PRIORITY=4
    MESSAGE=heartbeat: Waiting for Icinga heartbeat
    _SOURCE_REALTIME_TIMESTAMP=1741618141583741
Tue 2025-03-11 01:49:20.584464 AEDT [s=c37c97a5b0b04c05a06ab5e1eabff2a7;i=103ff6ae;b=f52e39aa9f7542ed859a9e8f612e52c2;m=101986747a8;t=62f>
    PRIORITY=6
    _SELINUX_CONTEXT=unconfined
    _BOOT_ID=f52e39aa9f7542ed859a9e8f612e52c2
    _MACHINE_ID=ed883a1210f14cbbae74c1b3fde55dc9
    _HOSTNAME=master1
    _TRANSPORT=journal
    _SYSTEMD_SLICE=system.slice
    _CAP_EFFECTIVE=0
    SYSLOG_IDENTIFIER=icingadb
    _PID=1112676
    _UID=116
    _GID=120
    _COMM=icingadb
    _EXE=/usr/sbin/icingadb
    _CMDLINE=/usr/sbin/icingadb --config /etc/icingadb/config.yml
    _SYSTEMD_CGROUP=/system.slice/icingadb.service
    _SYSTEMD_UNIT=icingadb.service
    _SYSTEMD_INVOCATION_ID=507ea6ed389d4e6d9f92ccce7ee098da
    MESSAGE=history-sync: Synced 5 notification history items
    _SOURCE_REALTIME_TIMESTAMP=1741618160584464
Tue 2025-03-11 01:49:20.584481 AEDT [s=c37c97a5b0b04c05a06ab5e1eabff2a7;i=103ff6af;b=f52e39aa9f7542ed859a9e8f612e52c2;m=10198674b7e;t=62f>
    PRIORITY=6
    _SELINUX_CONTEXT=unconfined
    _BOOT_ID=f52e39aa9f7542ed859a9e8f612e52c2
    _MACHINE_ID=ed883a1210f14cbbae74c1b3fde55dc9
    _HOSTNAME=master1
    _TRANSPORT=journal
    _SYSTEMD_SLICE=system.slice
    _CAP_EFFECTIVE=0
    SYSLOG_IDENTIFIER=icingadb
    _PID=1112676
    _UID=116
    _GID=120
    _COMM=icingadb
    _EXE=/usr/sbin/icingadb
    _CMDLINE=/usr/sbin/icingadb --config /etc/icingadb/config.yml
    _SYSTEMD_CGROUP=/system.slice/icingadb.service
    _SYSTEMD_UNIT=icingadb.service
    _SYSTEMD_INVOCATION_ID=507ea6ed389d4e6d9f92ccce7ee098da
    MESSAGE=history-sync: Synced 36 state history items
    _SOURCE_REALTIME_TIMESTAMP=1741618160584481
Tue 2025-03-11 01:49:40.585094 AEDT [s=c37c97a5b0b04c05a06ab5e1eabff2a7;i=103ff6b0;b=f52e39aa9f7542ed859a9e8f612e52c2;m=10199987701;t=62f>
    PRIORITY=6
    _SELINUX_CONTEXT=unconfined
    _BOOT_ID=f52e39aa9f7542ed859a9e8f612e52c2
    _MACHINE_ID=ed883a1210f14cbbae74c1b3fde55dc9
    _HOSTNAME=master1
    _TRANSPORT=journal
    _SYSTEMD_SLICE=system.slice
    _CAP_EFFECTIVE=0
    SYSLOG_IDENTIFIER=icingadb
    _PID=1112676
    _UID=116
    _GID=120
    _COMM=icingadb
    _EXE=/usr/sbin/icingadb
    _CMDLINE=/usr/sbin/icingadb --config /etc/icingadb/config.yml
    _SYSTEMD_CGROUP=/system.slice/icingadb.service
    _SYSTEMD_UNIT=icingadb.service
    _SYSTEMD_INVOCATION_ID=507ea6ed389d4e6d9f92ccce7ee098da
    MESSAGE=history-sync: Synced 33 state history items
    _SOURCE_REALTIME_TIMESTAMP=1741618180585094
Tue 2025-03-11 01:49:40.585790 AEDT [s=c37c97a5b0b04c05a06ab5e1eabff2a7;i=103ff6b1;b=f52e39aa9f7542ed859a9e8f612e52c2;m=101999879ac;t=62f>
    PRIORITY=6
    _SELINUX_CONTEXT=unconfined
    _BOOT_ID=f52e39aa9f7542ed859a9e8f612e52c2
    _MACHINE_ID=ed883a1210f14cbbae74c1b3fde55dc9
    _HOSTNAME=master1
    _TRANSPORT=journal
    _SYSTEMD_SLICE=system.slice
    _CAP_EFFECTIVE=0
    SYSLOG_IDENTIFIER=icingadb
    _PID=1112676
    _UID=116
    _GID=120
    _COMM=icingadb
    _EXE=/usr/sbin/icingadb
    _CMDLINE=/usr/sbin/icingadb --config /etc/icingadb/config.yml
    _SYSTEMD_CGROUP=/system.slice/icingadb.service
    _SYSTEMD_UNIT=icingadb.service
    _SYSTEMD_INVOCATION_ID=507ea6ed389d4e6d9f92ccce7ee098da
    MESSAGE=history-sync: Synced 4 notification history items
    _SOURCE_REALTIME_TIMESTAMP=1741618180585790
Tue 2025-03-11 01:50:00.584382 AEDT [s=c37c97a5b0b04c05a06ab5e1eabff2a7;i=103ff6b7;b=f52e39aa9f7542ed859a9e8f612e52c2;m=1019ac9a142;t=62f>
    PRIORITY=6
    _SELINUX_CONTEXT=unconfined
    _BOOT_ID=f52e39aa9f7542ed859a9e8f612e52c2
    _MACHINE_ID=ed883a1210f14cbbae74c1b3fde55dc9
    _HOSTNAME=master1
    _TRANSPORT=journal
    _SYSTEMD_SLICE=system.slice
    _CAP_EFFECTIVE=0
    SYSLOG_IDENTIFIER=icingadb
    _PID=1112676
    _UID=116
    _GID=120
    _COMM=icingadb
    _EXE=/usr/sbin/icingadb
    _CMDLINE=/usr/sbin/icingadb --config /etc/icingadb/config.yml
    _SYSTEMD_CGROUP=/system.slice/icingadb.service
    _SYSTEMD_UNIT=icingadb.service
    _SYSTEMD_INVOCATION_ID=507ea6ed389d4e6d9f92ccce7ee098da
    MESSAGE=history-sync: Synced 4 notification history items
    _SOURCE_REALTIME_TIMESTAMP=1741618200584382
Tue 2025-03-11 01:50:00.584424 AEDT [s=c37c97a5b0b04c05a06ab5e1eabff2a7;i=103ff6b8;b=f52e39aa9f7542ed859a9e8f612e52c2;m=1019ac9a2d0;t=62f>
    PRIORITY=6
    _SELINUX_CONTEXT=unconfined
    _BOOT_ID=f52e39aa9f7542ed859a9e8f612e52c2
    _MACHINE_ID=ed883a1210f14cbbae74c1b3fde55dc9
    _HOSTNAME=master1
    _TRANSPORT=journal
    _SYSTEMD_SLICE=system.slice
    _CAP_EFFECTIVE=0
    SYSLOG_IDENTIFIER=icingadb
    _PID=1112676
    _UID=116
    _GID=120
    _COMM=icingadb
    _EXE=/usr/sbin/icingadb
    _CMDLINE=/usr/sbin/icingadb --config /etc/icingadb/config.yml
    _SYSTEMD_CGROUP=/system.slice/icingadb.service
    _SYSTEMD_UNIT=icingadb.service
    _SYSTEMD_INVOCATION_ID=507ea6ed389d4e6d9f92ccce7ee098da
    MESSAGE=history-sync: Synced 32 state history items
    _SOURCE_REALTIME_TIMESTAMP=1741618200584424
Tue 2025-03-11 01:50:01.583004 AEDT [s=c37c97a5b0b04c05a06ab5e1eabff2a7;i=103ff6b9;b=f52e39aa9f7542ed859a9e8f612e52c2;m=1019ad8de32;t=62f>
    _SELINUX_CONTEXT=unconfined
    _BOOT_ID=f52e39aa9f7542ed859a9e8f612e52c2
    _MACHINE_ID=ed883a1210f14cbbae74c1b3fde55dc9
    _HOSTNAME=master1
    _TRANSPORT=journal
    _SYSTEMD_SLICE=system.slice
    _CAP_EFFECTIVE=0
    SYSLOG_IDENTIFIER=icingadb
    _PID=1112676
    _UID=116
    _GID=120
    _COMM=icingadb
    _EXE=/usr/sbin/icingadb
    _CMDLINE=/usr/sbin/icingadb --config /etc/icingadb/config.yml
    _SYSTEMD_CGROUP=/system.slice/icingadb.service
    _SYSTEMD_UNIT=icingadb.service
    _SYSTEMD_INVOCATION_ID=507ea6ed389d4e6d9f92ccce7ee098da
    PRIORITY=2
    MESSAGE=retry deadline exceeded
            github.com/icinga/icingadb/pkg/icingadb.(*HA).controller
                github.com/icinga/icingadb/pkg/icingadb/ha.go:166
            runtime.goexit
                runtime/asm_amd64.s:1700
            HA aborted
            github.com/icinga/icingadb/pkg/icingadb.(*HA).abort.func1
                github.com/icinga/icingadb/pkg/icingadb/ha.go:134
            sync.(*Once).doSlow
                sync/once.go:76
            sync.(*Once).Do
                sync/once.go:67
            github.com/icinga/icingadb/pkg/icingadb.(*HA).abort
                github.com/icinga/icingadb/pkg/icingadb/ha.go:132
            github.com/icinga/icingadb/pkg/icingadb.(*HA).controller
                github.com/icinga/icingadb/pkg/icingadb/ha.go:166
            runtime.goexit
                runtime/asm_amd64.s:1700
            HA exited with an error
            main.run
                github.com/icinga/icingadb/cmd/icingadb/main.go:336
            main.main
                github.com/icinga/icingadb/cmd/icingadb/main.go:37
            runtime.main
                runtime/proc.go:272
            runtime.goexit
                runtime/asm_amd64.s:1700
    _SOURCE_REALTIME_TIMESTAMP=1741618201583004
database:
  host: xxx
  port: 3306
  database: icingadb
  user: icingadb
  password: xxx
  tls: False
  ca: /usr/local/share/ca-certificates/xxx.crt
redis:
  host: localhost
  port: 6379
  password: xxx
  tls: true
  insecure: true 
logging:
  level: info
retention:
  history-days: 10
  sla-days: 10
  options:
    acknowledgement: 90
    comment: 365
    downtime: 90
    flapping: 10
    notification: 10
    state: 10

mariadb-server: 1:10.3.39-0ubuntu0.20.04.2

When this happens, Icinga and Redis are still running on both instances, with nothing of note in their logs. The fix has been to restart the icingadb service, and within a few days to a week the error occurs again. The three services all run on the same host, and the host itself is not under heavy load.

mburring, Mar 11 '25 08:03

Hello,

I have the same problem in my environment. I am running an Icinga cluster consisting of 2 master nodes with a HA database connection.

Environment

  • Icinga DB version: 1.3.0 (previous 1.2.0)
  • Icinga 2 version: 2.14.5-1 (previous 2.14.3)
  • Operating System and version: Ubuntu 22.04.5 LTS (GNU/Linux 5.15.0-136-generic x86_64)
  • Database: 10.5.27-MariaDB MariaDB Server, Icinga DB Schema v6 (previous v5)

The icingadb debug logs show that the heartbeat fails and the opposite instance can no longer be reached. Subsequently, the icingadb service crashes with the previously mentioned error message.

master instance A (icingadb debug logs)

[...]
XXX: high-availability: Can't update or insert instance. Retrying
XXX: heartbeat: Previous heartbeat not read from channel
XXX: heartbeat: Previous heartbeat not read from channel
XXX: heartbeat: Previous heartbeat not read from channel
XXX: Handing over 
[...]
XXX: heartbeat: Previous heartbeat not read from channel
XXX: heartbeat: Previous heartbeat not read from channel
XXX: retry deadline exceeded
  github.com/icinga/icingadb/pkg/icingadb.(*HA).controller
  github.com/icinga/icingadb/pkg/icingadb/ha.go:166
  runtime.goexit
  runtime/asm_amd64.s:1700
  HA aborted
[...]

master instance B (icingadb debug logs)

[...]
XXX: high-availability: Can't update or insert instance. Retrying
XXX: heartbeat: Previous heartbeat not read from channel
XXX: heartbeat: Previous heartbeat not read from channel
XXX: heartbeat: Previous heartbeat not read from channel
[...]

Although the two instances are running in a cluster and the icingadb service remains active on one host, the icinga web GUI shows a failure of the icingadb service and the loss of the database connection.

The behavior occurs randomly. The network connection between the two hosts is stable at all times. The problem existed both with v1.2.0 and with the current v1.3.0.

saiiman, Apr 16 '25 10:04

@saiiman: Thank you for sharing that this error happens on your system as well.

The "high-availability: Can't update or insert instance. Retrying" log message implies that the HA realization logic failed, most likely due to some failing SQL query. Could you please share the logs, especially this line, including the journald fields as described in our "Systemd Journald Fields" docs?
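To illustrate how a query that keeps failing can end in "retry deadline exceeded", here is a minimal Python sketch of a generic retry loop with a deadline. It is not Icinga DB's actual implementation (that lives in Go, in the HA code). The names `retry_with_deadline`, `always_locked`, the 300-second timeout and the 3-second backoff are all assumptions chosen for illustration.

```python
import time


class RetryDeadlineExceeded(Exception):
    """Raised when a retryable operation keeps failing until the deadline."""


def retry_with_deadline(operation, is_retryable, timeout,
                        sleep=time.sleep, clock=time.monotonic, backoff=0.1):
    """Call operation() until it succeeds, a non-retryable error occurs,
    or `timeout` seconds have passed since the first attempt."""
    deadline = clock() + timeout
    while True:
        try:
            return operation()
        except Exception as err:
            if not is_retryable(err):
                raise  # permanent errors are passed through unchanged
            if clock() >= deadline:
                raise RetryDeadlineExceeded("retry deadline exceeded") from err
            sleep(backoff)


# Hypothetical operation that always fails with a retryable SQL error.
def always_locked():
    raise RuntimeError("Error 1205 (HY000): Lock wait timeout exceeded")


# Fake clock so the demonstration finishes instantly (assumption: 300 s timeout).
fake_now = [0.0]


def fake_clock():
    return fake_now[0]


def fake_sleep(seconds):
    fake_now[0] += seconds


try:
    retry_with_deadline(always_locked, lambda e: "1205" in str(e),
                        timeout=300, sleep=fake_sleep, clock=fake_clock, backoff=3)
    outcome = "succeeded"
except RetryDeadlineExceeded as exc:
    outcome = str(exc)

print(outcome, "after", fake_now[0], "simulated seconds")
```

The point of the sketch is that the final error only reports that the deadline passed. The underlying cause is the error that was retried, which is why the line carrying the original SQL error matters for debugging.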

oxzi, Apr 16 '25 10:04

Hi, I hope the logs help to narrow down the problem.

instance A

Wed 2025-04-16 11:23:18.998176 CEST [xxx]
    MESSAGE=database: Executed "INSERT INTO \"history\" [...]
[...]
Wed 2025-04-16 11:24:57.638663 CEST [XXX]
    MESSAGE=high-availability: Can't update or insert instance. Retrying
    ICINGADB_ERROR=can't perform "SELECT id, heartbeat FROM icingadb_instance WHERE environment_id = ? AND responsible = ? AND id <> ? FOR UPDATE": Error 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
Wed 2025-04-16 11:24:57.707474 CEST [XXX]
    MESSAGE=heartbeat: Previous heartbeat not read from channel
    ICINGADB_PREVIOUS=2025-04-16 11:24:54.7082618 +0200 CEST
    ICINGADB_CURRENT=2025-04-16 11:24:57.70739022 +0200 CEST
[...]
Wed 2025-04-16 11:25:38.998488 CEST [XXX]
    MESSAGE=database: Executed "INSERT INTO \"history\" [...]
[...]
Wed 2025-04-16 11:28:57.708586 CEST [XXX]
    MESSAGE=heartbeat: Previous heartbeat not read from channel
    ICINGADB_PREVIOUS=2025-04-16 11:28:54.708037993 +0200 CEST
    ICINGADB_CURRENT=2025-04-16 11:28:57.708451543 +0200 CEST
Wed 2025-04-16 11:29:00.708297 CEST [XXX]
    MESSAGE=heartbeat: Previous heartbeat not read from channel
    ICINGADB_PREVIOUS=2025-04-16 11:28:57.708451543 +0200 CEST
    ICINGADB_CURRENT=2025-04-16 11:29:00.708199054 +0200 CEST
Wed 2025-04-16 11:29:00.708346 CEST [XXX]
    MESSAGE=retry deadline exceeded
            github.com/icinga/icingadb/pkg/icingadb.(*HA).controller
                github.com/icinga/icingadb/pkg/icingadb/ha.go:166
            runtime.goexit
                runtime/asm_amd64.s:1700
            HA aborted
            github.com/icinga/icingadb/pkg/icingadb.(*HA).abort.func1
                github.com/icinga/icingadb/pkg/icingadb/ha.go:134
            sync.(*Once).doSlow
                sync/once.go:78
            sync.(*Once).Do
                sync/once.go:69
            github.com/icinga/icingadb/pkg/icingadb.(*HA).abort
                github.com/icinga/icingadb/pkg/icingadb/ha.go:132
            github.com/icinga/icingadb/pkg/icingadb.(*HA).controller
                github.com/icinga/icingadb/pkg/icingadb/ha.go:166
            runtime.goexit
                runtime/asm_amd64.s:1700
            HA exited with an error
            main.run
                github.com/icinga/icingadb/cmd/icingadb/main.go:347
            main.main
                github.com/icinga/icingadb/cmd/icingadb/main.go:37
            runtime.main
                runtime/proc.go:283
            runtime.goexit
                runtime/asm_amd64.s:1700
#
# manual service Restart
#
[...]
Wed 2025-04-16 11:39:15.871471 CEST [XXX]
    ICINGADB_ENVIRONMENT=1bed87bd19e8ebdf1a9510152bbfe2b7b978c0fa
    MESSAGE=high-availability: Another instance is active
    ICINGADB_INSTANCE_ID=95b7bfadcac6478ba86e5faa50dd7cfa
    ICINGADB_HEARTBEAT=2025-04-16 11:39:14.062 +0200 CEST
    ICINGADB_HEARTBEAT_AGE=1.809399542s

instance B

Wed 2025-04-16 11:32:47.859228 CEST [XXX]
    MESSAGE=heartbeat: Previous heartbeat not read from channel
    ICINGADB_PREVIOUS=2025-04-16 11:32:44.859798016 +0200 CEST
    ICINGADB_CURRENT=2025-04-16 11:32:47.859152456 +0200 CEST
Wed 2025-04-16 11:32:47.863733 CEST [XXX]
    MESSAGE=history-sync: Synced 1 state history items
Wed 2025-04-16 11:32:48.259701 CEST [XXX]
    MESSAGE=database: Executed "INSERT INTO \"history\" [...]
[...]
#
# manual service restart
#
[...]
Wed 2025-04-16 11:39:11.789801 CEST [XXX]
    MESSAGE=heartbeat: Previous heartbeat not read from channel
    ICINGADB_CURRENT=2025-04-16 11:39:11.789729945 +0200 CEST
    ICINGADB_PREVIOUS=2025-04-16 11:39:08.790338267 +0200 CEST
Wed 2025-04-16 11:39:15.753296 CEST [XXX]
    ICINGADB_ENVIRONMENT=1bed87bd19e8ebdf1a9510152bbfe2b7b978c0fa
    MESSAGE=high-availability: Preparing to take over HA as other instance's heartbeat has expired
    ICINGADB_INSTANCE_ID=eea7f5b48cb6465692006c3c926c1db2
    ICINGADB_HEARTBEAT=2025-04-16 11:23:03.155 +0200 CEST
    ICINGADB_HEARTBEAT_AGE=16m12.598267283s
Wed 2025-04-16 11:39:15.879097 CEST [XXX]
    MESSAGE=Taking over
    ICINGADB_REASON=other instance's heartbeat has expired

I also noticed in the logs that the handover to instance B only took place after a manual restart of the icingadb service there.
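As a cross-check of the takeover log entry on instance B, the heartbeat age can be recomputed from the two timestamps in the log. This is only a worked example with the values above, truncated to milliseconds. The `takeover_timeout` threshold is a hypothetical parameter, not a value confirmed by this thread.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the "Preparing to take over HA" entry on instance B.
logged_at = datetime.strptime("2025-04-16 11:39:15.753", FMT)
last_heartbeat = datetime.strptime("2025-04-16 11:23:03.155", FMT)

# Age of the other instance's last heartbeat at the time of the log entry.
age = logged_at - last_heartbeat
print(age)

# Hypothetical threshold: the real timeout used by Icinga DB is not stated here.
takeover_timeout = timedelta(minutes=1)
expired = age > takeover_timeout
print("expired:", expired)
```

The result, 16 minutes and 12.598 seconds, agrees with the logged ICINGADB_HEARTBEAT_AGE. The last recorded heartbeat of instance A (11:23:03) also predates the first "Can't update or insert instance" error at 11:24:57.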

saiiman, Apr 16 '25 20:04

Hi, I’m experiencing the same issue. Even though I run a single instance of Icinga, I’ve noticed it usually happens after backing up the database. This wasn’t an issue in previous versions.

joachim162, May 25 '25 11:05

@joachim162: This issue is related to the HA code. However, since I was not able to reproduce it yet, it is still open.

Could you please post your Icinga DB version, the relational database you are using (including version), and your Icinga DB logs prior to the crash? Thanks, this might help!

Since you mentioned backups: Are you using MySQL or MariaDB? If so, please take a look at the new Operations docs regarding backups.

oxzi, May 27 '25 06:05

I am using MariaDB and yeah, the mysqldump was missing the --single-transaction flag. There has been no Icinga DB crash since adding it. I am sorry, thank you.
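For reference, a backup script along these lines could look like the following Python sketch. The helper names `build_dump_command` and `dump_database`, and the assumption that credentials come from a client option file, are illustrative only. `--single-transaction` itself is a standard mysqldump option that dumps InnoDB tables from a consistent snapshot instead of locking them for the duration of the dump.

```python
import subprocess


def build_dump_command(database, single_transaction=True):
    """Build the mysqldump argument list for one database."""
    cmd = ["mysqldump"]
    if single_transaction:
        # Dump from a consistent snapshot rather than locking the tables.
        cmd.append("--single-transaction")
    cmd.append(database)
    return cmd


def dump_database(database, outfile):
    """Run mysqldump and write the dump to outfile.

    Assumption: credentials are supplied via a client option file.
    """
    with open(outfile, "wb") as fh:
        subprocess.run(build_dump_command(database), stdout=fh, check=True)


cmd = build_dump_command("icingadb")
print(" ".join(cmd))
```

Only `build_dump_command` is exercised here; `dump_database` would need a reachable database server and the mysqldump client installed.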

joachim162, Jun 02 '25 08:06

@joachim162: This issue is not your fault. I am glad that your Icinga DB now works and no longer crashes.

oxzi, Jun 02 '25 08:06

I too had this issue in my HA environment. The active instance suddenly wanted to hand over and then hit the "retry deadline exceeded" error 5 min after that. That message is not shown with journalctl though. I could not find any more detailed information about missing heartbeats or anything similar. We run MySQL 8.0.26 on a remote installation; Icinga runs on Rocky Linux 8.10 with Icinga DB 1.3.0.

master1:

Jun  2 08:44:47 master1 icingadb[1701445]: runtime-updates: Upserted 2 Downtime items
Jun  2 08:44:47 master1 icingadb[1701445]: runtime-updates: Upserted 1 ServiceState items
Jun  2 08:45:04 master1 icingadb[1701445]: history-sync: Synced 2 state history items
Jun  2 08:45:07 master1 icingadb[1701445]: runtime-updates: Upserted 2 ServiceState items
Jun  2 08:46:24 master1 icingadb[1701445]: Handing over
Jun  2 08:51:18 master1 icingadb[1701445]: retry deadline exceeded#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:166#012runtime.goexit#012#011runtime/asm_amd64.s:1700#012HA aborted#012github.com/icinga/icingadb/pkg/icingadb.(*HA).abort.func1#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:134#012sync.(*Once).doSlow#012#011sync/once.go:78#012sync.(*Once).Do#012#011sync/once.go:69#012github.com/icinga/icingadb/pkg/icingadb.(*HA).abort#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:132#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:166#012runtime.goexit#012#011runtime/asm_amd64.s:1700#012HA exited with an error#012main.run#012#011github.com/icinga/icingadb/cmd/icingadb/main.go:347#012main.main#012#011github.com/icinga/icingadb/cmd/icingadb/main.go:37#012runtime.main#012#011runtime/proc.go:283#012runtime.goexit#012#011runtime/asm_amd64.s:1700
Jun  2 08:51:18 master1 systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Jun  2 08:51:18 master1 systemd[1]: icingadb.service: Failed with result 'exit-code'.

master2:

Jun  2 08:43:44 master2 icingadb[3021718]: high-availability: Another instance is active
Jun  2 08:44:04 master2 icingadb[3021718]: history-sync: Synced 1 downtime history items
Jun  2 08:44:44 master2 icingadb[3021718]: history-sync: Synced 1 downtime history items
Jun  2 08:45:04 master2 icingadb[3021718]: history-sync: Synced 2 state history items
Jun  2 08:49:24 master2 icingadb[3021718]: history-sync: Synced 1 state history items
Jun  2 08:50:04 master2 icingadb[3021718]: history-sync: Synced 1 downtime history items
Jun  2 08:51:14 master2 icingadb[3021718]: retry deadline exceeded#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:166#012runtime.goexit#012#011runtime/asm_amd64.s:1700#012HA aborted#012github.com/icinga/icingadb/pkg/icingadb.(*HA).abort.func1#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:134#012sync.(*Once).doSlow#012#011sync/once.go:78#012sync.(*Once).Do#012#011sync/once.go:69#012github.com/icinga/icingadb/pkg/icingadb.(*HA).abort#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:132#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:166#012runtime.goexit#012#011runtime/asm_amd64.s:1700#012HA exited with an error#012main.run#012#011github.com/icinga/icingadb/cmd/icingadb/main.go:347#012main.main#012#011github.com/icinga/icingadb/cmd/icingadb/main.go:37#012runtime.main#012#011runtime/proc.go:283#012runtime.goexit#012#011runtime/asm_amd64.s:1700
Jun  2 08:51:14 master2 systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Jun  2 08:51:14 master2 systemd[1]: icingadb.service: Failed with result 'exit-code'.
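The `#012` and `#011` sequences in the two crash lines above look like rsyslog's control-character escaping, where a `#` is followed by the three-digit octal code of the character (012 is a newline, 011 is a tab). A small Python sketch to turn such a line back into a readable stack trace, assuming that escaping scheme; the helper name `unescape_syslog` is made up for this example:

```python
import re

# "#" followed by exactly three octal digits, e.g. "#012".
_ESCAPE = re.compile(r"#([0-7]{3})")


def unescape_syslog(line):
    """Replace #NNN octal escapes of control characters with the real character."""
    def _replace(match):
        code = int(match.group(1), 8)
        # Only decode control characters; leave anything else untouched.
        return chr(code) if code < 0x20 else match.group(0)

    return _ESCAPE.sub(_replace, line)


# Shortened excerpt of the master1 crash line from above.
sample = (
    "retry deadline exceeded"
    "#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller"
    "#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:166"
)

decoded = unescape_syslog(sample)
print(decoded)
```

A message that legitimately contains `#` followed by three octal digits below 040 would be decoded as well, so this is best applied only to lines known to contain escaped stack traces.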

Perhaps a solution might be to configure Restart functionality in systemd? If the root cause is that the mysql has been unavailable for a while? Also this issue seems related: https://github.com/Icinga/icingadb/issues/794

minatoyama, Jun 02 '25 11:06

"Perhaps a solution might be to configure Restart functionality in systemd?"

For automatic systemd unit restarts, there is #958. However, we are not sure if we want to advertise this, since this may hide real persisting issues.

"If the root cause is that the mysql has been unavailable for a while? Also this issue seems related: #794"

At least some of these reconnection timeouts were addressed with #960 and Icinga/icinga-go-library#131, resulting in Icinga DB trying to reestablish a database server connection if a prior connection was once established. Thus, if your database server was just unavailable for a few minutes, this should no longer lead to a crash as of the next Icinga DB release, 1.4.0.

"That message is not shown with journalctl though."

Here I am not quite sure what exactly you are referring to, but certain information is only logged as journald fields, as documented here. Please also note that, with the next release, at least the error messages will always be logged as part of the main message, not only as journald fields.

oxzi, Jun 12 '25 07:06

ref/NC/866982 active in icingaDB 1.4.0

carraroj, Sep 23 '25 07:09