repmgr

Published 20 hours ago •


master node fails to automatically rejoin the cluster after recovery from failure

Open nuowei2543 opened this issue 10 months ago • 1 comments

Hello. While simulating a host failover, I stopped the PostgreSQL instance on the master host, and the standby node was successfully promoted to become the new master. However, when I restarted the original master, it did not automatically rejoin the cluster as a standby node.

Versions: Ubuntu 20.04, PostgreSQL 16.2, repmgrd 5.4.1

1. postgres@ser-compute-01:/disk1/postgresql/repmgr$ repmgr -f /disk1/postgresql/repmgr/repmgr.conf cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 3        | host=10.0.14.100 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 3        | host=10.0.14.101 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 3  | node3 | witness | * running | node1    | default  | 0        | n/a      | host=10.0.14.109 port=5432 user=repmgr dbname=repmgr connect_timeout=2

2. On node1, run: supervisorctl stop postgresql

3. postgres@ser-compute-02:~$ repmgr -f /disk1/postgresql/repmgr/repmgr.conf cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
 1  | node1 | primary | - failed  | ?        | default  | 100      |          | host=10.0.14.100 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 2        | host=10.0.14.101 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 3  | node3 | witness | * running | node2    | default  | 0        | n/a      | host=10.0.14.109 port=5432 user=repmgr dbname=repmgr connect_timeout=2

4. On node1, run: supervisorctl start postgresql

5. postgres@ser-compute-02:/disk1/postgresql/repmgr$ repmgr -f /disk1/postgresql/repmgr/repmgr.conf cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
 1  | node1 | primary | ! running |          | default  | 100      | 1        | host=10.0.14.100 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 2        | host=10.0.14.101 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 3  | node3 | witness | * running | node2    | default  | 0        | n/a      | host=10.0.14.109 port=5432 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected

node "node1" (ID: 1) is running but the repmgr node record is inactive

So, I don't know why node1 is still the primary.
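
For reference, the state in step 5 (two rows with role "primary" and a status containing "running") can be checked mechanically. This is only a minimal sketch that assumes the tabular `cluster show` layout pasted above; the function name `count_running_primaries` is hypothetical and not part of repmgr.

```shell
#!/bin/sh
# Hypothetical helper (not part of repmgr): reads `repmgr cluster show`
# table output on stdin and prints how many rows have Role "primary"
# and a Status column containing "running".
count_running_primaries() {
    awk -F'|' '
        NF >= 4 {
            role = $3
            status = $4
            # Trim the padding around the Role column before comparing.
            gsub(/^[ \t]+|[ \t]+$/, "", role)
            if (role == "primary" && status ~ /running/) n++
        }
        END { print n + 0 }
    '
}

# Possible usage, with the config path from the post:
#   repmgr -f /disk1/postgresql/repmgr/repmgr.conf cluster show | count_running_primaries
```

A result greater than 1 would correspond to the situation in step 5, where node1 and node2 both report themselves as primary.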

Apr 11 '24 03:04 nuowei2543

Hi, there is no built-in automatic rejoin. By just starting the old master again, you create a split-brain scenario: both node1 and node2 run as primary. But it's no problem to rejoin the old master automatically via a script after the new one has been promoted.
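
A rough sketch of what such a script might look like, run on the old master (node1) as the postgres user. It is built around `repmgr node rejoin`, and several things here are assumptions rather than details from this thread: that PostgreSQL on node1 is managed by supervisor under the program name `postgresql` (as in the steps above), that `wal_log_hints` or data checksums are enabled so `--force-rewind` can use pg_rewind, and that `service_start_command` in repmgr.conf matches how the instance is started. The names `run`, `rejoin_old_primary` and `ECHO_ONLY` are hypothetical.

```shell
#!/bin/sh
# Sketch only: rejoin a former primary as a standby of the new primary.
# Paths and conninfo are taken from the cluster output in this thread.
REPMGR_CONF="${REPMGR_CONF:-/disk1/postgresql/repmgr/repmgr.conf}"
NEW_PRIMARY_CONNINFO="${NEW_PRIMARY_CONNINFO:-host=10.0.14.101 port=5432 user=repmgr dbname=repmgr connect_timeout=2}"

# Hypothetical wrapper: with ECHO_ONLY=1, print the command instead of running it.
run() {
    if [ "${ECHO_ONLY:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

rejoin_old_primary() {
    # The instance being rejoined has to be shut down cleanly first.
    run supervisorctl stop postgresql || return 1

    # Let repmgr check the prerequisites without changing anything.
    run repmgr -f "$REPMGR_CONF" node rejoin \
        -d "$NEW_PRIMARY_CONNINFO" --force-rewind --dry-run || return 1

    # Perform the rejoin; pg_rewind is used if the timelines have diverged.
    run repmgr -f "$REPMGR_CONF" node rejoin \
        -d "$NEW_PRIMARY_CONNINFO" --force-rewind --verbose
}

# Possible usage:
#   ECHO_ONLY=1; rejoin_old_primary   # only print the commands
#   rejoin_old_primary                # stop PostgreSQL and rejoin
```

Whether this is triggered by hand or from some hook is a separate design choice; the sketch only covers the rejoin step itself.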

Apr 25 '24 06:04 stephan-hahn