repmgr icon indicating copy to clipboard operation
repmgr copied to clipboard

master node fails to automatically rejoin the cluster after recovery from failure

Open nuowei2543 opened this issue 10 months ago • 1 comments

Hello, during my simulation of host failover, I stopped the master host's PostgreSQL instance, and the standby node successfully switched to become the new master node. However, when I restarted the original master node, it did not automatically rejoin the cluster as a standby node. version: ubuntu:20.4 postgresql:16.2 repmgrd:5.4.1

1、 postgres@ser-compute-01:/disk1/postgresql/repmgr$ repmgr -f /disk1/postgresql/repmgr/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------ 1 | node1 | primary | * running | | default | 100 | 3 | host=10.0.14.100 port=5432 user=repmgr dbname=repmgr connect_timeout=2 2 | node2 | standby | running | node1 | default | 100 | 3 | host=10.0.14.101 port=5432 user=repmgr dbname=repmgr connect_timeout=2 3 | node3 | witness | * running | node1 | default | 0 | n/a | host=10.0.14.109 port=5432 user=repmgr dbname=repmgr connect_timeout=2

2、on node1 execute command supervisorctl stop postgresql

3、postgres@ser-compute-02:~$ repmgr -f /disk1/postgresql/repmgr/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------ 1 | node1 | primary | - failed | ? | default | 100 | | host=10.0.14.100 port=5432 user=repmgr dbname=repmgr connect_timeout=2 2 | node2 | primary | * running | | default | 100 | 2 | host=10.0.14.101 port=5432 user=repmgr dbname=repmgr connect_timeout=2 3 | node3 | witness | * running | node2 | default | 0 | n/a | host=10.0.14.109 port=5432 user=repmgr dbname=repmgr connect_timeout=2

4、on node1 execute command supervisorctl startpostgresql

5、postgres@ser-compute-02:/disk1/postgresql/repmgr$ repmgr -f /disk1/postgresql/repmgr/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------ 1 | node1 | primary | ! running | | default | 100 | 1 | host=10.0.14.100 port=5432 user=repmgr dbname=repmgr connect_timeout=2 2 | node2 | primary | * running | | default | 100 | 2 | host=10.0.14.101 port=5432 user=repmgr dbname=repmgr connect_timeout=2 3 | node3 | witness | * running | node2 | default | 0 | n/a | host=10.0.14.109 port=5432 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected

  • node "node1" (ID: 1) is running but the repmgr node record is inactive

So, I don't know why node1 is still the primary.

nuowei2543 avatar Apr 11 '24 03:04 nuowei2543

Hi, there is no inbuilt automatic rejoin. By just starting the old master again, you create a split brain scenario. But it's no problem to automatically rejoin the old master after promoting the new one via script.

stephan-hahn avatar Apr 25 '24 06:04 stephan-hahn