PostDock icon indicating copy to clipboard operation
PostDock copied to clipboard

killing primary results in it not being able to rejoin

Open yee379 opened this issue 4 years ago • 3 comments

i have postdock spun up on kubernetes. its working great! however, the first test i did was to kill the first statefulset replica by running kubectl delete pod/postgres-0 (yeah, i renamed everything).

so before the killing of the primary:

postgres@postgres-2:/$ repmgr cluster show
 ID | Name       | Role    | Status    | Upstream   | Location | Connection string
----+------------+---------+-----------+------------+----------+--------------------------------------------------------------------------------------------
 1000 | postgres-0 | primary | * running |            | default  | user=repl password=replicate host=postgres-0 dbname=replica_db port=5432 connect_timeout=2
 1001 | postgres-1 | standby |   running | postgres-0 | default  | user=repl password=replicate host=postgres-1 dbname=replica_db port=5432 connect_timeout=2
 1002 | postgres-2 | standby |   running | postgres-0 | default  | user=repl password=replicate host=postgres-2 dbname=replica_db port=5432 connect_timeout=2

and then after i get

postgres@postgres-2:/$ repmgr cluster show
 ID | Name       | Role    | Status    | Upstream   | Location | Connection string
----+------------+---------+-----------+------------+----------+--------------------------------------------------------------------------------------------
 1001 | postgres-1 | primary | * running |            | default  | user=repl password=replicate host=postgres-1 dbname=replica_db port=5432 connect_timeout=2
 1002 | postgres-2 | standby |   running | postgres-1 | default  | user=repl password=replicate host=postgres-2 dbname=replica_db port=5432 connect_timeout=2

postres-1 took over primary... good! but postgres-0 is completely missing... the logs from postgres-0

❯ k logs postgres-0 -f
>>> Setting up STOP handlers...
>>> STARTING SSH (if required)...
>>> SSH is not enabled!
>>> STARTING POSTGRES...
>>> SETTING UP POLYMORPHIC VARIABLES (repmgr=3+postgres=9 | repmgr=4, postgres=10)...
>>> TUNING UP POSTGRES...
>>> Configuring /var/lib/postgresql/data/postgresql.conf
>>>>>> Will add configs to the exists file
>>>>>> Adding config 'wal_keep_segments'='250'
>>>>>> Adding config 'shared_buffers'='300MB'
>>>>>> Adding config 'archive_command'=''/bin/true''
>>>>>> Adding config 'shared_preload_libraries'=''repmgr''
>>> Check all partner nodes for common upstream node...
>>>>>> Checking NODE=postgres-0.postgres...
psql: could not connect to server: Connection refused
	Is the server running on host "postgres-0.postgres" (10.40.0.13) and accepting
	TCP/IP connections on port 5432?
>>>>>> Skipping: failed to get master from the node!
>>>>>> Checking NODE=postgres-1.postgres...
>>>>>>>>> Pretending master role node - postgres-0.postgres
>>>>>> Checking NODE=postgres-2.postgres...
>>>>>>>>> Pretending master role node - postgres-0.postgres
>>> Auto-detected master name: 'postgres-0.postgres'
>>> Setting up repmgr...
>>> Setting up repmgr config file '/etc/repmgr.conf'...
>>> Setting up upstream node...
cat: /var/lib/postgresql/data/standby.lock: No such file or directory
>>> Previously Locked standby upstream node LOCKED_STANDBY=''
>>> Waiting for upstream postgres server...
>>> Wait schema replica_db.repmgr on postgres-0.postgres:5432(user: repl,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
psql: could not connect to server: Connection refused
	Is the server running on host "postgres-0.postgres" (10.40.0.13) and accepting
	TCP/IP connections on port 5432?
[snip!]
>>>>>> Host postgres-0.postgres:5432 is not accessible (will try 2 times more)
psql: could not connect to server: Connection refused
	Is the server running on host "postgres-0.postgres" (10.40.0.13) and accepting
	TCP/IP connections on port 5432?
>>>>>> Host postgres-0.postgres:5432 is not accessible (will try 1 times more)
>>> Schema replica_db.repmgr is not accessible, even after 30 tries!

Then it crashes and upon restart (ad infinitum...)

❯ k logs postgres-0 -f
>>> Setting up STOP handlers...
>>> STARTING SSH (if required)...
>>> SSH is not enabled!
>>> STARTING POSTGRES...
>>> SETTING UP POLYMORPHIC VARIABLES (repmgr=3+postgres=9 | repmgr=4, postgres=10)...
>>> TUNING UP POSTGRES...
>>> Configuring /var/lib/postgresql/data/postgresql.conf
>>>>>> Will add configs to the exists file
>>>>>> Adding config 'wal_keep_segments'='250'
>>>>>> Adding config 'shared_buffers'='300MB'
>>>>>> Adding config 'archive_command'=''/bin/true''
>>>>>> Adding config 'shared_preload_libraries'=''repmgr''
>>> Check all partner nodes for common upstream node...
>>>>>> Checking NODE=postgres-0.postgres...
psql: could not connect to server: Connection refused
	Is the server running on host "postgres-0.postgres" (10.40.0.13) and accepting
	TCP/IP connections on port 5432?
>>>>>> Skipping: failed to get master from the node!
>>>>>> Checking NODE=postgres-1.postgres...
>>>>>>>>> Pretending master role node - postgres-1.postgres
>>>>>> Checking NODE=postgres-2.postgres...
>>>>>>>>> Pretending master role node - postgres-1.postgres
>>> Auto-detected master name: 'postgres-1.postgres'
>>> Setting up repmgr...
>>> Setting up repmgr config file '/etc/repmgr.conf'...
>>> Setting up upstream node...
cat: /var/lib/postgresql/data/standby.lock: No such file or directory
>>> Previously Locked standby upstream node LOCKED_STANDBY=''
>>> Waiting for upstream postgres server...
>>> Wait schema replica_db.repmgr on postgres-1.postgres:5432(user: repl,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
>>>>>> Schema replica_db.repmgr exists on host postgres-1.postgres:5432!
>>> Waiting for upstream node id replica_db on postgres-1.postgres:5432(user: repl,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
>>>>>> Upstream node id 1001 received from host postgres-1.postgres:5432!
>>> REPLICATION_UPSTREAM_NODE_ID=1001
>>> Sending in background postgres start...
>>> Waiting for upstream postgres server...
>>> Wait schema replica_db.repmgr on postgres-1.postgres:5432(user: repl,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
>>>>>> Schema replica_db.repmgr exists on host postgres-1.postgres:5432!
>>> Data folder is not empty /var/lib/postgresql/data:
total 136
drwx------ 19 postgres tape      4096 Mar 23 07:31 .
drwxr-xr-x  1 postgres postgres  4096 Mar 23 07:31 ..
-rwx------  1 postgres postgres   175 Mar 23 07:29 backup_label.old
drwx------  7 postgres postgres  4096 Mar 23 07:17 base
drwx------  2 postgres postgres  4096 Mar 23 07:30 global
drwx------  2 postgres postgres  4096 Mar 23 07:17 pg_commit_ts
drwx------  2 postgres postgres  4096 Mar 23 07:17 pg_dynshmem
-rwx------  1 postgres postgres  4571 Mar 23 07:29 pg_hba.conf
-rwx------  1 postgres postgres  1636 Mar 23 07:29 pg_ident.conf
drwx------  4 postgres postgres  4096 Mar 23 07:30 pg_logical
drwx------  4 postgres postgres  4096 Mar 23 07:17 pg_multixact
drwx------  2 postgres postgres  4096 Mar 23 07:30 pg_notify
drwx------  2 postgres postgres  4096 Mar 23 07:29 pg_replslot
drwx------  2 postgres postgres  4096 Mar 23 07:17 pg_serial
drwx------  2 postgres postgres  4096 Mar 23 07:17 pg_snapshots
drwx------  2 postgres postgres  4096 Mar 23 07:30 pg_stat
drwx------  2 postgres postgres  4096 Mar 23 07:30 pg_stat_tmp
drwx------  2 postgres postgres  4096 Mar 23 07:30 pg_subtrans
drwx------  2 postgres postgres  4096 Mar 23 07:17 pg_tblspc
drwx------  2 postgres postgres  4096 Mar 23 07:17 pg_twophase
-rwx------  1 postgres postgres     3 Mar 23 07:17 PG_VERSION
drwx------  3 postgres postgres  4096 Mar 23 07:29 pg_wal
drwx------  2 postgres postgres  4096 Mar 23 07:17 pg_xact
-rwx------  1 postgres postgres    88 Mar 23 07:29 postgresql.auto.conf
-rwx------  1 postgres postgres 21646 Mar 23 07:31 postgresql.conf
-rwx------  1 postgres postgres    36 Mar 23 07:30 postmaster.opts
-rwx------  1 postgres postgres    96 Mar 23 07:30 postmaster.pid
-rwx------  1 postgres postgres   232 Mar 23 07:29 recovery.done
>>> Starting standby node...
>>> Instance has been set up already.
>>> Rewinding to the latest state
>>>>>> Archiving configs
The following command line errors were encountered:
  unknown repmgr action 'standby archive-config'
Try "repmgr --help" for more information.
>>>>>> Start server to be able to rewind (weird hack to avoid dirty shutdown issue)
waiting for server to start....2020-03-23 07:31:21.567 UTC [189] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2020-03-23 07:31:21.567 UTC [189] LOG:  could not bind IPv6 address "::1": Cannot assign requested address
2020-03-23 07:31:21.567 UTC [189] HINT:  Is another postmaster already running on port 5432? If not, wait a few seconds and retry.
2020-03-23 07:31:21.568 UTC [189] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-03-23 07:31:21.597 UTC [190] LOG:  database system was interrupted; last known up at 2020-03-23 07:30:19 UTC
2020-03-23 07:31:21.626 UTC [190] LOG:  database system was not properly shut down; automatic recovery in progress
2020-03-23 07:31:21.627 UTC [190] LOG:  redo starts at 0/5026958
2020-03-23 07:31:21.628 UTC [190] LOG:  redo done at 0/5026990
2020-03-23 07:31:21.638 UTC [189] LOG:  database system is ready to accept connections
 done
server started
>>>>>> Removing unactive replication slots of partners
>>> Getting slots from master node
repmgr_slot_1002
>>> Closing slots which exist on master node
>>>>>> Closing slot repmgr_slot_1002
2020-03-23 07:31:21.809 UTC [226] ERROR:  replication slot "repmgr_slot_1002" does not exist
2020-03-23 07:31:21.809 UTC [226] STATEMENT:  SELECT pg_drop_replication_slot('repmgr_slot_1002')
ERROR:  replication slot "repmgr_slot_1002" does not exist
>>>>>>>>> Can't close the slot!
>>> Closing slot which current master used on the node
>>>>>> Closing slot repmgr_slot_1001
2020-03-23 07:31:21.989 UTC [257] ERROR:  replication slot "repmgr_slot_1001" does not exist
2020-03-23 07:31:21.989 UTC [257] STATEMENT:  SELECT pg_drop_replication_slot('repmgr_slot_1001');
ERROR:  replication slot "repmgr_slot_1001" does not exist
>>>>>>>>> Can't close the slot!
>>>>>> Stop server
2020-03-23 07:31:21.996 UTC [189] LOG:  received fast shutdown request
waiting for server to shut down....2020-03-23 07:31:21.997 UTC [189] LOG:  aborting any active transactions
2020-03-23 07:31:22.000 UTC [189] LOG:  background worker "logical replication launcher" (PID 196) exited with exit code 1
2020-03-23 07:31:22.000 UTC [191] LOG:  shutting down
2020-03-23 07:31:22.016 UTC [189] LOG:  database system is shut down
 done
server stopped
>>>>>> Creating replication slot for the standby in case it's not there
cat: /var/lib/postgresql/data/recovery.conf: No such file or directory
>>>>>> Rewind to the host postgres-1.postgres
source and target cluster are on the same timeline
no rewind required
>>>>>> Restoring configs
The following command line errors were encountered:
  unknown repmgr action 'standby restore-config'
Try "repmgr --help" for more information.
>>>>>> Tell repmgr to follow upstream for the node
DEBUG: do_standby_follow()
DEBUG: connecting to: "user=repl password=replicate connect_timeout=2 dbname=replica_db host=postgres-0.postgres port=5432 fallback_application_name=repmgr"
ERROR: connection to database failed:
  could not connect to server: Connection refused
	Is the server running on host "postgres-0.postgres" (10.40.0.13) and accepting
	TCP/IP connections on port 5432?

DETAIL: attempted to connect using:
  user=repl password=replicate connect_timeout=2 dbname=replica_db host=postgres-0.postgres port=5432 fallback_application_name=repmgr
HINT: use "repmgr node rejoin" to re-add an inactive node to the replication cluster
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
>>>>>> Stop server
pg_ctl: PID file "/var/lib/postgresql/data/postmaster.pid" does not exist
Is server running?
>>> Configuring /var/lib/postgresql/data/postgresql.conf
>>>>>> Will add configs to the exists file
>>>>>> Adding config 'wal_keep_segments'='250'
>>>>>> Adding config 'shared_buffers'='300MB'
>>>>>> Adding config 'archive_command'=''/bin/true''
>>>>>> Adding config 'shared_preload_libraries'=''repmgr''
>>> Starting postgres...
>>> Waiting for local postgres server recovery if any in progress:LAUNCH_RECOVERY_CHECK_INTERVAL=30
>>> Recovery is in progress:
2020-03-23 07:31:22.308 UTC [309] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2020-03-23 07:31:22.308 UTC [309] LOG:  listening on IPv6 address "::", port 5432
2020-03-23 07:31:22.309 UTC [309] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-03-23 07:31:22.339 UTC [320] LOG:  database system was shut down at 2020-03-23 07:31:22 UTC
2020-03-23 07:31:22.345 UTC [309] LOG:  database system is ready to accept connections
>>>>>> RECOVERY_WAL_ID is empty!
>>> Not in recovery state (anymore)
>>> Waiting for local postgres server start...
>>> Wait schema replica_db.public on postgres-0.postgres:5432(user: repl,password: *******), will try 60 times with delay 10 seconds (TIMEOUT=600)
>>>>>> Schema replica_db.public exists on host postgres-0.postgres:5432!
>>> Unregister the node if it was done before
DELETE 0
>>> Registering node with role standby
INFO: connecting to local node "postgres-0" (ID: 1000)
ERROR: this node should be a standby (user=repl password=replicate host=postgres-0.postgres dbname=replica_db port=5432 connect_timeout=2)
>>> Starting repmgr daemon...
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
[2020-03-23 07:31:52] [NOTICE] repmgrd (repmgr 4.0.6) starting up
[2020-03-23 07:31:52] [INFO] connecting to database "user=repl password=replicate host=postgres-0.postgres dbname=replica_db port=5432 connect_timeout=2"
[2020-03-23 07:31:52] [ERROR] this node is marked as inactive and cannot be used as a failover target
[2020-03-23 07:31:52] [HINT] Check that "repmgr (primary|standby) register" was executed for this node
[2020-03-23 07:31:52] [INFO] executing notification command for event "repmgrd_shutdown"
[2020-03-23 07:31:52] [DETAIL] command is:
  /usr/local/bin/cluster/repmgr/events/router.sh 1000 repmgrd_shutdown 0 "2020-03-23 07:31:52+0000" "node is inactive and cannot be used as a failover target"
[2020-03-23 07:31:52] [INFO] repmgrd terminating...

yee379 avatar Mar 23 '20 08:03 yee379

I am in the same boat....

I am running 2 servers 1 server has the master and a slave. The second server has 2 more slaves and the pool and backup.

When I shutdown server 1. Master goes down and a slave on the second server get's promoted... But I also lose one of the slaves because it is pointing at the old master...

I spool up server 1 Master cannot join because it is no longer master. The slave wont' start because pgmaster won't start...

So now server 1 has no running databases. Server 2 has only 1 running.

How do we rejoin the original master? How do we ensure we have all 4 nodes up?

What am I missing here? All the the test cases in your docs show to the point when a node is promoted... But nothing for the steps to take when putting things back up.

Any information surrounding this is greatly appreciated! I will keep hacking at it...

webdobe avatar Feb 24 '21 22:02 webdobe

I am also experiencing this issue. I don't think i have anything to add that @yee379 hasn't already covered. @yee379 its been over a year since you opened this issue, did you ever find a solution or work-around?

EDIT: For anyone else experiencing this issue, i've found removing the contents of the failed master's pg-data dir allows the failed master to startup and join the cluster again.

smokes2345 avatar Jul 08 '21 15:07 smokes2345

@paunin it looks like PR #261 might fix this.

smokes2345 avatar Jul 08 '21 15:07 smokes2345