PostDock
Killing the primary results in it not being able to rejoin
I have PostDock spun up on Kubernetes, and it's working great! However, the first test I did was to kill the first StatefulSet replica by running kubectl delete pod/postgres-0 (yes, I renamed everything).
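For reference, the whole test boils down to roughly the following (a sketch only; it assumes the renamed StatefulSet is called postgres and that repmgr has to be run as the postgres user inside the containers, which may not match your exact image setup):

kubectl exec -it postgres-2 -- su - postgres -c 'repmgr cluster show'   # state before the kill
kubectl delete pod/postgres-0                                           # kill the current primary
kubectl exec -it postgres-2 -- su - postgres -c 'repmgr cluster show'   # state after failover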
Before killing the primary:
postgres@postgres-2:/$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Connection string
----+------------+---------+-----------+------------+----------+--------------------------------------------------------------------------------------------
1000 | postgres-0 | primary | * running | | default | user=repl password=replicate host=postgres-0 dbname=replica_db port=5432 connect_timeout=2
1001 | postgres-1 | standby | running | postgres-0 | default | user=repl password=replicate host=postgres-1 dbname=replica_db port=5432 connect_timeout=2
1002 | postgres-2 | standby | running | postgres-0 | default | user=repl password=replicate host=postgres-2 dbname=replica_db port=5432 connect_timeout=2
and after the kill I get:
postgres@postgres-2:/$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Connection string
----+------------+---------+-----------+------------+----------+--------------------------------------------------------------------------------------------
1001 | postgres-1 | primary | * running | | default | user=repl password=replicate host=postgres-1 dbname=replica_db port=5432 connect_timeout=2
1002 | postgres-2 | standby | running | postgres-1 | default | user=repl password=replicate host=postgres-2 dbname=replica_db port=5432 connect_timeout=2
postgres-1 took over as primary... good! But postgres-0 is completely missing. Here are the logs from postgres-0:
❯ k logs postgres-0 -f
>>> Setting up STOP handlers...
>>> STARTING SSH (if required)...
>>> SSH is not enabled!
>>> STARTING POSTGRES...
>>> SETTING UP POLYMORPHIC VARIABLES (repmgr=3+postgres=9 | repmgr=4, postgres=10)...
>>> TUNING UP POSTGRES...
>>> Configuring /var/lib/postgresql/data/postgresql.conf
>>>>>> Will add configs to the exists file
>>>>>> Adding config 'wal_keep_segments'='250'
>>>>>> Adding config 'shared_buffers'='300MB'
>>>>>> Adding config 'archive_command'=''/bin/true''
>>>>>> Adding config 'shared_preload_libraries'=''repmgr''
>>> Check all partner nodes for common upstream node...
>>>>>> Checking NODE=postgres-0.postgres...
psql: could not connect to server: Connection refused
Is the server running on host "postgres-0.postgres" (10.40.0.13) and accepting
TCP/IP connections on port 5432?
>>>>>> Skipping: failed to get master from the node!
>>>>>> Checking NODE=postgres-1.postgres...
>>>>>>>>> Pretending master role node - postgres-0.postgres
>>>>>> Checking NODE=postgres-2.postgres...
>>>>>>>>> Pretending master role node - postgres-0.postgres
>>> Auto-detected master name: 'postgres-0.postgres'
>>> Setting up repmgr...
>>> Setting up repmgr config file '/etc/repmgr.conf'...
>>> Setting up upstream node...
cat: /var/lib/postgresql/data/standby.lock: No such file or directory
>>> Previously Locked standby upstream node LOCKED_STANDBY=''
>>> Waiting for upstream postgres server...
>>> Wait schema replica_db.repmgr on postgres-0.postgres:5432(user: repl,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
psql: could not connect to server: Connection refused
Is the server running on host "postgres-0.postgres" (10.40.0.13) and accepting
TCP/IP connections on port 5432?
[snip!]
>>>>>> Host postgres-0.postgres:5432 is not accessible (will try 2 times more)
psql: could not connect to server: Connection refused
Is the server running on host "postgres-0.postgres" (10.40.0.13) and accepting
TCP/IP connections on port 5432?
>>>>>> Host postgres-0.postgres:5432 is not accessible (will try 1 times more)
>>> Schema replica_db.repmgr is not accessible, even after 30 tries!
Then it crashes, and upon restart (ad infinitum...):
❯ k logs postgres-0 -f
>>> Setting up STOP handlers...
>>> STARTING SSH (if required)...
>>> SSH is not enabled!
>>> STARTING POSTGRES...
>>> SETTING UP POLYMORPHIC VARIABLES (repmgr=3+postgres=9 | repmgr=4, postgres=10)...
>>> TUNING UP POSTGRES...
>>> Configuring /var/lib/postgresql/data/postgresql.conf
>>>>>> Will add configs to the exists file
>>>>>> Adding config 'wal_keep_segments'='250'
>>>>>> Adding config 'shared_buffers'='300MB'
>>>>>> Adding config 'archive_command'=''/bin/true''
>>>>>> Adding config 'shared_preload_libraries'=''repmgr''
>>> Check all partner nodes for common upstream node...
>>>>>> Checking NODE=postgres-0.postgres...
psql: could not connect to server: Connection refused
Is the server running on host "postgres-0.postgres" (10.40.0.13) and accepting
TCP/IP connections on port 5432?
>>>>>> Skipping: failed to get master from the node!
>>>>>> Checking NODE=postgres-1.postgres...
>>>>>>>>> Pretending master role node - postgres-1.postgres
>>>>>> Checking NODE=postgres-2.postgres...
>>>>>>>>> Pretending master role node - postgres-1.postgres
>>> Auto-detected master name: 'postgres-1.postgres'
>>> Setting up repmgr...
>>> Setting up repmgr config file '/etc/repmgr.conf'...
>>> Setting up upstream node...
cat: /var/lib/postgresql/data/standby.lock: No such file or directory
>>> Previously Locked standby upstream node LOCKED_STANDBY=''
>>> Waiting for upstream postgres server...
>>> Wait schema replica_db.repmgr on postgres-1.postgres:5432(user: repl,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
>>>>>> Schema replica_db.repmgr exists on host postgres-1.postgres:5432!
>>> Waiting for upstream node id replica_db on postgres-1.postgres:5432(user: repl,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
>>>>>> Upstream node id 1001 received from host postgres-1.postgres:5432!
>>> REPLICATION_UPSTREAM_NODE_ID=1001
>>> Sending in background postgres start...
>>> Waiting for upstream postgres server...
>>> Wait schema replica_db.repmgr on postgres-1.postgres:5432(user: repl,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
>>>>>> Schema replica_db.repmgr exists on host postgres-1.postgres:5432!
>>> Data folder is not empty /var/lib/postgresql/data:
total 136
drwx------ 19 postgres tape 4096 Mar 23 07:31 .
drwxr-xr-x 1 postgres postgres 4096 Mar 23 07:31 ..
-rwx------ 1 postgres postgres 175 Mar 23 07:29 backup_label.old
drwx------ 7 postgres postgres 4096 Mar 23 07:17 base
drwx------ 2 postgres postgres 4096 Mar 23 07:30 global
drwx------ 2 postgres postgres 4096 Mar 23 07:17 pg_commit_ts
drwx------ 2 postgres postgres 4096 Mar 23 07:17 pg_dynshmem
-rwx------ 1 postgres postgres 4571 Mar 23 07:29 pg_hba.conf
-rwx------ 1 postgres postgres 1636 Mar 23 07:29 pg_ident.conf
drwx------ 4 postgres postgres 4096 Mar 23 07:30 pg_logical
drwx------ 4 postgres postgres 4096 Mar 23 07:17 pg_multixact
drwx------ 2 postgres postgres 4096 Mar 23 07:30 pg_notify
drwx------ 2 postgres postgres 4096 Mar 23 07:29 pg_replslot
drwx------ 2 postgres postgres 4096 Mar 23 07:17 pg_serial
drwx------ 2 postgres postgres 4096 Mar 23 07:17 pg_snapshots
drwx------ 2 postgres postgres 4096 Mar 23 07:30 pg_stat
drwx------ 2 postgres postgres 4096 Mar 23 07:30 pg_stat_tmp
drwx------ 2 postgres postgres 4096 Mar 23 07:30 pg_subtrans
drwx------ 2 postgres postgres 4096 Mar 23 07:17 pg_tblspc
drwx------ 2 postgres postgres 4096 Mar 23 07:17 pg_twophase
-rwx------ 1 postgres postgres 3 Mar 23 07:17 PG_VERSION
drwx------ 3 postgres postgres 4096 Mar 23 07:29 pg_wal
drwx------ 2 postgres postgres 4096 Mar 23 07:17 pg_xact
-rwx------ 1 postgres postgres 88 Mar 23 07:29 postgresql.auto.conf
-rwx------ 1 postgres postgres 21646 Mar 23 07:31 postgresql.conf
-rwx------ 1 postgres postgres 36 Mar 23 07:30 postmaster.opts
-rwx------ 1 postgres postgres 96 Mar 23 07:30 postmaster.pid
-rwx------ 1 postgres postgres 232 Mar 23 07:29 recovery.done
>>> Starting standby node...
>>> Instance has been set up already.
>>> Rewinding to the latest state
>>>>>> Archiving configs
The following command line errors were encountered:
unknown repmgr action 'standby archive-config'
Try "repmgr --help" for more information.
>>>>>> Start server to be able to rewind (weird hack to avoid dirty shutdown issue)
waiting for server to start....2020-03-23 07:31:21.567 UTC [189] LOG: listening on IPv4 address "127.0.0.1", port 5432
2020-03-23 07:31:21.567 UTC [189] LOG: could not bind IPv6 address "::1": Cannot assign requested address
2020-03-23 07:31:21.567 UTC [189] HINT: Is another postmaster already running on port 5432? If not, wait a few seconds and retry.
2020-03-23 07:31:21.568 UTC [189] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-03-23 07:31:21.597 UTC [190] LOG: database system was interrupted; last known up at 2020-03-23 07:30:19 UTC
2020-03-23 07:31:21.626 UTC [190] LOG: database system was not properly shut down; automatic recovery in progress
2020-03-23 07:31:21.627 UTC [190] LOG: redo starts at 0/5026958
2020-03-23 07:31:21.628 UTC [190] LOG: redo done at 0/5026990
2020-03-23 07:31:21.638 UTC [189] LOG: database system is ready to accept connections
done
server started
>>>>>> Removing unactive replication slots of partners
>>> Getting slots from master node
repmgr_slot_1002
>>> Closing slots which exist on master node
>>>>>> Closing slot repmgr_slot_1002
2020-03-23 07:31:21.809 UTC [226] ERROR: replication slot "repmgr_slot_1002" does not exist
2020-03-23 07:31:21.809 UTC [226] STATEMENT: SELECT pg_drop_replication_slot('repmgr_slot_1002')
ERROR: replication slot "repmgr_slot_1002" does not exist
>>>>>>>>> Can't close the slot!
>>> Closing slot which current master used on the node
>>>>>> Closing slot repmgr_slot_1001
2020-03-23 07:31:21.989 UTC [257] ERROR: replication slot "repmgr_slot_1001" does not exist
2020-03-23 07:31:21.989 UTC [257] STATEMENT: SELECT pg_drop_replication_slot('repmgr_slot_1001');
ERROR: replication slot "repmgr_slot_1001" does not exist
>>>>>>>>> Can't close the slot!
>>>>>> Stop server
2020-03-23 07:31:21.996 UTC [189] LOG: received fast shutdown request
waiting for server to shut down....2020-03-23 07:31:21.997 UTC [189] LOG: aborting any active transactions
2020-03-23 07:31:22.000 UTC [189] LOG: background worker "logical replication launcher" (PID 196) exited with exit code 1
2020-03-23 07:31:22.000 UTC [191] LOG: shutting down
2020-03-23 07:31:22.016 UTC [189] LOG: database system is shut down
done
server stopped
>>>>>> Creating replication slot for the standby in case it's not there
cat: /var/lib/postgresql/data/recovery.conf: No such file or directory
>>>>>> Rewind to the host postgres-1.postgres
source and target cluster are on the same timeline
no rewind required
>>>>>> Restoring configs
The following command line errors were encountered:
unknown repmgr action 'standby restore-config'
Try "repmgr --help" for more information.
>>>>>> Tell repmgr to follow upstream for the node
DEBUG: do_standby_follow()
DEBUG: connecting to: "user=repl password=replicate connect_timeout=2 dbname=replica_db host=postgres-0.postgres port=5432 fallback_application_name=repmgr"
ERROR: connection to database failed:
could not connect to server: Connection refused
Is the server running on host "postgres-0.postgres" (10.40.0.13) and accepting
TCP/IP connections on port 5432?
DETAIL: attempted to connect using:
user=repl password=replicate connect_timeout=2 dbname=replica_db host=postgres-0.postgres port=5432 fallback_application_name=repmgr
HINT: use "repmgr node rejoin" to re-add an inactive node to the replication cluster
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
>>>>>> Stop server
pg_ctl: PID file "/var/lib/postgresql/data/postmaster.pid" does not exist
Is server running?
>>> Configuring /var/lib/postgresql/data/postgresql.conf
>>>>>> Will add configs to the exists file
>>>>>> Adding config 'wal_keep_segments'='250'
>>>>>> Adding config 'shared_buffers'='300MB'
>>>>>> Adding config 'archive_command'=''/bin/true''
>>>>>> Adding config 'shared_preload_libraries'=''repmgr''
>>> Starting postgres...
>>> Waiting for local postgres server recovery if any in progress:LAUNCH_RECOVERY_CHECK_INTERVAL=30
>>> Recovery is in progress:
2020-03-23 07:31:22.308 UTC [309] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-03-23 07:31:22.308 UTC [309] LOG: listening on IPv6 address "::", port 5432
2020-03-23 07:31:22.309 UTC [309] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-03-23 07:31:22.339 UTC [320] LOG: database system was shut down at 2020-03-23 07:31:22 UTC
2020-03-23 07:31:22.345 UTC [309] LOG: database system is ready to accept connections
>>>>>> RECOVERY_WAL_ID is empty!
>>> Not in recovery state (anymore)
>>> Waiting for local postgres server start...
>>> Wait schema replica_db.public on postgres-0.postgres:5432(user: repl,password: *******), will try 60 times with delay 10 seconds (TIMEOUT=600)
>>>>>> Schema replica_db.public exists on host postgres-0.postgres:5432!
>>> Unregister the node if it was done before
DELETE 0
>>> Registering node with role standby
INFO: connecting to local node "postgres-0" (ID: 1000)
ERROR: this node should be a standby (user=repl password=replicate host=postgres-0.postgres dbname=replica_db port=5432 connect_timeout=2)
>>> Starting repmgr daemon...
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
[2020-03-23 07:31:52] [NOTICE] repmgrd (repmgr 4.0.6) starting up
[2020-03-23 07:31:52] [INFO] connecting to database "user=repl password=replicate host=postgres-0.postgres dbname=replica_db port=5432 connect_timeout=2"
[2020-03-23 07:31:52] [ERROR] this node is marked as inactive and cannot be used as a failover target
[2020-03-23 07:31:52] [HINT] Check that "repmgr (primary|standby) register" was executed for this node
[2020-03-23 07:31:52] [INFO] executing notification command for event "repmgrd_shutdown"
[2020-03-23 07:31:52] [DETAIL] command is:
/usr/local/bin/cluster/repmgr/events/router.sh 1000 repmgrd_shutdown 0 "2020-03-23 07:31:52+0000" "node is inactive and cannot be used as a failover target"
[2020-03-23 07:31:52] [INFO] repmgrd terminating...
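The HINT in the log above points at repmgr node rejoin. A manual attempt would look roughly like this (a sketch only, not something PostDock does automatically: run it as the postgres user inside the failed node's container, with its local postgres stopped, and with -d pointing at the current primary; exact options depend on the repmgr 4.x build in the image):

repmgr -f /etc/repmgr.conf node rejoin \
  -d 'host=postgres-1.postgres user=repl dbname=replica_db password=replicate connect_timeout=2' \
  --verbose --force-rewind   # --force-rewind runs pg_rewind if the timelines diverged; drop it if your repmgr version does not support it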
I am in the same boat....
I am running 2 servers: server 1 has the master and a slave, and server 2 has 2 more slaves plus the pool and backup.
When I shut down server 1, the master goes down and a slave on the second server gets promoted... But I also lose one of the slaves because it is pointing at the old master...
When I spin server 1 back up, the old master cannot rejoin because it is no longer the master, and its slave won't start because pgmaster won't start...
So now server 1 has no running databases. Server 2 has only 1 running.
How do we rejoin the original master? How do we ensure we have all 4 nodes up?
What am I missing here? All the test cases in your docs go up to the point when a node is promoted... but there is nothing about the steps to take when putting things back together.
Any information surrounding this is greatly appreciated! I will keep hacking at it...
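For the "are all nodes up" part, repmgr itself can report that from inside any healthy pod (a sketch, run as the postgres user; both subcommands exist in repmgr 4.x):

repmgr -f /etc/repmgr.conf cluster show   # role, status and upstream of every registered node
repmgr -f /etc/repmgr.conf node check     # local checks on the node you are on: role, replication status, lag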
I am also experiencing this issue. I don't think I have anything to add that @yee379 hasn't already covered. @yee379 it's been over a year since you opened this issue, did you ever find a solution or work-around?
EDIT: For anyone else experiencing this issue, I've found that removing the contents of the failed master's pg-data dir allows the failed master to start up and join the cluster again.
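A rough sketch of that workaround on Kubernetes (assuming the default data directory /var/lib/postgresql/data and the pod names used above; this throws away the old primary's local data, so only do it once the new primary is confirmed healthy):

kubectl exec postgres-0 -- bash -c 'rm -rf /var/lib/postgresql/data/*'   # may need to catch the pod between crash-loop restarts
kubectl delete pod/postgres-0                                            # the recreated pod re-clones itself from the current primary

If the pod never stays up long enough to exec into, deleting the node's PersistentVolumeClaim (the name depends on your volumeClaimTemplates) before deleting the pod achieves the same result.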
@paunin it looks like PR #261 might fix this.