PostDock
pgmaster fails after restart
I am using docker-compose with 1 pgmaster and 1 pgslave.
After all containers have started, I stop pgmaster with docker stop.
When I then run docker start on pgmaster, it fails.
The same happens when I run docker-compose restart.
I basically just removed pgslave2-4 from postgres-10_repmgr-4.0_pgpool-3.7_barman-2.4.yml.
Is any additional configuration needed?
pgmaster logs after restart:
pgmaster_1 | >>>>>> RECOVERY_WAL_ID is empty!
pgmaster_1 | >>> Not in recovery state (anymore)
pgmaster_1 | >>> Waiting for local postgres server start...
pgmaster_1 | >>> Wait schema replication_db.public on pgmaster:5432(user: replication_user,password: *******), will try 9 times with delay 10 seconds (TIMEOUT=90)
pgmaster_1 | >>>>>> Schema replication_db.public exists on host pgmaster:5432!
pgmaster_1 | >>> Unregister the node if it was done before
pgmaster_1 | DELETE 0
pgmaster_1 | >>> Registering node with role standby
pgmaster_1 | INFO: connecting to local node "node1" (ID: 1)
pgmaster_1 | ERROR: this node should be a standby (user=replication_user password=replication_pass host=pgmaster dbname=replication_db port=5432 connect_timeout=2)
pgmaster_1 | >>> Starting repmgr daemon...
pgmaster_1 | [2018-07-12 08:26:46] [NOTICE] repmgrd (repmgr 4.0.6) starting up
pgmaster_1 | [2018-07-12 08:26:46] [INFO] connecting to database "user=replication_user password=replication_pass host=pgmaster dbname=replication_db port=5432 connect_timeout=2"
pgmaster_1 | INFO: looking for configuration file in /etc
pgmaster_1 | INFO: configuration file found at: "/etc/repmgr.conf"
pgmaster_1 | [2018-07-12 08:26:46] [ERROR] this node is marked as inactive and cannot be used as a failover target
pgmaster_1 | [2018-07-12 08:26:46] [HINT] Check that "repmgr (primary|standby) register" was executed for this node
pgmaster_1 | [2018-07-12 08:26:46] [INFO] executing notification command for event "repmgrd_shutdown"
pgmaster_1 | [2018-07-12 08:26:46] [DETAIL] command is:
pgmaster_1 | /usr/local/bin/cluster/repmgr/events/router.sh 1 repmgrd_shutdown 0 "2018-07-12 08:26:46+0000" "node is inactive and cannot be used as a failover target"
pgmaster_1 | [2018-07-12 08:26:46] [INFO] repmgrd terminating...
pgmaster_1 | >>> Foreground processes returned code: '1'
dev_pgmaster_1 exited with code 0
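To check quickly whether a restart hit this same failure, the log can be searched for the registration error shown above. This is only a minimal sketch: has_standby_register_failure is a hypothetical helper name, and the match string is copied from the log in this report.

```shell
#!/bin/sh
# Hypothetical helper: read container logs on stdin and report whether the
# "this node should be a standby" registration error from this issue appears.
# Exit status 0 = signature found, 1 = not found.
has_standby_register_failure() {
    grep -q 'ERROR: this node should be a standby'
}
```

Possible usage, assuming the service is called pgmaster as in the compose file: docker-compose logs pgmaster | has_standby_register_failure && echo "hit the standby registration failure"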
Hello.
Same issue. It works with postgres-10_pgpool-3.7_barman-2.4.yml.
I found some more errors in the log of the new build (postgres-10_repmgr-4.0_pgpool-3.7_barman-2.4.yml) when restarting the master. Maybe they are related.
- During the command (in file do_rewind) "gosu postgres repmgr standby archive-config --config-archive-dir=/tmp/repmgr-archive"
>>>>>> Archiving configs
The following command line errors were encountered:
unknown repmgr action 'standby archive-config'
Try "repmgr --help" for more information.
With postgres-10_pgpool-3.7_barman-2.4.yml the log was:
>>>>>> Archiving configs
ERROR: connection to database failed: could not connect to server: Connection refused
Is the server running on host "pgmaster" (172.19.0.4) and accepting
TCP/IP connections on port 5432?
- During the command (in file do_rewind) "gosu postgres repmgr standby restore-config -D $PGDATA --config-archive-dir=/tmp/repmgr-archive"
>>>>>> Restoring configs
The following command line errors were encountered:
unknown repmgr action 'standby restore-config'
Try "repmgr --help" for more information.
- During the command (in file do_rewind) "gosu postgres repmgr -h $CURRENT_REPLICATION_PRIMARY_HOST -p $REPLICATION_PRIMARY_PORT -d $REPLICATION_DB -U $REPLICATION_USER -D $PGDATA standby follow -W --log-level DEBUG --verbose"
>>>>>> Tell repmgr to follow upstream for the node
DEBUG: do_standby_follow()
DEBUG: connecting to: "user=replication_user password=replication_pass connect_timeout=2 dbname=replication_db host=pgmaster port=5432 fallback_application_name=repmgr"
ERROR: connection to database failed:
could not connect to server: Connection refused
Is the server running on host "pgmaster" (172.19.0.2) and accepting
TCP/IP connections on port 5432?
DETAIL: attempted to connect using:
user=replication_user password=replication_pass connect_timeout=2 dbname=replication_db host=pgmaster port=5432 fallback_application_name=repmgr
HINT: use "repmgr node rejoin" to re-add an inactive node to the replication cluster
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
- The do_rewind file executes the command "rm -rf $PGDATA/pg_xlog/archive_status/", but in Postgres 10 the pg_xlog folder no longer exists. It has been renamed to pg_wal.
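One way a script could cope with the pg_xlog to pg_wal rename is to look at which directory actually exists under the data directory. The following is a minimal sketch, not the project's fix: wal_dir_name is a hypothetical helper name, and it only assumes the two directory names mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: print the name of the WAL directory that exists under
# the given data directory ("pg_wal" on Postgres 10+, "pg_xlog" before that).
# Returns non-zero if neither directory is present.
wal_dir_name() {
    data_dir="$1"
    if [ -d "$data_dir/pg_wal" ]; then
        echo "pg_wal"
    elif [ -d "$data_dir/pg_xlog" ]; then
        echo "pg_xlog"
    else
        return 1
    fi
}
```

With such a helper, the cleanup line could be written as rm -rf "$PGDATA/$(wal_dir_name "$PGDATA")/archive_status/" so that the same script works on both layouts.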
+1
Same issue here. I fixed it with the following process.
In this example the ex-primary node is pgmaster and the newly elected primary node is pgslave3.
- On pgmaster, remove everything under /var/lib/postgresql/data/ and restart it.
- Attach pgmaster to pgpool using pcp_attach_node -h localhost -U pcp_user 0, where 0 is the pgmaster node id.
- Change the barman service's REPLICATION_HOST environment variable to point to pgslave3.
- Clean the barman replication slot on pgmaster using: SELECT pg_drop_replication_slot('barman_the_backupper')
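The steps above can be collected into a small checklist generator so they are not retyped by hand after every failover. This is only a sketch: print_recovery_plan is a hypothetical name, it prints the steps instead of executing anything, and where each command has to be run (which container, which database user) is left to the reader because the comment above does not say.

```shell
#!/bin/sh
# Hypothetical helper: print the manual recovery steps for review.
# Nothing is executed; the commands are the ones quoted in the comment above.
#   $1 = ex-primary node (e.g. pgmaster)
#   $2 = newly elected primary (e.g. pgslave3)
#   $3 = pgpool node id of the ex-primary (e.g. 0)
print_recovery_plan() {
    old_primary="$1"
    new_primary="$2"
    pgpool_node_id="$3"
    cat <<EOF
1. On $old_primary: remove everything under /var/lib/postgresql/data/ and restart it
2. On pgpool: pcp_attach_node -h localhost -U pcp_user $pgpool_node_id
3. On barman: set REPLICATION_HOST=$new_primary
4. On $old_primary (SQL): SELECT pg_drop_replication_slot('barman_the_backupper');
EOF
}
```

For the example in this comment the call would be print_recovery_plan pgmaster pgslave3 0.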
Hi. Same issue here. Sample scenario:
- docker-compose up -d:
version: "3.3"
services:
  pgmaster:
    build:
      context: ./src
      dockerfile: Postgres-10-Repmgr-4.0.Dockerfile
    environment:
      NODE_ID: 1 # Integer number of node (not required if can be extracted from NODE_NAME var, e.g. node-45 => 1045)
      NODE_NAME: node1 # Node name
      CLUSTER_NODE_NETWORK_NAME: pgmaster # (default: hostname of the node)
      PARTNER_NODES: "pgmaster,pgslave1,pgslave2"
      REPLICATION_PRIMARY_HOST: pgmaster # That should be ignored on the same node
      NODE_PRIORITY: 100 # (default: 100)
      #database we want to use for application
      POSTGRES_PASSWORD: monkey_pass
      POSTGRES_USER: monkey_user
      POSTGRES_DB: monkey_db
      CLEAN_OVER_REWIND: 0
      CONFIGS_DELIMITER_SYMBOL: ;
      CONFIGS: "listen_addresses:'*';max_replication_slots:5;wal_keep_segments:250;shared_buffers:300MB;archive_command:'/bin/true'"
      # in format variable1:value1[,variable2:value2[,...]] if CONFIGS_DELIMITER_SYMBOL=, and CONFIGS_ASSIGNMENT_SYMBOL=:
      # used for pgpool.conf file
      #defaults:
      CLUSTER_NAME: pg_cluster # default is pg_cluster
      REPLICATION_DB: replication_db # default is replication_db
      REPLICATION_USER: replication_user # default is replication_user
      REPLICATION_PASSWORD: replication_pass # default is replication_pass
      REPMGR_WAIT_POSTGRES_START_TIMEOUT: 600
    expose:
      - 5432
    volumes:
      - pgmaster:/var/lib/postgresql/data
    networks:
      cluster:
        aliases:
          - pgmaster
  pgslave1:
    build:
      context: ./src
      dockerfile: Postgres-10-Repmgr-4.0.Dockerfile
    environment:
      NODE_ID: 2
      NODE_NAME: node2
      CLUSTER_NODE_NETWORK_NAME: pgslave1 # (default: hostname of the node)
      PARTNER_NODES: "pgmaster,pgslave1,pgslave2"
      REPLICATION_PRIMARY_HOST: pgmaster
      NODE_PRIORITY: 200
      CLEAN_OVER_REWIND: 1
      REPMGR_WAIT_POSTGRES_START_TIMEOUT: 600
    expose:
      - 5432
    volumes:
      - pgslave1:/var/lib/postgresql/data
    networks:
      cluster:
        aliases:
          - pgslave1
  pgslave2:
    build:
      context: ./src
      dockerfile: Postgres-10-Repmgr-4.0.Dockerfile
    environment:
      NODE_ID: 4
      NODE_NAME: node4
      CLUSTER_NODE_NETWORK_NAME: pgslave2 # (default: hostname of the node)
      PARTNER_NODES: "pgmaster,pgslave1,pgslave2"
      REPLICATION_PRIMARY_HOST: pgmaster
      NODE_PRIORITY: 300 # (default: 100)
      CLEAN_OVER_REWIND: 1
      REPMGR_WAIT_POSTGRES_START_TIMEOUT: 600
    expose:
      - 5432
    volumes:
      - pgslave2:/var/lib/postgresql/data
    networks:
      cluster:
        aliases:
          - pgslave2
  pgpool:
    build:
      context: ./src
      dockerfile: Pgpool-3.7-Postgres-10.Dockerfile
    environment:
      PCP_USER: pcp_user
      PCP_PASSWORD: pcp_pass
      WAIT_BACKEND_TIMEOUT: 60
      CHECK_USER: monkey_user
      CHECK_PASSWORD: monkey_pass
      CHECK_PGCONNECT_TIMEOUT: 3 # timeout for checking if primary node is healthy
      DB_USERS: monkey_user:monkey_pass # in format user:password[,user:password[...]]
      BACKENDS: "0:pgmaster:5432:1:/var/lib/postgresql/data:ALLOW_TO_FAILOVER,1:pgslave1::::,2:pgslave2::::"
      # in format num:host:port:weight:data_directory:flag[,...]
      # defaults:
      # port: 5432
      # weight: 1
      # data_directory: /var/lib/postgresql/data
      # flag: ALLOW_TO_FAILOVER
      REQUIRE_MIN_BACKENDS: 3 # minimal number of backends to start pgpool (some might be unreachable)
      CONFIGS: "num_init_children:250,max_pool:4,client_idle_limit:900,connection_life_time:300"
      # in format variable1:value1[,variable2:value2[,...]] if CONFIGS_DELIMITER_SYMBOL=, and CONFIGS_ASSIGNMENT_SYMBOL=:
      # used for pgpool.conf file
    expose:
      - 5432
      - 9898
    networks:
      cluster:
        aliases:
          - pgpool
networks:
  cluster:
    driver: bridge
volumes:
  pgmaster:
  pgslave1:
  pgslave2:
- Add some data to the DB.
- docker-compose stop pgmaster and wait a few seconds.
- docker-compose start pgmaster
After that, every time we start pgmaster, the container stops with this error:
...
2019-01-23 13:33:16.995 UTC [266] LOG: database system is ready to accept connections
>>>>>> RECOVERY_WAL_ID is empty!
>>> Not in recovery state (anymore)
>>> Waiting for local postgres server start...
>>> Wait schema replication_db.public on pgmaster:5432(user: replication_user,password: *******), will try 60 times with delay 10 seconds (TIMEOUT=600)
>>>>>> Schema replication_db.public exists on host pgmaster:5432!
>>> Unregister the node if it was done before
DELETE 0
>>> Registering node with role standby
INFO: connecting to local node "node1" (ID: 1)
ERROR: this node should be a standby (user=replication_user password=replication_pass host=pgmaster dbname=replication_db port=5432 connect_timeout=2)
>>> Starting repmgr daemon...
[2019-01-23 13:33:47] [NOTICE] repmgrd (repmgr 4.0.6) starting up
[2019-01-23 13:33:47] [INFO] connecting to database "user=replication_user password=replication_pass host=pgmaster dbname=replication_db port=5432 connect_timeout=2"
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
[2019-01-23 13:33:47] [ERROR] this node is marked as inactive and cannot be used as a failover target
[2019-01-23 13:33:47] [HINT] Check that "repmgr (primary|standby) register" was executed for this node
[2019-01-23 13:33:47] [INFO] executing notification command for event "repmgrd_shutdown"
[2019-01-23 13:33:47] [DETAIL] command is:
/usr/local/bin/cluster/repmgr/events/router.sh 1 repmgrd_shutdown 0 "2019-01-23 13:33:47+0000" "node is inactive and cannot be used as a failover target"
[2019-01-23 13:33:47] [INFO] repmgrd terminating...
Hi mohsenasm.
Same issue. Did you fix it? Please help me!
Hi thien281087,
Nope, nothing changed.
Hi,
It seems that in repmgr 4.0 these actions are no longer there:
- repmgr standby archive-config --config-archive-dir=/tmp/repmgr-archive
- repmgr standby restore-config -D $PGDATA --config-archive-dir=/tmp/repmgr-archive
>>>>>> Archiving configs
The following command line errors were encountered:
unknown repmgr action 'standby archive-config'
Try "repmgr --help" for more information.
>>>>>> Restoring configs
The following command line errors were encountered:
unknown repmgr action 'standby restore-config'
Try "repmgr --help" for more information.
So basically, when the failed master is restarted, it comes back as a master again.
Also, the pg_xlog folder has been renamed to pg_wal, but do_rewind still contains:
echo ">>>>>> Start server to be able to rewind (weird hack to avoid dirty shutdown issue)"
rm -rf $PGDATA/pg_xlog/archive_status/
Did anyone find a solution for this?
Hi, I have the same issue (k8s/example2-single-statefulset). I went through what @msamichev wrote and found a workaround.
Looking at the log from restarting the master node, I found that the error comes from repmgr. I saw that the creator of this project had updated it several times, so I assumed it had worked fine before, and I lowered the versions of repmgr, postgres, and pgpool. (This is not an exact solution, but I couldn't change the Dockerfile, so I tried lowering the versions.)
(See https://hub.docker.com/r/postdock/pgpool and https://hub.docker.com/r/postdock/postgres)
Changed files: node.yml, pgpool.yml
- Original tags: postgres:latest-postgres11-repmgr40, pgpool:latest-pgpool37-postgres11
- Changed tags: postgres:latest-postgres10-repmgr32, pgpool:latest-pgpool37-postgres10 => it works fine.
Thank you
Having the same issue here... @WnP's solution might work, but it's a PITA to do every time a node is killed.
I will try to downgrade to Postgres-10-Repmgr-3.2 to see if it works.
Hitting the same problem. I am also unable to use repmgr node rejoin, because I get errors like:
ERROR: database is still running in state "in production"
HINT: "repmgr node rejoin" cannot be executed on a running node
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
But how can you run the rejoin command if the container isn't up?
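Since the HINT says the rejoin cannot run on a running node, it may help to check the cluster state before attempting it. The sketch below parses the "Database cluster state" line that pg_controldata prints; cluster_state is a hypothetical helper name, and it assumes pg_controldata is available wherever the data directory can be reached. It does not answer how to get a shell once the container has exited.

```shell
#!/bin/sh
# Hypothetical helper: read pg_controldata output on stdin and print the
# value of the "Database cluster state" line (e.g. "in production").
cluster_state() {
    awk -F': *' '/^Database cluster state:/ { print $2 }'
}
```

Possible usage: pg_controldata "$PGDATA" | cluster_state. Judging by the error above, a result of "in production" means repmgr will refuse the rejoin.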