PostDock pgmaster fails after restart

Open jewelzqiu opened this issue 5 years ago • 10 comments

I am using docker-compose with 1 pgmaster and 1 pgslave. After all containers have started, I stop pgmaster with docker stop. When I then run docker start pgmaster, it fails. The same happens with docker-compose restart.

I basically just removed pgslave2-4 from postgres-10_repmgr-4.0_pgpool-3.7_barman-2.4.yml. Is any further configuration needed?

pgmaster logs after restart:

pgmaster_1  | >>>>>> RECOVERY_WAL_ID is empty!
pgmaster_1  | >>> Not in recovery state (anymore)
pgmaster_1  | >>> Waiting for local postgres server start...
pgmaster_1  | >>> Wait schema replication_db.public on pgmaster:5432(user: replication_user,password: *******), will try 9 times with delay 10 seconds (TIMEOUT=90)
pgmaster_1  | >>>>>> Schema replication_db.public exists on host pgmaster:5432!
pgmaster_1  | >>> Unregister the node if it was done before
pgmaster_1  | DELETE 0
pgmaster_1  | >>> Registering node with role standby
pgmaster_1  | INFO: connecting to local node "node1" (ID: 1)
pgmaster_1  | ERROR: this node should be a standby (user=replication_user password=replication_pass host=pgmaster dbname=replication_db port=5432 connect_timeout=2)
pgmaster_1  | >>> Starting repmgr daemon...
pgmaster_1  | [2018-07-12 08:26:46] [NOTICE] repmgrd (repmgr 4.0.6) starting up
pgmaster_1  | [2018-07-12 08:26:46] [INFO] connecting to database "user=replication_user password=replication_pass host=pgmaster dbname=replication_db port=5432 connect_timeout=2"
pgmaster_1  | INFO: looking for configuration file in /etc
pgmaster_1  | INFO: configuration file found at: "/etc/repmgr.conf"
pgmaster_1  | [2018-07-12 08:26:46] [ERROR] this node is marked as inactive and cannot be used as a failover target
pgmaster_1  | [2018-07-12 08:26:46] [HINT] Check that "repmgr (primary|standby) register" was executed for this node
pgmaster_1  | [2018-07-12 08:26:46] [INFO] executing notification command for event "repmgrd_shutdown"
pgmaster_1  | [2018-07-12 08:26:46] [DETAIL] command is:
pgmaster_1  |   /usr/local/bin/cluster/repmgr/events/router.sh 1 repmgrd_shutdown 0 "2018-07-12 08:26:46+0000" "node is inactive and cannot be used as a failover target"
pgmaster_1  | [2018-07-12 08:26:46] [INFO] repmgrd terminating...
pgmaster_1  | >>> Foreground processes returned code: '1'
dev_pgmaster_1 exited with code 0

Jul 12 '18 08:07 jewelzqiu

Hello.

Same issue. In postgres-10_pgpool-3.7_barman-2.4.yml it works.

I found some more errors in the log of the new build (postgres-10_repmgr-4.0_pgpool-3.7_barman-2.4.yml) when restarting the master. Maybe they are connected.

While running the command (in file do_rewind) "gosu postgres repmgr standby archive-config --config-archive-dir=/tmp/repmgr-archive":

>>>>>> Archiving configs
The following command line errors were encountered:
  unknown repmgr action 'standby archive-config'
Try "repmgr --help" for more information.

With postgres-10_pgpool-3.7_barman-2.4.yml the log was:

>>>>>> Archiving configs
ERROR: connection to database failed: could not connect to server: Connection refused
        Is the server running on host "pgmaster" (172.19.0.4) and accepting
        TCP/IP connections on port 5432?

While running the command (in file do_rewind) "gosu postgres repmgr standby restore-config -D $PGDATA --config-archive-dir=/tmp/repmgr-archive":

>>>>>> Restoring configs
The following command line errors were encountered:
  unknown repmgr action 'standby restore-config'
Try "repmgr --help" for more information.

While running the command (in file do_rewind) "gosu postgres repmgr -h $CURRENT_REPLICATION_PRIMARY_HOST -p $REPLICATION_PRIMARY_PORT -d $REPLICATION_DB -U $REPLICATION_USER -D $PGDATA standby follow -W --log-level DEBUG --verbose":

>>>>>> Tell repmgr to follow upstream for the node
DEBUG: do_standby_follow()
DEBUG: connecting to: "user=replication_user password=replication_pass connect_timeout=2 dbname=replication_db host=pgmaster port=5432 fallback_application_name=repmgr"
ERROR: connection to database failed:
  could not connect to server: Connection refused
        Is the server running on host "pgmaster" (172.19.0.2) and accepting
        TCP/IP connections on port 5432?

DETAIL: attempted to connect using:
  user=replication_user password=replication_pass connect_timeout=2 dbname=replication_db host=pgmaster port=5432 fallback_application_name=repmgr
HINT: use "repmgr node rejoin" to re-add an inactive node to the replication cluster
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"

The do_rewind file runs "rm -rf $PGDATA/pg_xlog/archive_status/", but in PostgreSQL 10 the pg_xlog directory no longer exists. It was renamed to pg_wal.
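
A minimal sketch of how the cleanup step could cope with both directory names, assuming the script only needs to find whichever WAL directory exists under $PGDATA. The function names wal_dir and clean_archive_status are hypothetical and are not part of PostDock:

```shell
#!/bin/sh
# Print the WAL directory that actually exists under a data directory.
# PostgreSQL 10 renamed pg_xlog to pg_wal; older versions still use pg_xlog.
wal_dir() {
    pgdata=$1
    if [ -d "$pgdata/pg_wal" ]; then
        printf '%s\n' "$pgdata/pg_wal"
    elif [ -d "$pgdata/pg_xlog" ]; then
        printf '%s\n' "$pgdata/pg_xlog"
    else
        return 1
    fi
}

# Hypothetical helper: remove archive_status for either layout,
# mirroring the "rm -rf .../archive_status/" call quoted above.
clean_archive_status() {
    dir=$(wal_dir "$1") || { echo "no WAL directory under $1" >&2; return 1; }
    rm -rf "$dir/archive_status/"
}
```

It would be called as clean_archive_status "$PGDATA" in place of the hard-coded pg_xlog path. This only addresses the directory rename, not the missing repmgr actions.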

Jul 12 '18 13:07 msamichev

Aug 28 '18 14:08 stepan-romankov

Same issue here, I've fixed it using the following process:

In this example the ex-primary node is pgmaster and the newly elected primary node is pgslave3.

1. On pgmaster, remove everything under /var/lib/postgresql/data/ and restart it.
2. Attach pgmaster to pgpool using pcp_attach_node -h localhost -U pcp_user 0, where 0 is the pgmaster node id.
3. Change the barman service REPLICATION_HOST environment variable to point to pgslave3.
4. Clean the barman replication slot on pgmaster using: SELECT pg_drop_replication_slot('barman_the_backupper')
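
The steps above can be sketched as a dry run that only prints what to do for a given pair of nodes. The function name print_recovery_plan is hypothetical, and the "rm -rf .../*" line is one assumed way to empty the data directory. Nothing is executed, so each line can be checked before it is run by hand on the right container:

```shell
#!/bin/sh
# Dry run: prints the recovery steps, does not execute anything.
# $1 = ex-primary host, $2 = newly elected primary, $3 = pgpool node id of the ex-primary
print_recovery_plan() {
    old_primary=$1
    new_primary=$2
    pgpool_node_id=$3
    cat <<EOF
# 1. On $old_primary: empty the data directory, then restart the node
rm -rf /var/lib/postgresql/data/*
# 2. On pgpool: re-attach $old_primary
pcp_attach_node -h localhost -U pcp_user $pgpool_node_id
# 3. In the barman service environment: REPLICATION_HOST=$new_primary
# 4. On $old_primary (psql): drop the stale barman replication slot
SELECT pg_drop_replication_slot('barman_the_backupper');
EOF
}
```

For the example in this comment the call would be print_recovery_plan pgmaster pgslave3 0.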

Nov 05 '18 17:11 WnP

Hi. Same issue here. Sample scenario:

Run docker-compose up -d with this compose file:

version: "3.3"

services:
    pgmaster:
        build:
            context: ./src
            dockerfile: Postgres-10-Repmgr-4.0.Dockerfile
        environment:
            NODE_ID: 1 # Integer number of node (not required if can be extracted from NODE_NAME var, e.g. node-45 => 1045)
            NODE_NAME: node1 # Node name
            CLUSTER_NODE_NETWORK_NAME: pgmaster # (default: hostname of the node)

            PARTNER_NODES: "pgmaster,pgslave1,pgslave2"
            REPLICATION_PRIMARY_HOST: pgmaster # That should be ignored on the same node

            NODE_PRIORITY: 100  # (default: 100)
            #database we want to use for application
            POSTGRES_PASSWORD: monkey_pass
            POSTGRES_USER: monkey_user
            POSTGRES_DB: monkey_db
            CLEAN_OVER_REWIND: 0
            CONFIGS_DELIMITER_SYMBOL: ;
            CONFIGS: "listen_addresses:'*';max_replication_slots:5;wal_keep_segments:250;shared_buffers:300MB;archive_command:'/bin/true'"
                                  # in format variable1:value1[,variable2:value2[,...]] if CONFIGS_DELIMITER_SYMBOL=, and CONFIGS_ASSIGNMENT_SYMBOL=:
                                  # used for pgpool.conf file
            #defaults:
            CLUSTER_NAME: pg_cluster # default is pg_cluster
            REPLICATION_DB: replication_db # default is replication_db
            REPLICATION_USER: replication_user # default is replication_user
            REPLICATION_PASSWORD: replication_pass # default is replication_pass
            REPMGR_WAIT_POSTGRES_START_TIMEOUT: 600
        expose:
            - 5432
        volumes:
            - pgmaster:/var/lib/postgresql/data
        networks:
            cluster:
                aliases:
                    - pgmaster

    pgslave1:
        build:
            context: ./src
            dockerfile: Postgres-10-Repmgr-4.0.Dockerfile
        environment:
            NODE_ID: 2
            NODE_NAME: node2
            CLUSTER_NODE_NETWORK_NAME: pgslave1 # (default: hostname of the node)
            PARTNER_NODES: "pgmaster,pgslave1,pgslave2"
            REPLICATION_PRIMARY_HOST: pgmaster
            NODE_PRIORITY: 200
            CLEAN_OVER_REWIND: 1
            REPMGR_WAIT_POSTGRES_START_TIMEOUT: 600
        expose:
            - 5432
        volumes:
            - pgslave1:/var/lib/postgresql/data
        networks:
            cluster:
                aliases:
                    - pgslave1

    pgslave2:
        build:
            context: ./src
            dockerfile: Postgres-10-Repmgr-4.0.Dockerfile
        environment:
            NODE_ID: 4
            NODE_NAME: node4
            CLUSTER_NODE_NETWORK_NAME: pgslave2 # (default: hostname of the node)
            PARTNER_NODES: "pgmaster,pgslave1,pgslave2"
            REPLICATION_PRIMARY_HOST: pgmaster
            NODE_PRIORITY: 300  # (default: 100)
            CLEAN_OVER_REWIND: 1
            REPMGR_WAIT_POSTGRES_START_TIMEOUT: 600
        expose:
            - 5432
        volumes:
            - pgslave2:/var/lib/postgresql/data
        networks:
            cluster:
                aliases:
                    - pgslave2

    pgpool:
        build:
            context: ./src
            dockerfile: Pgpool-3.7-Postgres-10.Dockerfile
        environment:
            PCP_USER: pcp_user
            PCP_PASSWORD: pcp_pass
            WAIT_BACKEND_TIMEOUT: 60

            CHECK_USER: monkey_user
            CHECK_PASSWORD: monkey_pass
            CHECK_PGCONNECT_TIMEOUT: 3 # timeout for checking if primary node is healthy
            DB_USERS: monkey_user:monkey_pass # in format user:password[,user:password[...]]
            BACKENDS: "0:pgmaster:5432:1:/var/lib/postgresql/data:ALLOW_TO_FAILOVER,1:pgslave1::::,2:pgslave2::::"
                      # in format num:host:port:weight:data_directory:flag[,...]
                      # defaults:
                      #   port: 5432
                      #   weight: 1
                      #   data_directory: /var/lib/postgresql/data
                      #   flag: ALLOW_TO_FAILOVER
            REQUIRE_MIN_BACKENDS: 3 # minimal number of backends to start pgpool (some might be unreachable)
            CONFIGS: "num_init_children:250,max_pool:4,client_idle_limit:900,connection_life_time:300"
                      # in format variable1:value1[,variable2:value2[,...]] if CONFIGS_DELIMITER_SYMBOL=, and CONFIGS_ASSIGNMENT_SYMBOL=:
                      # used for pgpool.conf file
        expose:
            - 5432
            - 9898
        networks:
            cluster:
                aliases:
                    - pgpool

networks:
    cluster:
        driver: bridge

volumes:
    pgmaster:
    pgslave1:
    pgslave2:

Add some data to the DB.
Run docker-compose stop pgmaster and wait a few seconds.
Run docker-compose start pgmaster.

After that, every time we start pgmaster, the container stops with this error:

...
2019-01-23 13:33:16.995 UTC [266] LOG:  database system is ready to accept connections
>>>>>> RECOVERY_WAL_ID is empty!
>>> Not in recovery state (anymore)
>>> Waiting for local postgres server start...
>>> Wait schema replication_db.public on pgmaster:5432(user: replication_user,password: *******), will try 60 times with delay 10 seconds (TIMEOUT=600)
>>>>>> Schema replication_db.public exists on host pgmaster:5432!
>>> Unregister the node if it was done before
DELETE 0
>>> Registering node with role standby
INFO: connecting to local node "node1" (ID: 1)
ERROR: this node should be a standby (user=replication_user password=replication_pass host=pgmaster dbname=replication_db port=5432 connect_timeout=2)
>>> Starting repmgr daemon...
[2019-01-23 13:33:47] [NOTICE] repmgrd (repmgr 4.0.6) starting up
[2019-01-23 13:33:47] [INFO] connecting to database "user=replication_user password=replication_pass host=pgmaster dbname=replication_db port=5432 connect_timeout=2"
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
[2019-01-23 13:33:47] [ERROR] this node is marked as inactive and cannot be used as a failover target
[2019-01-23 13:33:47] [HINT] Check that "repmgr (primary|standby) register" was executed for this node
[2019-01-23 13:33:47] [INFO] executing notification command for event "repmgrd_shutdown"
[2019-01-23 13:33:47] [DETAIL] command is:
  /usr/local/bin/cluster/repmgr/events/router.sh 1 repmgrd_shutdown 0 "2019-01-23 13:33:47+0000" "node is inactive and cannot be used as a failover target"
[2019-01-23 13:33:47] [INFO] repmgrd terminating...
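
To tell this failure apart from other startup problems, a small check over the container log may help. This is only a sketch: the function name is_restart_failure is hypothetical, and it simply looks for the two error messages shown in the log above:

```shell
#!/bin/sh
# Reads a pgmaster log on stdin. Succeeds only if both error messages
# from this issue are present: the failed standby registration and the
# repmgrd "inactive node" shutdown.
is_restart_failure() {
    log=$(cat)
    printf '%s\n' "$log" | grep -q 'ERROR: this node should be a standby' &&
    printf '%s\n' "$log" | grep -q 'this node is marked as inactive and cannot be used as a failover target'
}
```

Usage: docker-compose logs pgmaster | is_restart_failure && echo "same restart failure"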

Jan 23 '19 14:01 mohsenasm

Hi mohsenasm.

Same issue. Did you manage to fix it? Please help!

Jun 21 '19 01:06 thien281087

Hi thien281087,

Nope, nothing changed.

Jun 21 '19 05:06 mohsenasm

Hi,

It seems that in repmgr 4.0 these actions no longer exist:

repmgr standby archive-config --config-archive-dir=/tmp/repmgr-archive
repmgr standby restore-config -D $PGDATA --config-archive-dir=/tmp/repmgr-archive

>>>>>> Archiving configs
The following command line errors were encountered:
  unknown repmgr action 'standby archive-config'
Try "repmgr --help" for more information.

>>>>>> Restoring configs
The following command line errors were encountered:
  unknown repmgr action 'standby restore-config'
Try "repmgr --help" for more information.

So basically, when the master fails, it comes back as a master again.

Also, the pg_xlog directory has been renamed to pg_wal, but the script still uses the old name:

    echo ">>>>>> Start server to be able to rewind (weird hack to avoid dirty shutdown issue)"
    rm -rf $PGDATA/pg_xlog/archive_status/

Did anyone find a solution for this?

Jul 12 '19 12:07 hrvatskibogmars

Hi, I hit the same issue too (k8s/example2-single-statefulset). Based on @msamichev's comment, I found a workaround.

Looking at the log when restarting the master node, I found errors from repmgr. The project's author has updated it several times, so I assumed it had worked with earlier versions, and I lowered the versions of repmgr, postgres, and pgpool. (This is not a proper fix, but I couldn't change the Dockerfile, so I tried lowering the versions instead.)

(See https://hub.docker.com/r/postdock/pgpool and https://hub.docker.com/r/postdock/postgres)

Files changed: node.yml, pgpool.yml

- Original tags: postgres:latest-postgres11-repmgr40, pgpool:latest-pgpool37-postgres11

- Changed tags: postgres:latest-postgres10-repmgr32, pgpool:latest-pgpool37-postgres10 => it works fine.
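
The tag change can be sketched as a small filter. It assumes the target tag is spelled latest-postgres10-repmgr32 (worth confirming against the tag list on Docker Hub) and that the manifests reference the images by exactly these tags. The function name downgrade_tags is hypothetical:

```shell
#!/bin/sh
# Rewrites the image tags from the postgres11/repmgr40 builds to the
# postgres10/repmgr32 builds. Reads a manifest on stdin, writes to stdout.
downgrade_tags() {
    sed -e 's|postgres:latest-postgres11-repmgr40|postgres:latest-postgres10-repmgr32|' \
        -e 's|pgpool:latest-pgpool37-postgres11|pgpool:latest-pgpool37-postgres10|'
}
```

Usage: downgrade_tags < node.yml > node.yml.new && mv node.yml.new node.yml (and the same for pgpool.yml).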

Thank you

Nov 05 '20 07:11 eunbok

Having the same issue here. @WnP's solution might work, but it's a PITA to do every time a node is killed.

I will try to downgrade to Postgres-10-Repmgr-3.2 to see if it works

Feb 18 '21 02:02 dwjorgeb

Hitting the same problem. I am also unable to use repmgr node rejoin, as I get errors like:

ERROR: database is still running in state "in production"
HINT: "repmgr node rejoin" cannot be executed on a running node
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"

But how can you run the rejoin command if the container isn't up?

Feb 21 '21 01:02 webdobe
