Cluster shutdown after successful build from latest.yml
Hello! Thanks for the repo, it will be very useful!
An attempt to run the cluster through docker-compose with
docker-compose -f ./docker-compose/latest.yml up pgmaster pgslave1 pgslave2 pgslave3 pgslave4 pgpool backup
ends with errors like:
pgslave3_1 | psql: could not connect to server: Connection refused
pgslave3_1 | Is the server running on host "pgslave3" (172.20.0.3) and accepting
pgslave3_1 | TCP/IP connections on port 5432?
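For reference, the same connectivity check can be reproduced by hand from inside one of the slave containers. This is only a sketch, assuming the default replication credentials from latest.yml (replication_user / replication_pass / replication_db), which also appear in the logs below:

# one-off psql from the pgslave3 container against the master
docker-compose -f ./docker-compose/latest.yml exec pgslave3 \
  env PGPASSWORD=replication_pass \
  psql -h pgmaster -p 5432 -U replication_user -d replication_db -c 'SELECT 1;'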
Everything seems to be configured correctly, below I paste the logs from the cluster start.
Attaching to dockercompose_pgslave2_1, dockercompose_pgslave3_1, dockercompose_pgpool_1, dockercompose_backup_1, dockercompose_pgmaster_1, dockercompose_pgslave4_1, dockercompose_pgslave1_1
pgslave2_1 | >>> Setting up STOP handlers...
pgslave2_1 | >>> STARTING SSH (if required)...
pgslave2_1 | cp: cannot stat '/home/postgres/.ssh/keys/*': No such file or directory
pgslave2_1 | No pre-populated ssh keys!
pgslave2_1 | >>> SSH is not enabled!
pgslave2_1 | >>> STARTING POSTGRES...
pgslave2_1 | >>> SETTING UP POLYMORPHIC VARIABLES (repmgr=3+postgres=9 | repmgr=4, postgres=10)...
pgslave2_1 | >>> TUNING UP POSTGRES...
pgslave2_1 | >>> Cleaning data folder which might have some garbage...
pgslave2_1 | >>> Auto-detected master name: ''
pgslave2_1 | >>> Setting up repmgr...
pgslave2_1 | >>> Setting up repmgr config file '/etc/repmgr.conf'...
pgslave2_1 | >>> Setting up upstream node...
pgslave2_1 | cat: /var/lib/postgresql/data/standby.lock: No such file or directory
pgslave2_1 | >>> Previously Locked standby upstream node LOCKED_STANDBY=''
pgslave2_1 | >>> Waiting for upstream postgres server...
pgslave3_1 | >>> Setting up STOP handlers...
pgslave2_1 | >>> Wait schema replication_db.repmgr on pgslave1:5432(user: replication_user,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
pgslave3_1 | >>> STARTING SSH (if required)...
pgslave2_1 | >>>>>> Host pgslave1:5432 is not accessible
pgslave2_1 | psql: could not connect to server: No route to host
pgslave2_1 | Is the server running on host "pgslave1" (172.20.0.8) and accepting
pgslave2_1 | TCP/IP connections on port 5432?
pgpool_1 | >>> STARTING SSH (if required)...
pgpool_1 | >>> TUNING UP SSH CLIENT...
pgpool_1 | >>> STARTING SSH SERVER...
pgpool_1 | >>> TURNING PGPOOL...
pgpool_1 | >>> Opening access from all hosts by md5 in /usr/local/etc/pool_hba.conf
pgslave3_1 | cp: cannot stat '/home/postgres/.ssh/keys/*': No such file or directory
pgpool_1 | >>> Adding user pcp_user for PCP
pgpool_1 | >>> Creating a ~/.pcppass file for pcp_user
pgpool_1 | >>> Adding users for md5 auth
pgpool_1 | >>>>>> Adding user monkey_user
pgpool_1 | >>> Adding check user 'monkey_user' for md5 auth
pgpool_1 | >>> Adding user 'monkey_user' as check user
pgpool_1 | >>> Adding user 'monkey_user' as health-check user
pgpool_1 | >>> Adding backends
backup_1 | >>> Checking all configurations
backup_1 | >>> Configuring barman for streaming replication
pgslave3_1 | No pre-populated ssh keys!
pgslave3_1 | >>> SSH is not enabled!
pgslave3_1 | >>> STARTING POSTGRES...
backup_1 | >>> STARTING SSH (if required)...
backup_1 | >>> TUNING UP SSH CLIENT...
backup_1 | >>> STARTING SSH SERVER...
backup_1 | >>> SETUP BARMAN CRON
backup_1 | >>>>>> Backup schedule is */30 */5 * * *
backup_1 | >>> STARTING METRICS SERVER
backup_1 | >>> STARTING CRON
pgslave3_1 | >>> SETTING UP POLYMORPHIC VARIABLES (repmgr=3+postgres=9 | repmgr=4, postgres=10)...
pgslave3_1 | >>> TUNING UP POSTGRES...
pgslave3_1 | >>> Cleaning data folder which might have some garbage...
pgslave3_1 | >>> Check all partner nodes for common upstream node...
pgslave3_1 | >>>>>> Checking NODE=pgmaster...
pgslave3_1 | psql: could not connect to server: No route to host
pgslave3_1 | Is the server running on host "pgmaster" (172.20.0.6) and accepting
pgslave3_1 | TCP/IP connections on port 5432?
pgpool_1 | >>>>>> Waiting for backend 0 to start pgpool (WAIT_BACKEND_TIMEOUT=60)
pgpool_1 | 2018/09/13 13:57:08 Waiting for host: tcp://pgmaster:5432
pgslave3_1 | >>>>>> Skipping: failed to get master from the node!
pgslave3_1 | >>>>>> Checking NODE=pgslave1...
pgslave3_1 | psql: could not connect to server: Connection refused
pgslave3_1 | Is the server running on host "pgslave1" (172.20.0.8) and accepting
pgslave3_1 | TCP/IP connections on port 5432?
pgslave3_1 | >>>>>> Skipping: failed to get master from the node!
pgslave3_1 | >>>>>> Checking NODE=pgslave3...
pgslave3_1 | psql: could not connect to server: Connection refused
pgslave3_1 | Is the server running on host "pgslave3" (172.20.0.3) and accepting
pgslave3_1 | TCP/IP connections on port 5432?
pgslave3_1 | >>>>>> Skipping: failed to get master from the node!
pgslave3_1 | >>> Auto-detected master name: ''
pgslave3_1 | >>> Setting up repmgr...
pgslave3_1 | >>> Setting up repmgr config file '/etc/repmgr.conf'...
pgmaster_1 | >>> Setting up STOP handlers...
pgmaster_1 | >>> STARTING SSH (if required)...
pgmaster_1 | >>> TUNING UP SSH CLIENT...
pgslave4_1 | >>> Setting up STOP handlers...
pgslave4_1 | >>> STARTING SSH (if required)...
pgslave4_1 | cp: cannot stat '/home/postgres/.ssh/keys/*': No such file or directory
pgmaster_1 | >>> STARTING SSH SERVER...
pgslave4_1 | No pre-populated ssh keys!
pgmaster_1 | >>> STARTING POSTGRES...
pgslave4_1 | >>> SSH is not enabled!
pgslave4_1 | >>> STARTING POSTGRES...
pgslave1_1 | >>> Setting up STOP handlers...
pgmaster_1 | >>> SETTING UP POLYMORPHIC VARIABLES (repmgr=3+postgres=9 | repmgr=4, postgres=10)...
pgslave1_1 | >>> STARTING SSH (if required)...
pgmaster_1 | >>> TUNING UP POSTGRES...
pgmaster_1 | >>> Configuring /var/lib/postgresql/data/postgresql.conf
pgmaster_1 | >>>>>> Will add configs to the exists file
pgslave4_1 | >>> SETTING UP POLYMORPHIC VARIABLES (repmgr=3+postgres=9 | repmgr=4, postgres=10)...
pgslave4_1 | >>> TUNING UP POSTGRES...
pgmaster_1 | >>>>>> Adding config 'listen_addresses'=''*''
pgslave1_1 | >>> TUNING UP SSH CLIENT...
pgmaster_1 | >>>>>> Adding config 'max_replication_slots'='5'
pgslave4_1 | >>> Cleaning data folder which might have some garbage...
pgmaster_1 | >>>>>> Adding config 'shared_preload_libraries'=''repmgr''
pgslave1_1 | >>> STARTING SSH SERVER...
pgslave4_1 | >>> Auto-detected master name: ''
pgmaster_1 | >>> Check all partner nodes for common upstream node...
pgmaster_1 | >>>>>> Checking NODE=pgmaster...
pgmaster_1 | psql: could not connect to server: Connection refused
...
pgslave3_1 | >>>>>> Host pgmaster:5432 is not accessible (will try 1 times more)
pgslave3_1 | >>> Schema replication_db.repmgr is not accessible, even after 30 tries!
pgslave1_1 | >>>>>> Host pgmaster:5432 is not accessible (will try 1 times more)
pgslave1_1 | >>> Schema replication_db.repmgr is not accessible, even after 30 tries!
dockercompose_pgslave3_1 exited with code 1
dockercompose_pgslave1_1 exited with code 1
backup_1 | 2018-09-13 14:03:01,190 [79] barman.config DEBUG: Including configuration file: upstream.conf
backup_1 | 2018-09-13 14:03:01,191 [79] barman.cli DEBUG: Initialised Barman version 2.4 (config: /etc/barman.conf, args: {'server_name': ['pg_cluster'], 'format': 'console', 'quiet': False, 'command': 'show_server', 'debug': False})
backup_1 | 2018-09-13 14:03:01,205 [79] barman.backup_executor DEBUG: The default backup strategy for postgres backup_method is: concurrent_backup
backup_1 | 2018-09-13 14:03:01,205 [79] barman.server DEBUG: Retention policy for server pg_cluster: RECOVERY WINDOW OF 30 DAYS
backup_1 | 2018-09-13 14:03:01,205 [79] barman.server DEBUG: WAL retention policy for server pg_cluster: MAIN
backup_1 | 2018-09-13 14:03:01,207 [79] barman.postgres WARNING: Error retrieving PostgreSQL status: could not translate host name "pgmaster" to address: Name or service not known
backup_1 | 2018-09-13 14:03:01,208 [79] barman.postgres WARNING: Error retrieving PostgreSQL status: could not translate host name "pgmaster" to address: Name or service not known
backup_1 | 2018-09-13 14:03:01,208 [79] barman.command_wrappers DEBUG: Command: ['/usr/bin/pg_receivewal', '--version']
backup_1 | 2018-09-13 14:03:01,567 [79] barman.command_wrappers DEBUG: Command return code: 0
backup_1 | 2018-09-13 14:03:01,567 [79] barman.command_wrappers DEBUG: Command stdout: pg_receivewal (PostgreSQL) 10.5 (Debian 10.5-1.pgdg80+1)
backup_1 |
backup_1 | 2018-09-13 14:03:01,567 [79] barman.command_wrappers DEBUG: Command stderr:
backup_1 | 2018-09-13 14:03:01,569 [79] barman.postgres DEBUG: Error retrieving PostgreSQL version: could not translate host name "pgmaster" to address: Name or service not known
backup_1 | 2018-09-13 14:03:01,569 [79] barman.command_wrappers DEBUG: Command: ['/usr/bin/pg_basebackup', '--version']
backup_1 | 2018-09-13 14:03:01,936 [79] barman.command_wrappers DEBUG: Command return code: 0
backup_1 | 2018-09-13 14:03:01,936 [79] barman.command_wrappers DEBUG: Command stdout: pg_basebackup (PostgreSQL) 10.5 (Debian 10.5-1.pgdg80+1)
backup_1 |
backup_1 | 2018-09-13 14:03:01,936 [79] barman.command_wrappers DEBUG: Command stderr:
backup_1 | 2018-09-13 14:03:01,938 [79] barman.postgres DEBUG: Error retrieving PostgreSQL version: could not translate host name "pgmaster" to address: Name or service not known
backup_1 | Creating replication slot: barman_the_backupper
backup_1 | 2018-09-13 14:03:02,037 [84] barman.config DEBUG: Including configuration file: upstream.conf
backup_1 | 2018-09-13 14:03:02,037 [84] barman.cli DEBUG: Initialised Barman version 2.4 (config: /etc/barman.conf, args: {'reset': False, 'server_name': 'pg_cluster', 'format': 'console', 'stop': False, 'create_slot': True, 'quiet': False, 'drop_slot': False, 'command': 'receive_wal', 'debug': False})
backup_1 | 2018-09-13 14:03:02,051 [84] barman.backup_executor DEBUG: The default backup strategy for postgres backup_method is: concurrent_backup
backup_1 | 2018-09-13 14:03:02,051 [84] barman.server DEBUG: Retention policy for server pg_cluster: RECOVERY WINDOW OF 30 DAYS
backup_1 | 2018-09-13 14:03:02,051 [84] barman.server DEBUG: WAL retention policy for server pg_cluster: MAIN
backup_1 | ERROR: Cannot connect to server 'pg_cluster'
backup_1 | 2018-09-13 14:03:02,053 [84] barman.server ERROR: Cannot connect to server 'pg_cluster': could not translate host name "pgmaster" to address: Name or service not known
^CGracefully stopping... (press Ctrl+C again to force)
Stopping dockercompose_backup_1 ... done
Have a nice day
Well, I don't really see what happened with the master node, as you replaced it with ...
Did it start?..
I apologize for the unfortunate pasting of the logs. The master did start, but after a short round of attempts the entire cluster shut down, with the nodes getting "connection refused" errors when contacting the master. I solved this problem by changing the primary entrypoint script (src/pgsql/bin/postgres/primary/entrypoint.sh):
diff --git a/src/pgsql/bin/postgres/primary/entrypoint.sh b/src/pgsql/bin/postgres/primary/entrypoint.sh
index b8451f5..030cbc7 100755
--- a/src/pgsql/bin/postgres/primary/entrypoint.sh
+++ b/src/pgsql/bin/postgres/primary/entrypoint.sh
@@ -3,11 +3,11 @@ set -e
FORCE_RECONFIGURE=1 postgres_configure
...
echo ">>> Creating replication db '$REPLICATION_DB'"
-createdb $REPLICATION_DB -O $REPLICATION_USER
+createdb -U "${POSTGRES_USER}" "${REPLICATION_DB}" -O "${REPLICATION_USER}"
...
-echo "host replication $REPLICATION_USER 0.0.0.0/0 md5" >> $PGDATA/pg_hba.conf
+echo "host replication $REPLICATION_USER 0.0.0.0/0 trust" >> $PGDATA/pg_hba.conf
Now I am trying to figure out why nodes 2 and 4 cannot connect to the cluster and replicate data the way nodes 1 and 3 do. Log below:
pgmaster_1 | INFO: executing notification command for event "primary_register"
pgmaster_1 | DETAIL: command is:
pgmaster_1 | /usr/local/bin/cluster/repmgr/events/router.sh 1 primary_register 1 "2018-09-17 12:42:01.885953+00" ""
pgmaster_1 | [REPMGR EVENT] Node id: 1; Event type: primary_register; Success [1|0]: 1; Time: 2018-09-17 12:42:01.885953+00; Details:
pgmaster_1 | NOTICE: primary node record (id: 1) registered
pgmaster_1 | >>> Starting repmgr daemon...
pgslave3_1 | >>>>>> Schema replication_db.repmgr exists on host pgmaster:5432!
pgmaster_1 | [2018-09-17 12:42:01] [NOTICE] repmgrd (repmgr 4.0.6) starting up
pgmaster_1 | INFO: looking for configuration file in /etc
pgmaster_1 | INFO: configuration file found at: "/etc/repmgr.conf"
pgmaster_1 | [2018-09-17 12:42:01] [INFO] connecting to database "user=replication_user password=replication_pass host=pgmaster dbname=replication_db port=5432 connect_timeout=2"
pgmaster_1 | [2018-09-17 12:42:01] [NOTICE] starting monitoring of node "node1" (ID: 1)
pgmaster_1 | [2018-09-17 12:42:01] [INFO] executing notification command for event "repmgrd_start"
pgmaster_1 | [2018-09-17 12:42:01] [DETAIL] command is:
pgmaster_1 | /usr/local/bin/cluster/repmgr/events/router.sh 1 repmgrd_start 1 "2018-09-17 12:42:01.962474+00" "monitoring cluster primary \"node1\" (node ID: 1)"
pgmaster_1 | [2018-09-17 12:42:01] [NOTICE] monitoring cluster primary "node1" (node ID: 1)
pgslave3_1 | >>> REPLICATION_UPSTREAM_NODE_ID=1
pgslave3_1 | >>> Sending in background postgres start...
pgslave3_1 | >>> Waiting for upstream postgres server...
pgslave3_1 | >>> Wait schema replication_db.repmgr on pgmaster:5432(user: replication_user,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
backup_1 | 2018-09-17 12:42:02,052 [33] barman.command_wrappers DEBUG: Command return code: 0
backup_1 | 2018-09-17 12:42:02,052 [33] barman.command_wrappers DEBUG: Command stdout: pg_receivewal (PostgreSQL) 10.5 (Debian 10.5-1.pgdg80+1)
backup_1 |
backup_1 | 2018-09-17 12:42:02,052 [33] barman.command_wrappers DEBUG: Command stderr:
backup_1 | 2018-09-17 12:42:02,054 [33] barman.wal_archiver DEBUG: Look for 'barman_receive_wal' in 'synchronous_standby_names': ['']
backup_1 | 2018-09-17 12:42:02,054 [33] barman.wal_archiver DEBUG: Synchronous WAL streaming for barman_receive_wal: False
backup_1 | 2018-09-17 12:42:02,054 [33] barman.command_wrappers DEBUG: Command: ['/usr/bin/pg_basebackup', '--version']
pgslave3_1 | >>>>>> Schema replication_db.repmgr exists on host pgmaster:5432!
pgslave3_1 | >>> Starting standby node...
pgslave3_1 | >>> Instance hasn't been set up yet.
pgslave3_1 | >>> Clonning primary node...
pgslave3_1 | >>> Waiting for upstream postgres server...
pgslave3_1 | >>> Wait schema replication_db.repmgr on pgmaster:5432(user: replication_user,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
pgslave3_1 | NOTICE: destination directory "/var/lib/postgresql/data" provided
pgslave3_1 | INFO: connecting to source node
pgslave3_1 | DETAIL: connection string is: host=pgmaster user=replication_user port=5432 dbname=replication_db
pgslave3_1 | DETAIL: current installation size is 37 MB
pgslave3_1 | INFO: checking and correcting permissions on existing directory "/var/lib/postgresql/data"
pgslave3_1 | NOTICE: >>>>>> Schema replication_db.repmgr exists on host pgmaster:5432!
pgslave3_1 | starting backup (using pg_basebackup)...
pgslave3_1 | INFO: executing:
pgslave3_1 | /usr/lib/postgresql/10/bin/pg_basebackup -l "repmgr base backup" -D /var/lib/postgresql/data -h pgmaster -p 5432 -U replication_user -c fast -X stream -S repmgr_slot_4
pgslave3_1 | >>> Waiting for cloning on this node is over(if any in progress): CLEAN_UP_ON_FAIL=, INTERVAL=30
pgslave3_1 | >>> Replicated: 4
backup_1 | 2018-09-17 12:42:02,441 [33] barman.command_wrappers DEBUG: Command return code: 0
backup_1 | 2018-09-17 12:42:02,442 [33] barman.command_wrappers DEBUG: Command stdout: pg_basebackup (PostgreSQL) 10.5 (Debian 10.5-1.pgdg80+1)
backup_1 |
backup_1 | 2018-09-17 12:42:02,442 [33] barman.command_wrappers DEBUG: Command stderr:
backup_1 | Creating replication slot: barman_the_backupper
backup_1 | 2018-09-17 12:42:02,562 [38] barman.config DEBUG: Including configuration file: upstream.conf
backup_1 | 2018-09-17 12:42:02,562 [38] barman.cli DEBUG: Initialised Barman version 2.4 (config: /etc/barman.conf, args: {'reset': False, 'server_name': 'pg_cluster', 'format': 'console', 'stop': False, 'create_slot': True, 'quiet': False, 'drop_slot': False, 'command': 'receive_wal', 'debug': False})
backup_1 | 2018-09-17 12:42:02,576 [38] barman.backup_executor DEBUG: The default backup strategy for postgres backup_method is: concurrent_backup
backup_1 | 2018-09-17 12:42:02,577 [38] barman.server DEBUG: Retention policy for server pg_cluster: RECOVERY WINDOW OF 30 DAYS
backup_1 | 2018-09-17 12:42:02,577 [38] barman.server DEBUG: WAL retention policy for server pg_cluster: MAIN
backup_1 | 2018-09-17 12:42:02,579 [38] barman.server INFO: Creating physical replication slot 'barman_the_backupper' on server 'pg_cluster'
backup_1 | 2018-09-17 12:42:02,632 [38] barman.server INFO: Replication slot 'barman_the_backupper' created
backup_1 | Creating physical replication slot 'barman_the_backupper' on server 'pg_cluster'
backup_1 | Replication slot 'barman_the_backupper' created
backup_1 | 2018-09-17 12:42:02,739 [39] barman.config DEBUG: Including configuration file: upstream.conf
backup_1 | 2018-09-17 12:42:02,740 [39] barman.cli DEBUG: Initialised Barman version 2.4 (config: /etc/barman.conf, args: {'debug': False, 'command': 'cron', 'quiet': False, 'format': 'console'})
backup_1 | 2018-09-17 12:42:02,754 [39] barman.backup_executor DEBUG: The default backup strategy for postgres backup_method is: concurrent_backup
backup_1 | 2018-09-17 12:42:02,754 [39] barman.server DEBUG: Retention policy for server pg_cluster: RECOVERY WINDOW OF 30 DAYS
backup_1 | 2018-09-17 12:42:02,754 [39] barman.server DEBUG: WAL retention policy for server pg_cluster: MAIN
backup_1 | 2018-09-17 12:42:02,754 [39] barman.command_wrappers DEBUG: BarmanSubProcess: ['/usr/bin/python', '/usr/bin/barman', '-c', '/etc/barman.conf', '-q', 'archive-wal', 'pg_cluster']
pgslave3_1 | NOTICE: standby clone (using pg_basebackup) complete
pgslave3_1 | NOTICE: you can now start your PostgreSQL server
pgslave3_1 | HINT: for example: pg_ctl -D /var/lib/postgresql/data start
pgslave3_1 | HINT: after starting the server, you need to register this standby with "repmgr standby register"
pgslave3_1 | INFO: executing notification command for event "standby_clone"
pgslave3_1 | DETAIL: command is:
pgslave3_1 | /usr/local/bin/cluster/repmgr/events/router.sh 4 standby_clone 1 "2018-09-17 12:42:02.833449+00" "cloned from host \"pgmaster\", port 5432; backup method: pg_basebackup; --force: Y"
pgslave3_1 | [REPMGR EVENT] Node id: 4; Event type: standby_clone; Success [1|0]: 1; Time: 2018-09-17 12:42:02.833449+00; Details: cloned from host "pgmaster", port 5432; backup method: pg_basebackup; --force: Y
pgslave3_1 | >>> Configuring /var/lib/postgresql/data/postgresql.conf
pgslave3_1 | >>>>>> Will add configs to the exists file
pgslave3_1 | >>>>>> Adding config 'listen_addresses'=''*''
pgslave3_1 | >>>>>> Adding config 'shared_preload_libraries'=''repmgr''
pgslave3_1 | >>> Starting postgres...
pgslave3_1 | >>> Waiting for local postgres server recovery if any in progress:LAUNCH_RECOVERY_CHECK_INTERVAL=30
pgslave3_1 | >>> Recovery is in progress:
pgslave3_1 | 2018-09-17 12:42:02.928 UTC [168] LOG: listening on IPv4 address "0.0.0.0", port 5432
pgslave3_1 | 2018-09-17 12:42:02.929 UTC [168] LOG: listening on IPv6 address "::", port 5432
pgslave3_1 | 2018-09-17 12:42:02.934 UTC [168] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
pgslave3_1 | 2018-09-17 12:42:02.954 UTC [177] LOG: database system was interrupted; last known up at 2018-09-17 12:42:02 UTC
pgslave3_1 | 2018-09-17 12:42:03.029 UTC [177] LOG: entering standby mode
pgslave3_1 | 2018-09-17 12:42:03.036 UTC [177] LOG: redo starts at 0/2000028
pgslave3_1 | 2018-09-17 12:42:03.039 UTC [177] LOG: consistent recovery state reached at 0/20000F8
pgslave3_1 | 2018-09-17 12:42:03.039 UTC [168] LOG: database system is ready to accept read only connections
pgslave3_1 | 2018-09-17 12:42:03.051 UTC [181] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
backup_1 | 2018-09-17 12:42:03,112 [39] barman.command_wrappers DEBUG: BarmanSubProcess: subprocess started. pid: 40
backup_1 | 2018-09-17 12:42:03,113 [39] barman.command_wrappers DEBUG: BarmanSubProcess: ['/usr/bin/python', '/usr/bin/barman', '-c', '/etc/barman.conf', '-q', 'receive-wal', 'pg_cluster']
backup_1 | 2018-09-17 12:42:03,449 [39] barman.command_wrappers DEBUG: BarmanSubProcess: subprocess started. pid: 41
backup_1 | Starting WAL archiving for server pg_cluster
backup_1 | Starting streaming archiver for server pg_cluster
pgslave2_1 | >>>>>> Host pgslave1:5432 is not accessible (will try 27 times more)
pgslave2_1 | psql: could not connect to server: Connection refused
pgslave2_1 | Is the server running on host "pgslave1" (192.168.112.5) and accepting
pgslave2_1 | TCP/IP connections on port 5432?
pgslave4_1 | >>>>>> Host pgslave3:5432 is not accessible (will try 27 times more)
pgslave4_1 | >>>>>> Schema replication_db.repmgr exists on host pgslave3:5432!
pgslave4_1 | >>> Can not get REPLICATION_UPSTREAM_NODE_ID from LOCK file or by CURRENT_REPLICATION_PRIMARY_HOST=pgslave3
dockercompose_pgslave4_1 exited with code 1
pgslave1_1 | >>>>>> Schema replication_db.repmgr is still not accessible on host pgmaster:5432 (will try 27 times more)
pgslave1_1 | >>>>>> Schema replication_db.repmgr exists on host pgmaster:5432!
pgslave1_1 | >>> REPLICATION_UPSTREAM_NODE_ID=1
pgslave1_1 | >>> Sending in background postgres start...
pgslave1_1 | >>> Waiting for upstream postgres server...
pgslave1_1 | >>> Wait schema replication_db.repmgr on pgmaster:5432(user: replication_user,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
pgslave1_1 | >>>>>> Schema replication_db.repmgr exists on host pgmaster:5432!
pgslave1_1 | >>> Starting standby node...
pgslave1_1 | >>> Instance hasn't been set up yet.
pgslave1_1 | >>> Clonning primary node...
pgslave1_1 | >>> Waiting for upstream postgres server...
pgslave1_1 | >>> Wait schema replication_db.repmgr on pgmaster:5432(user: replication_user,password: *******), will try 30 times with delay 10 seconds (TIMEOUT=300)
pgslave1_1 | NOTICE: destination directory "/var/lib/postgresql/data" provided
pgslave1_1 | INFO: connecting to source node
pgslave1_1 | DETAIL: connection string is: host=pgmaster user=replication_user port=5432 dbname=replication_db
pgslave1_1 | DETAIL: current installation size is 37 MB
pgslave1_1 | >>>>>> Schema replication_db.repmgr exists on host pgmaster:5432!
pgslave1_1 | >>> Waiting for cloning on this node is over(if any in progress): CLEAN_UP_ON_FAIL=, INTERVAL=30
pgslave1_1 | INFO: checking and correcting permissions on existing directory "/var/lib/postgresql/data"
pgslave1_1 | >>> Replicated: 4
pgslave1_1 | NOTICE: starting backup (using pg_basebackup)...
pgslave1_1 | INFO: executing:
pgslave1_1 | /usr/lib/postgresql/10/bin/pg_basebackup -l "repmgr base backup" -D /var/lib/postgresql/data -h pgmaster -p 5432 -U replication_user -c fast -X stream -S repmgr_slot_2
pgslave1_1 | NOTICE: standby clone (using pg_basebackup) complete
pgslave1_1 | NOTICE: you can now start your PostgreSQL server
pgslave1_1 | HINT: for example: pg_ctl -D /var/lib/postgresql/data start
pgslave1_1 | HINT: after starting the server, you need to register this standby with "repmgr standby register"
pgslave1_1 | INFO: executing notification command for event "standby_clone"
pgslave1_1 | DETAIL: command is:
pgslave1_1 | /usr/local/bin/cluster/repmgr/events/router.sh 2 standby_clone 1 "2018-09-17 12:42:12.787654+00" "cloned from host \"pgmaster\", port 5432; backup method: pg_basebackup; --force: Y"
pgslave1_1 | [REPMGR EVENT] Node id: 2; Event type: standby_clone; Success [1|0]: 1; Time: 2018-09-17 12:42:12.787654+00; Details: cloned from host "pgmaster", port 5432; backup method: pg_basebackup; --force: Y
pgslave1_1 | >>> Configuring /var/lib/postgresql/data/postgresql.conf
pgslave1_1 | >>>>>> Will add configs to the exists file
pgslave1_1 | >>>>>> Adding config 'max_replication_slots'='10'
pgslave1_1 | >>>>>> Adding config 'shared_preload_libraries'=''repmgr''
pgslave1_1 | >>> Starting postgres...
pgslave1_1 | >>> Waiting for local postgres server recovery if any in progress:LAUNCH_RECOVERY_CHECK_INTERVAL=30
pgslave1_1 | >>> Recovery is in progress:
pgslave1_1 | 2018-09-17 12:42:12.954 UTC [184] LOG: listening on IPv4 address "0.0.0.0", port 5432
pgslave1_1 | 2018-09-17 12:42:12.954 UTC [184] LOG: listening on IPv6 address "::", port 5432
pgpool_1 | 2018/09/17 12:42:12 Connected to tcp://pgslave1:5432
pgpool_1 | >>>>>> Adding backend 1
pgpool_1 | >>>>>> Waiting for backend 3 to start pgpool (WAIT_BACKEND_TIMEOUT=60)
pgpool_1 | 2018/09/17 12:42:12 Waiting for host: tcp://pgslave3:5432
pgpool_1 | 2018/09/17 12:42:12 Connected to tcp://pgslave3:5432
pgslave3_1 | 2018-09-17 12:42:12.961 UTC [184] LOG: incomplete startup packet
pgpool_1 | >>>>>> Adding backend 3
pgpool_1 | >>>>>> Waiting for backend 2 to start pgpool (WAIT_BACKEND_TIMEOUT=60)
pgslave1_1 | 2018-09-17 12:42:12.963 UTC [184] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
pgpool_1 | 2018/09/17 12:42:12 Waiting for host: tcp://pgslave2:5432
pgslave1_1 | 2018-09-17 12:42:12.980 UTC [193] LOG: database system was interrupted; last known up at 2018-09-17 12:42:12 UTC
pgslave1_1 | 2018-09-17 12:42:12.980 UTC [194] LOG: incomplete startup packet
pgslave1_1 | 2018-09-17 12:42:13.044 UTC [193] LOG: entering standby mode
pgslave1_1 | 2018-09-17 12:42:13.051 UTC [193] LOG: redo starts at 0/4000028
pgslave1_1 | 2018-09-17 12:42:13.053 UTC [193] LOG: consistent recovery state reached at 0/40000F8
pgslave1_1 | 2018-09-17 12:42:13.053 UTC [184] LOG: database system is ready to accept read only connections
pgslave1_1 | 2018-09-17 12:42:13.059 UTC [198] LOG: started streaming WAL from primary at 0/5000000 on timeline 1
pgslave2_1 | >>>>>> Host pgslave1:5432 is not accessible (will try 26 times more)
pgslave2_1 | >>>>>> Schema replication_db.repmgr exists on host pgslave1:5432!
pgslave2_1 | >>> Can not get REPLICATION_UPSTREAM_NODE_ID from LOCK file or by CURRENT_REPLICATION_PRIMARY_HOST=pgslave1
dockercompose_pgslave2_1 exited with code 1
Anything new on this one? Unfortunately I have the same issue...
@paunin @czarny94 I believe the root cause is that the upstream postgres docker image "fixed" a "bug" where the postgres user existed by accident. By fixing it, they broke lots of builds that depended on the postgres user existing; see: https://github.com/docker-library/postgres/issues/497
I think the suggested fix createdb -U "${POSTGRES_USER}" is a good start.
I gave it a go and got pgslave1 and pgslave2 up and running with these fixes: #194
Only changing createdb -U "${POSTGRES_USER}" is not enough, since other places in the code base depend on using gosu to switch to the postgres unix user, which must match a role in postgres.
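To illustrate that second point, here is a minimal sketch of what the primary entrypoint effectively needs; the POSTGRES_USER default and the explicit gosu calls are assumptions for illustration, not the exact code of #194:

# the unix account gosu switches to must also exist as a database role,
# and createdb must connect as an existing superuser role
: "${POSTGRES_USER:=postgres}"                       # assumed default superuser name
gosu postgres createdb -U "${POSTGRES_USER}" "${REPLICATION_DB}" -O "${REPLICATION_USER}"
gosu postgres psql -U "${POSTGRES_USER}" -c '\du'    # sanity check: the roles exist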
I have the same issue with the latest compose file.
pgpool logs:
>>> Opening access from all hosts by md5 in /usr/local/etc/pool_hba.conf
>>> Adding user pcp_user for PCP
>>> Creating a ~/.pcppass file for pcp_user
>>> Adding users for md5 auth
>>>>>> Adding user monkey_user
>>> Adding check user 'monkey_user' for md5 auth
>>> Adding user 'monkey_user' as check user
>>> Adding user 'monkey_user' as health-check user
>>> Adding backends
>>>>>> Waiting for backend 0 to start pgpool (WAIT_BACKEND_TIMEOUT=60)
2018/10/18 02:53:07 Waiting for host: tcp://pgmaster:5432
2018/10/18 02:53:07 Connected to tcp://pgmaster:5432
>>>>>> Adding backend 0
>>>>>> Waiting for backend 1 to start pgpool (WAIT_BACKEND_TIMEOUT=60)
2018/10/18 02:53:07 Waiting for host: tcp://pgslave1:5432
2018/10/18 02:54:07 Timeout after 1m0s waiting on dependencies to become available: [tcp://pgslave1:5432]
>>>>> Will not add node 1 - it's unreachable!
>>>>> Waiting for backend 3 to start pgpool (WAIT_BACKEND_TIMEOUT=60)
2018/10/18 02:54:07 Waiting for host: tcp://pgslave3:5432
2018/10/18 02:55:07 Timeout after 1m0s waiting on dependencies to become available: [tcp://pgslave3:5432]
>>>>> Will not add node 3 - it's unreachable!
>>>>> Waiting for backend 2 to start pgpool (WAIT_BACKEND_TIMEOUT=60)
2018/10/18 02:55:07 Waiting for host: tcp://pgslave2:5432
2018/10/18 02:56:07 Timeout after 1m0s waiting on dependencies to become available: [tcp://pgslave2:5432]
>>>>>> Will not add node 2 - it's unreachable!
>>> Checking if we have enough backends to start
>>>>>> Can not start pgpool with REQUIRE_MIN_BACKENDS=3, BACKENDS_COUNT=1
pgmaster logs:
>>>>> Host pgmaster:5432 is not accessible (will try 5 times more)
2018-10-18 03:26:09.323 UTC [206] FATAL: database "replication_db" does not exist
psql: FATAL: database "replication_db" does not exist
2018-10-18 03:26:09.639 UTC [208] FATAL: database "replication_db" does not exist
2018-10-18 03:26:19.287 UTC [209] FATAL: database "replication_db" does not exist
>>>>>> Host pgmaster:5432 is not accessible (will try 4 times more)
2018-10-18 03:26:19.395 UTC [219] FATAL: database "replication_db" does not exist
psql: FATAL: database "replication_db" does not exist
2018-10-18 03:26:19.714 UTC [221] FATAL: database "replication_db" does not exist
2018-10-18 03:26:29.374 UTC [222] FATAL: database "replication_db" does not exist
>>>>>> Host pgmaster:5432 is not accessible (will try 3 times more)
2018-10-18 03:26:29.439 UTC [232] FATAL: database "replication_db" does not exist
psql: FATAL: database "replication_db" does not exist
2018-10-18 03:26:29.769 UTC [234] FATAL: database "replication_db" does not exist
>>>>>> Host pgmaster:5432 is not accessible (will try 2 times more)
2018-10-18 03:26:39.447 UTC [237] FATAL: database "replication_db" does not exist
2018-10-18 03:26:39.513 UTC [245] FATAL: database "replication_db" does not exist
psql: FATAL: database "replication_db" does not exist
2018-10-18 03:26:39.825 UTC [247] FATAL: database "replication_db" does not exist
>>>>>> Host pgmaster:5432 is not accessible (will try 1 times more)
>>> Schema replication_db.public is not accessible, even after 9 tries!