
Unrecoverable state after failed switchover

Open Navid2zp opened this issue 4 years ago • 8 comments

Hello,

I have a 3-node setup: 1 primary and 2 secondaries.

The primary node's server went down and another node was promoted to primary just fine. After all servers became available again, I tried a switchover so the primary could be moved back to the server that had gone down (it has more resources). The switchover failed, and now none of the nodes is primary and I don't know how to recover.

PostgreSQL version:

$ psql --version
psql (PostgreSQL) 13.2 (Ubuntu 13.2-1.pgdg20.04+1)

pg_auto_failover version:

$ pg_autoctl --version
pg_autoctl version 1.5.1
pg_autoctl extension version 1.5
compiled with PostgreSQL 13.2 (Ubuntu 13.2-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
compatible with Postgres 10, 11, 12, and 13

OS:

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal

switchover:

$ pg_autoctl perform switchover --pgdata /var/lib/postgresql/monitor --formation worker-3

19:15:08 2882298 INFO  Listening monitor notifications about state changes in formation "worker-3" and group 0
19:15:08 2882298 INFO  Following table displays times when notifications are received
    Time |    Name |  Node |          Host:Port |       Current State |      Assigned State
---------+---------+-------+--------------------+---------------------+--------------------
19:15:08 | node_16 |    19 | 157.*.*.*:5436 |             primary |            draining
19:15:08 | node_16 |    19 | 157.*.*.*:5436 |            draining |            draining
19:15:08 | node_14 |    17 | 116.*.*.*:5436 |           secondary |          report_lsn
19:15:08 | node_15 |    18 |  78.*.*.*:5436 |           secondary |          report_lsn
19:15:09 | node_14 |    17 | 116.*.*.*:5436 |          report_lsn |          report_lsn
19:16:09 2882298 ERROR Failed to receive monitor's notifications
19:16:09 2882298 ERROR Failed to wait until a new primary has been notified

Current state:

$ pg_autoctl show state --pgdata /var/lib/postgresql/monitor --formation worker-3

   Name |  Node |          Host:Port |        LSN |   Connection |       Current State |      Assigned State
--------+-------+--------------------+------------+--------------+---------------------+--------------------
node_14 |    17 | 116.*.*.*:5436 | 4/E98B3568 |       none ! |          report_lsn |          report_lsn
node_15 |    18 |  78.*.*.*:5436 | 4/E9000000 |  read-only ! |           secondary |          report_lsn
node_16 |    19 | 157.*.*.*:5436 | 4/E98B34F0 |       none ! |            draining |            draining

Service logs:

node_14 - 116.*.*.*:5436:

Apr 28 21:50:06 node_14 pg_autoctl[11959]: 21:50:05 11959 INFO   /usr/bin/pg_autoctl do service postgres --pgdata /var/lib/postgresql/worker-3 -v
Apr 28 21:50:06 node_14 pg_autoctl[11946]: 21:50:05 11946 INFO  Started pg_autoctl node-active service with pid 11960
Apr 28 21:50:06 node_14 pg_autoctl[11960]: 21:50:05 11960 INFO   /usr/bin/pg_autoctl do service node-active --pgdata /var/lib/postgresql/worker-3 -v
Apr 28 21:50:06 node_14 pg_autoctl[11960]: 21:50:06 11960 INFO  Reloaded the new configuration from "/var/lib/postgresql/.config/pg_autoctl/var/lib/postgresql/worker-3/pg_autoctl.cfg"
Apr 28 21:50:06 node_14 pg_autoctl[11960]: 21:50:06 11960 INFO  pg_autoctl service is running, current state is "report_lsn"
Apr 28 21:50:06 node_14 pg_autoctl[11960]: 21:50:06 11960 INFO  Fetched current list of 2 other nodes from the monitor to update HBA rules, including 2 changes.
Apr 28 21:50:06 node_14 pg_autoctl[11960]: 21:50:06 11960 INFO  Ensuring HBA rules for node 18 "node_15" (78.*.*.*:5436)
Apr 28 21:50:06 node_14 pg_autoctl[11960]: 21:50:06 11960 INFO  Ensuring HBA rules for node 19 "node_16" (157.*.*.*:5436)

node_15 - 78.*.*.*:5436:

Apr 28 22:06:09 node_15 pg_autoctl[3514691]: 22:06:09 3514691 ERROR 2021-04-28 22:05:59.371 CEST [3521606] WARNING:  specified neither primary_conninfo nor restore_command
Apr 28 22:06:09 node_15 pg_autoctl[3514691]: 22:06:09 3514691 ERROR 2021-04-28 22:05:59.371 CEST [3521606] HINT:  The database server will regularly poll the pg_wal subdirectory to che>
Apr 28 22:06:09 node_15 pg_autoctl[3514691]: 22:06:09 3514691 ERROR 2021-04-28 22:05:59.371 CEST [3521606] LOG:  entering standby mode
Apr 28 22:06:09 node_15 pg_autoctl[3514691]: 22:06:09 3514691 FATAL 2021-04-28 22:05:59.372 CEST [3521606] FATAL:  requested timeline 8 is not a child of this server's history
Apr 28 22:06:09 node_15 pg_autoctl[3514691]: 22:06:09 3514691 ERROR 2021-04-28 22:05:59.372 CEST [3521606] DETAIL:  Latest checkpoint is at 4/E985EC90 on timeline 7, but in the history>
Apr 28 22:06:09 node_15 pg_autoctl[3514691]: 22:06:09 3514691 ERROR 2021-04-28 22:05:59.374 CEST [3521604] LOG:  startup process (PID 3521606) exited with exit code 1
Apr 28 22:06:09 node_15 pg_autoctl[3514691]: 22:06:09 3514691 ERROR 2021-04-28 22:05:59.374 CEST [3521604] LOG:  aborting startup due to startup process failure
Apr 28 22:06:09 node_15 pg_autoctl[3514691]: 22:06:09 3514691 ERROR 2021-04-28 22:05:59.380 CEST [3521604] LOG:  database system is shut down
Apr 28 22:06:09 node_15 pg_autoctl[3514691]: 22:06:09 3514691 WARN  Failed to start Postgres instance at "/var/lib/postgresql/worker-3"
Apr 28 22:06:09 node_15 pg_autoctl[3521671]: 22:06:09 3521671 INFO   /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/worker-3 -p 5436 -h *
Apr 28 22:06:16 node_15 pg_autoctl[3514693]: 22:06:16 3514693 ERROR Failed to open file "/var/lib/postgresql/worker-3/postmaster.pid": No such file or directory
Apr 28 22:06:16 node_15 pg_autoctl[3514693]: 22:06:16 3514693 INFO  Is PostgreSQL at "/var/lib/postgresql/worker-3" up and running?
Apr 28 22:06:16 node_15 pg_autoctl[3514693]: 22:06:16 3514693 ERROR Failed to get Postgres pid, see above for details
Apr 28 22:06:16 node_15 pg_autoctl[3514693]: 22:06:16 3514693 ERROR Failed to ensure that Postgres is running in "/var/lib/postgresql/worker-3"
Apr 28 22:06:16 node_15 pg_autoctl[3514693]: 22:06:16 3514693 ERROR Failed to restart Postgres after changing its primary conninfo, see above for details
Apr 28 22:06:16 node_15 pg_autoctl[3514693]: 22:06:16 3514693 ERROR Failed to disconnect from failed primary node, see above for details
Apr 28 22:06:16 node_15 pg_autoctl[3514693]: 22:06:16 3514693 ERROR Failed to transition from state "secondary" to state "report_lsn", see above.
Apr 28 22:06:16 node_15 pg_autoctl[3514693]: 22:06:16 3514693 ERROR Failed to transition to state "report_lsn", retrying...
Apr 28 22:06:17 node_15 pg_autoctl[3514693]: 22:06:17 3514693 WARN  Postgres is not running and we are in state secondary
Apr 28 22:06:17 node_15 pg_autoctl[3514693]: 22:06:17 3514693 WARN  Failed to update the keeper's state from the local PostgreSQL instance.
Apr 28 22:06:17 node_15 pg_autoctl[3514693]: 22:06:17 3514693 INFO  Monitor assigned new state "report_lsn"
Apr 28 22:06:17 node_15 pg_autoctl[3514693]: 22:06:17 3514693 INFO  FSM transition from "secondary" to "report_lsn": Reporting the last write-ahead log location received
Apr 28 22:06:17 node_15 pg_autoctl[3514693]: 22:06:17 3514693 INFO  Restarting standby node to disconnect replication from failed primary node, to prepare failover
Apr 28 22:06:17 node_15 pg_autoctl[3514693]: 22:06:17 3514693 INFO  Stopping Postgres at "/var/lib/postgresql/worker-3"
Apr 28 22:06:17 node_15 pg_autoctl[3514693]: 22:06:17 3514693 INFO  Creating the standby signal file at "/var/lib/postgresql/worker-3/standby.signal", and replication setup at "/var/li>
Apr 28 22:06:17 node_15 pg_autoctl[3514693]: 22:06:17 3514693 INFO  Contents of "/var/lib/postgresql/worker-3/postgresql-auto-failover-standby.conf" have changed, overwriting
Apr 28 22:06:17 node_15 pg_autoctl[3514693]: 22:06:17 3514693 INFO  Restarting Postgres at "/var/lib/postgresql/worker-3"
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR Failed to open file "/var/lib/postgresql/worker-3/postmaster.pid": No such file or directory
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 INFO  Is PostgreSQL at "/var/lib/postgresql/worker-3" up and running?
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR Failed to get Postgres pid, see above for details
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 WARN  Postgres logs from "/var/lib/postgresql/worker-3/startup.log":
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.281 CEST [3521671] LOG:  number of prepared transactions has not been configured, overriding
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.281 CEST [3521671] DETAIL:  max_prepared_transactions is now set to 200
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.310 CEST [3521671] LOG:  redirecting log output to logging collector process
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.310 CEST [3521671] HINT:  Future log output will appear in directory "log".
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 WARN  Postgres logs from "/var/lib/postgresql/worker-3/log/postgresql-2021-04-28_220609.log":
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.310 CEST [3521671] LOG:  starting PostgreSQL 13.2 (Ubuntu 13.2-1.pgdg20.04+1) on x86_64-pc-linux>
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.311 CEST [3521671] LOG:  listening on IPv4 address "0.0.0.0", port 5436
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.311 CEST [3521671] LOG:  listening on IPv6 address "::", port 5436
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.313 CEST [3521671] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5436"
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.320 CEST [3521673] LOG:  database system was shut down at 2021-04-28 21:52:09 CEST
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.320 CEST [3521673] WARNING:  specified neither primary_conninfo nor restore_command
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.320 CEST [3521673] HINT:  The database server will regularly poll the pg_wal subdirectory to che>
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.320 CEST [3521673] LOG:  entering standby mode
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 FATAL 2021-04-28 22:06:09.320 CEST [3521673] FATAL:  requested timeline 8 is not a child of this server's history
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.320 CEST [3521673] DETAIL:  Latest checkpoint is at 4/E985EC90 on timeline 7, but in the history>
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.323 CEST [3521671] LOG:  startup process (PID 3521673) exited with exit code 1
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.323 CEST [3521671] LOG:  aborting startup due to startup process failure
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.329 CEST [3521671] LOG:  database system is shut down
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 WARN  Failed to start Postgres instance at "/var/lib/postgresql/worker-3"
Apr 28 22:06:19 node_15 pg_autoctl[3521734]: 22:06:19 3521734 INFO   /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/worker-3 -p 5436 -h *
Apr 28 22:06:20 node_15 pg_autoctl[3514693]: 22:06:20 3514693 WARN  Failed to read Postgres "postmaster.pid" file

node_16 - 157.*.*.*:5436:

Apr 28 21:30:44 node_16 systemd[1]: Started pg_auto_failover.
Apr 28 21:30:44 node_16 pg_autoctl[471]: 21:30:44 471 INFO  Started pg_autoctl postgres service with pid 503
Apr 28 21:30:44 node_16 pg_autoctl[503]: 21:30:44 503 INFO   /usr/bin/pg_autoctl do service postgres --pgdata /var/lib/postgresql/worker-3 -v
Apr 28 21:30:44 node_16 pg_autoctl[471]: 21:30:44 471 INFO  Started pg_autoctl node-active service with pid 505
Apr 28 21:30:44 node_16 pg_autoctl[505]: 21:30:44 505 INFO   /usr/bin/pg_autoctl do service node-active --pgdata /var/lib/postgresql/worker-3 -v
Apr 28 21:30:44 node_16 pg_autoctl[505]: 21:30:44 505 INFO  Reloaded the new configuration from "/var/lib/postgresql/.config/pg_autoctl/var/lib/postgresql/worker-3/pg_autoctl.cfg"
Apr 28 21:30:44 node_16 pg_autoctl[505]: 21:30:44 505 INFO  pg_autoctl service is running, current state is "draining"
Apr 28 21:30:45 node_16 pg_autoctl[505]: 21:30:45 505 INFO  Fetched current list of 2 other nodes from the monitor to update HBA rules, including 2 changes.
Apr 28 21:30:45 node_16 pg_autoctl[505]: 21:30:45 505 INFO  Ensuring HBA rules for node 17 "node_14" (116.*.*.*:5436)
Apr 28 21:30:45 node_16 pg_autoctl[505]: 21:30:45 505 INFO  Ensuring HBA rules for node 18 "node_15" (78.*.*.*:5436)

How can I recover from this? Did it happen because I performed a manual switchover, or could it also happen during normal operation? I'm testing the cluster for production use, and this has me worried about running it in production.

Navid2zp avatar Apr 28 '21 20:04 Navid2zp

Tried rebooting the node_15 server, but it's still the same. Tried pg_autoctl run:

$ pg_autoctl run --pgdata /var/lib/postgresql/worker-3
06:02:15 3707 FATAL An instance of pg_autoctl is already running with PID 470, as seen in pidfile "/tmp/pg_autoctl/var/lib/postgresql/worker-3/pg_autoctl.pid"
06:02:15 3707 FATAL Failed to setup pg_autoctl pidfile and signal handlers
06:02:15 3707 FATAL Failed to start pg_autoctl keeper service, see above for details

Navid2zp avatar Apr 29 '21 04:04 Navid2zp

Hmm, that looks like a bad situation indeed, and it might be a bug in pg_auto_failover. This part of the log shows the problem:

Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.320 CEST [3521673] LOG:  entering standby mode
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 FATAL 2021-04-28 22:06:09.320 CEST [3521673] FATAL:  requested timeline 8 is not a child of this server's history
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.320 CEST [3521673] DETAIL:  Latest checkpoint is at 4/E985EC90 on timeline 7, but in the history>
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.323 CEST [3521671] LOG:  startup process (PID 3521673) exited with exit code 1
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.323 CEST [3521671] LOG:  aborting startup due to startup process failure
Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.329 CEST [3521671] LOG:  database system is shut down

Postgres is unable to start because it cannot recover to the requested timeline.

This line is truncated:

Apr 28 22:06:19 node_15 pg_autoctl[3514691]: 22:06:19 3514691 ERROR 2021-04-28 22:06:09.320 CEST [3521673] DETAIL:  Latest checkpoint is at 4/E985EC90 on timeline 7, but in the history>

Could you share the full line from /var/lib/postgresql/worker-3/log/postgresql-2021-04-28_220609.log?

Your second error (after rebooting) suggests that pg_autoctl is already running. Are you sure it is not? If you are certain it isn't, it is safe to manually remove the PID file using:

rm /tmp/pg_autoctl/var/lib/postgresql/worker-3/pg_autoctl.pid
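One way to be "super sure" before removing the file is to check whether the PID recorded in it still belongs to a live process. This is a hypothetical helper sketch, not part of pg_auto_failover; the path is the one from the FATAL message above:

```shell
# Only remove the pidfile if the PID recorded in it is no longer alive.
PIDFILE=/tmp/pg_autoctl/var/lib/postgresql/worker-3/pg_autoctl.pid
if [ -f "$PIDFILE" ] && kill -0 "$(head -n1 "$PIDFILE")" 2>/dev/null; then
    echo "pg_autoctl is still running; stop the service first"
else
    echo "no live process found; pidfile is stale"
    rm -f "$PIDFILE"
fi
```

`kill -0` sends no signal; it only tests whether the process exists and is signalable, which is exactly the check pg_autoctl itself cannot perform once its pidfile handling has already failed.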

PS. Instead of switchover, you would probably want to use pg_autoctl perform promotion --name node_15 next time. That way it will fail over to the node that you specify.

JelteF avatar Apr 29 '21 08:04 JelteF

Also, if this is a prod cluster and you just want to recover as soon as possible, the easiest option is to drop the stuck node from the pg_autoctl cluster:

pg_autoctl drop node --hostname 78.*.*.* --pgport 5436 --pgdata /var/lib/postgresql/monitor

JelteF avatar Apr 29 '21 08:04 JelteF

Thanks for your reply @JelteF

Sorry about the truncated part; I missed it. Here are the full contents of /var/lib/postgresql/worker-3/log/postgresql-2021-04-28_220609.log:

2021-04-28 22:06:09.310 CEST [3521671] LOG:  starting PostgreSQL 13.2 (Ubuntu 13.2-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
2021-04-28 22:06:09.311 CEST [3521671] LOG:  listening on IPv4 address "0.0.0.0", port 5436
2021-04-28 22:06:09.311 CEST [3521671] LOG:  listening on IPv6 address "::", port 5436
2021-04-28 22:06:09.313 CEST [3521671] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5436"
2021-04-28 22:06:09.320 CEST [3521673] LOG:  database system was shut down at 2021-04-28 21:52:09 CEST
2021-04-28 22:06:09.320 CEST [3521673] WARNING:  specified neither primary_conninfo nor restore_command
2021-04-28 22:06:09.320 CEST [3521673] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2021-04-28 22:06:09.320 CEST [3521673] LOG:  entering standby mode
2021-04-28 22:06:09.320 CEST [3521673] FATAL:  requested timeline 8 is not a child of this server's history
2021-04-28 22:06:09.320 CEST [3521673] DETAIL:  Latest checkpoint is at 4/E985EC90 on timeline 7, but in the history of the requested timeline, the server forked off from that timeline at 4/E985E8D0.
2021-04-28 22:06:09.323 CEST [3521671] LOG:  startup process (PID 3521673) exited with exit code 1
2021-04-28 22:06:09.323 CEST [3521671] LOG:  aborting startup due to startup process failure
2021-04-28 22:06:09.329 CEST [3521671] LOG:  database system is shut down
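For readers following along, the DETAIL line makes the divergence concrete: timeline 8 is only a legal recovery target if this node's history ends at or before the point where timeline 8 forked off. A minimal sketch of the comparison (both LSNs share the same high word, 4, so comparing the low 32 bits is enough):

```shell
# LSN low words from the DETAIL line above: the latest local checkpoint
# on timeline 7 vs. the point where timeline 8 forked off.
checkpoint=$((16#E985EC90))
fork_point=$((16#E985E8D0))

if [ "$checkpoint" -gt "$fork_point" ]; then
    # node_15 replayed past the fork point on the old timeline, so it
    # can no longer follow timeline 8 by streaming alone.
    echo "diverged: checkpoint is past the fork point"
fi
```

This is why simply restarting node_15 cannot help: the data directory would typically need a pg_rewind against, or a fresh pg_basebackup from, the new primary before it can follow timeline 8.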

And for the second error, pg_autoctl was running. I forgot to stop the systemd service before running the command.

It's not a prod cluster; I'm mostly testing things out before using it in prod, but I do have some data that I'd prefer not to lose (it's okay if I have to). All the nodes are stuck at this point, but only 78.*.*.* is throwing errors. Will dropping this node help the other nodes?

Thanks for your help

Navid2zp avatar Apr 29 '21 08:04 Navid2zp

To help debug a bit more could you share the output of the following commands?

  1. pg_autoctl show state
  2. pg_autoctl show settings
  3. pg_autoctl show events

If you run the drop node command on a secondary while the old primary is still up, there is no data loss, since the primary has the most recent data. So if you simply want to continue, you can use that. The problem here is that because node 15 cannot start, the election never begins. Dropping it removes it from the cluster, so it no longer blocks the election.

We'll definitely look into this soon though, since this shouldn't happen.

JelteF avatar Apr 29 '21 13:04 JelteF

Here are the outputs you asked for:

state:

$ pg_autoctl show state --pgdata /var/lib/postgresql/monitor --formation worker-3
   Name |  Node |          Host:Port |        LSN |   Connection |       Current State |      Assigned State
--------+-------+--------------------+------------+--------------+---------------------+--------------------
node_14 |    17 | 116.*.*.*:5436 | 4/E98B3568 |       none ! |          report_lsn |          report_lsn
node_15 |    18 |  78.*.*.*:5436 | 4/E9000000 |  read-only ! |           secondary |          report_lsn
node_16 |    19 | 157.*.*.*:5436 | 4/E98B34F0 |       none ! |            draining |            draining

settings:

$ pg_autoctl show settings --pgdata /var/lib/postgresql/monitor --formation worker-3
  Context |     Name |              Setting | Value
----------+----------+----------------------+------
formation | worker-3 | number_sync_standbys | 1
     node |  node_14 |   candidate priority | 50
     node |  node_15 |   candidate priority | 50
     node |  node_16 |   candidate priority | 50
     node |  node_14 |   replication quorum | true
     node |  node_15 |   replication quorum | true
     node |  node_16 |   replication quorum | true

events:

$ pg_autoctl show events --pgdata /var/lib/postgresql/monitor --formation worker-3

                    Event Time |   Node |       Current State |      Assigned State | Comment
-------------------------------+--------+---------------------+---------------------+-----------
  2021-04-28 13:47:30.54886+00 |   0/17 |             demoted |          catchingup | Setting goal state of node 17 "node_14" (116.*.*.*:5436) to catchingup after it converged to demotion and node 19 "node_16" (157.*.*.*:5436) converged to primary.
 2021-04-28 13:48:03.597732+00 |   0/17 |          catchingup |          catchingup | New state is reported by node 17 "node_14" (116.*.*.*:5436): "catchingup"
 2021-04-28 13:48:03.853224+00 |   0/19 |             primary |             primary | New state is reported by node 19 "node_16" (157.*.*.*:5436): "primary"
 2021-04-28 13:48:03.915231+00 |   0/17 |          catchingup |           secondary | Setting goal state of node 17 "node_14" (116.*.*.*:5436) to secondary after it caught up.
 2021-04-28 13:48:04.047772+00 |   0/17 |           secondary |           secondary | New state is reported by node 17 "node_14" (116.*.*.*:5436): "secondary"
 2021-04-28 19:15:08.674059+00 |   0/19 |             primary |            draining | Setting goal state of node 19 "node_16" (157.*.*.*:5436)at LSN 4/E98B34F0 to draining after a user-initiated failover.
 2021-04-28 19:15:08.905275+00 |   0/19 |            draining |            draining | New state is reported by node 19 "node_16" (157.*.*.*:5436): "draining"
 2021-04-28 19:15:08.905275+00 |   0/17 |           secondary |          report_lsn | Setting goal state of node 17 "node_14" (116.*.*.*:5436) to report_lsn to find the failover candidate
 2021-04-28 19:15:08.905275+00 |   0/18 |           secondary |          report_lsn | Setting goal state of node 18 "node_15" (78.*.*.*:5436) to report_lsn to find the failover candidate
 2021-04-28 19:15:09.396321+00 |   0/17 |          report_lsn |          report_lsn | New state is reported by node 17 "node_14" (116.*.*.*:5436) with LSN 4/E98B3568: report_lsn

I will keep the cluster as is for a bit longer so I can provide more information in case you need it to find the problem. I will also go over the events and commands that led to the problem, to see if I missed anything or can find anything else that might help. Thanks again.

Navid2zp avatar Apr 29 '21 14:04 Navid2zp

Hi @Navid2zp,

Thanks for keeping this cluster around and helping us debug the situation. What I can see in the Postgres source code for this error message, combined with your logs, leads me to believe that node 15 may have missed the latest streamed WAL and timeline history files. Is it possible that the manual switchover started “too early”?

The node being reported in the “secondary” state means it should have been okay, so maybe that's where we have a bug: we might switch to the “secondary” state a little too fast, possibly opening a window where the node has not yet received the current timeline history from the new primary after a failover.

Could you paste a listing of the most recent files in the pg_wal directory of each node? A command such as ls -ltr | tail should do. Could you also paste the contents of the pg_wal/*.history files on each node?
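The requested inspection could be gathered on each node with something like the following sketch (the PGDATA path is taken from the logs above; adjust as needed):

```shell
# Inspect recent WAL activity and timeline history on this node.
PGDATA=${PGDATA:-/var/lib/postgresql/worker-3}
if [ -d "$PGDATA/pg_wal" ]; then
    # Ten most recently modified entries in pg_wal.
    ls -ltr "$PGDATA/pg_wal" | tail
    # Timeline history files: each line records a parent timeline,
    # the LSN where the new timeline forked off, and a reason.
    for f in "$PGDATA"/pg_wal/*.history; do
        [ -e "$f" ] || continue
        echo "== $f =="
        cat "$f"
    done
else
    echo "no pg_wal directory under $PGDATA"
fi
```

Comparing the fork LSNs in the history files across the three nodes would show which node is missing the timeline 8 entry.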

DimCitus avatar May 03 '21 12:05 DimCitus

Hi @DimCitus

If by “too early” you mean too soon, before all the servers were up and running, then no. I remember checking all the servers before the switchover and they were up. I even had two more instances on those servers and they were all running just fine. And nothing happened after the switchover failed, since at least two of those three servers were hosting a primary node.

Unfortunately, I already dropped the failed nodes and the cluster is now fine. The oldest file in the pg_wal directory is from May 1 14:02, so I'm not sure whether it would still help. Let me know if it might and I'll post it here.

Also worth mentioning: dropping only node_15 didn't help, and the other nodes were stuck just like before, but dropping node_14 (if I'm not mistaken) did the trick.

Navid2zp avatar May 03 '21 13:05 Navid2zp