
spec.CheckpointBeforePgrewind

lawrencejones opened this issue 5 years ago • 18 comments

Provide a configuration parameter that enables issuing a CHECKPOINT prior to starting a pg_rewind, ensuring the control file is up to date with the latest timeline information before the rewind process starts.

When this setting is disabled, it's possible for a newly promoted master to have a control file with a timeline from before the promotion occurred. If a follower tries to resync using rewind against this database, it will quit without doing anything: seeing no timeline deviation when it reads the control file, it reports "no rewind required".

Some users run large databases where the cost of a basebackup is many times more expensive than rewind. These users would probably trade the performance impact of immediate checkpointing for faster recovery of the standby, especially for those using synchronous replication who may be unable to accept writes until the standby is recovered.

For now, this change will at least enable a "rewind at all costs" option as opposed to the "maybe rewind if possible" that existed before. Future work might support a retry timeout for rewind operations, for those users who don't want the performance hit but do value rewinding over basebackups. We could also use the backup API to trigger a checkpoint with the target Postgres spread configuration, then retry rewinding until the checkpoint_timeout has elapsed.
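For a feel for the shape of the change, here is a minimal, hypothetical sketch of the keeper-side logic when the option is enabled: connect to the newly promoted primary, force an immediate CHECKPOINT so its control file carries the post-promotion timeline, then run pg_rewind against it. This is not the PR's actual code; the function names, connection handling and error paths are simplified and assumed.

    package main

    import (
        "database/sql"
        "fmt"
        "os/exec"

        _ "github.com/lib/pq" // Postgres driver
    )

    // checkpointAndRewind is an illustrative sketch only.
    func checkpointAndRewind(primaryConnString, dataDir string) error {
        db, err := sql.Open("postgres", primaryConnString)
        if err != nil {
            return fmt.Errorf("connecting to primary: %w", err)
        }
        defer db.Close()

        // A CHECKPOINT issued over SQL is always immediate (not spread), which is
        // exactly the performance trade-off described above.
        if _, err := db.Exec("CHECKPOINT"); err != nil {
            return fmt.Errorf("issuing checkpoint on primary: %w", err)
        }

        cmd := exec.Command("pg_rewind",
            "--target-pgdata", dataDir,
            "--source-server", primaryConnString,
        )
        if out, err := cmd.CombinedOutput(); err != nil {
            return fmt.Errorf("pg_rewind failed: %w: %s", err, out)
        }
        return nil
    }

    func main() {
        // Placeholder connection details and data directory.
        if err := checkpointAndRewind(
            "host=10.0.0.1 user=stolon dbname=postgres sslmode=prefer",
            "/stolon-data/postgres",
        ); err != nil {
            fmt.Println(err)
        }
    }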


I think this is in a good place now. I've been running these tests like so:

$ STCTL_BIN="$(pwd)/bin/stolonctl" STKEEPER_BIN="$(pwd)/bin/stolon-keeper" STSENTINEL_BIN="$(pwd)/bin/stolon-sentinel" STDCTL_BIN="$(pwd)/bin/stolonctl" STPROXY_BIN="$(pwd)/bin/stolon-proxy" STOLON_TEST_STORE_BACKEND=etcdv3 ETCD_BIN=$(which etcd) go test tests/integration/*.go -v -test.run=TestTimelineForkPgRewindCheckpoint -timeout=3m

If you build the stolon binaries from master (which will silently ignore the CheckpointBeforePgrewind configuration parameter), you'll see the following failure:

    utils.go:260: [keeper fdc71fc5]: 2019-06-12T17:26:45.669+0100       INFO    cmd/keeper.go:824       syncing using pg_rewind {"followedDB": "d6a67b88", "keeper": "e3414a18"}                                                                                                                                                                                                                                 
    utils.go:260: [keeper fdc71fc5]: 2019-06-12T17:26:45.669+0100       INFO    postgresql/postgresql.go:789    running pg_rewind                                                                                                                                                                                                                                                                                
    ha_test.go:1275: keeper tried rewinding but rewind thought it was not required: fdc71fc5                                                                                                                                                                                                                                                                                                                     
    utils.go:260: [keeper fdc71fc5]: fetched file "global/pg_control", length 8192                                                                                                                                                                                                                                                                                                                               
    utils.go:260: [keeper fdc71fc5]: source and target cluster are on the same timeline                                                                                                                                                                                                                                                                                                                          
    utils.go:260: [keeper fdc71fc5]: no rewind required   

Running with binaries built against this branch, the tests pass and the log output looks like this:

    ha_test.go:1213: Waiting 2 * sleepInterval for checkpoint configuration to be applied
    ha_test.go:1216: Creating lots of un-CHECKPOINT'd data
    ha_test.go:1221: Write token 1 to master: f0333453
    ha_test.go:1234: SIGSTOPping current standby keeper: 17d6df63
    ha_test.go:1238: SIGSTOPping current standby wal receivers: 17d6df63
    ha_test.go:1243: Writing token 2 to master, advancing current timeline: f0333453
    ha_test.go:1248: Stopping master: f0333453
    ha_test.go:1255: SIGKILLing current standby wal receivers: 17d6df63
    ha_test.go:1261: SIGCONTing current standby keeper: 17d6df63
    ha_test.go:1267: Waiting for new master to assume role: 17d6df63
    ha_test.go:1272: Write token 3 to new master, advancing current timeline: 17d6df63
    ha_test.go:1279: Starting old master and verifying it successfully rewinds: f0333453
    utils.go:260: [keeper f0333453]: 2019-06-12T17:59:57.228+0100       INFO    cmd/keeper.go:844       syncing using pg_rewind {"followedDB": "51bca7ec", "keeper": "17d6df63", "forcingCheckpoint": true}
    utils.go:260: [keeper f0333453]: 2019-06-12T17:59:57.229+0100       INFO    postgresql/postgresql.go:804    issuing checkpoint on primary
    utils.go:260: [keeper 17d6df63]: 2019-06-12 17:59:57.293 BST [23830] LOG:  checkpoint complete: wrote 1671 buffers (10.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=4.847 s, sync=0.014 s, total=4.863 s; sync files=335, longest=0.000 s, average=0.000 s; distance=34075 kB, estimate=34075 kB
    utils.go:260: [keeper 17d6df63]: 2019-06-12 17:59:57.295 BST [23830] LOG:  checkpoint starting: immediate force wait
    utils.go:260: [keeper 17d6df63]: 2019-06-12 17:59:57.296 BST [23830] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=30668 kB
    utils.go:260: [keeper f0333453]: CHECKPOINT
    utils.go:260: [keeper f0333453]: 2019-06-12T17:59:57.351+0100       INFO    postgresql/postgresql.go:815    running pg_rewind
    utils.go:260: [keeper f0333453]: fetched file "global/pg_control", length 8192
    utils.go:260: [keeper f0333453]: fetched file "pg_wal/00000002.history", length 41
    utils.go:260: [keeper f0333453]: Source timeline history:
    utils.go:260: [keeper f0333453]: Target timeline history:
    utils.go:260: [keeper f0333453]: 1: 0/0 - 0/0
    utils.go:260: [keeper f0333453]: servers diverged at WAL location 0/4146FE0 on timeline 1
    utils.go:260: [keeper f0333453]: rewinding from last common checkpoint at 0/2000060 on timeline 1
    ha_test.go:1304: Waiting for the old master to receive token 3 from forked timeline: f0333453

lawrencejones avatar May 17 '19 10:05 lawrencejones

This actually addresses #601, though not via any of the remediations recommended in that issue.

lawrencejones avatar May 17 '19 12:05 lawrencejones

I've also experienced this issue. @sgotti, why is wal_keep_segments hard-coded to a value as small as 8, and can I increase it manually in the code as a temporary workaround for this problem?

rearden-steel avatar May 29 '19 17:05 rearden-steel

wal_keep_segments

How do I change wal_keep_segments?

Widerstehen avatar Jun 05 '19 06:06 Widerstehen

Hey @rearden-steel @frank111, I'm not sure how this is related to the wal_keep_segments GUC. This change is about recovering from forked timelines rather than having insufficient WAL to perform a recovery.

can I increase it manually in code

We've done this in our stolon fork, as we also needed a higher value than 8. It's safe to do, but I'd recommend pushing for https://github.com/sorintlab/stolon/pull/591 so we can get this solved for everyone.

lawrencejones avatar Jun 05 '19 08:06 lawrencejones

Hey @lawrencejones! Is there any chance you could make the requested changes? We're experiencing issues that this PR should fix.

rearden-steel avatar Jun 11 '19 11:06 rearden-steel

Hey @rearden-steel, I'm midway through making this work. It needs integration tests, which are what I've been working on, and a configurable parameter that enables/disables the checkpointing.

I'm out at a conference today/tomorrow but this is right at the top of my to-do list come Thursday. I'm going to push the WIP changes now so you can see them, and caveat this update by saying I'm explicitly ignoring Simone's advice until I've got the tests working (at which point I'll act on it).

lawrencejones avatar Jun 11 '19 13:06 lawrencejones

From where I got to last time, it's difficult to simulate a Postgres that takes a while to checkpoint after a promotion. While pg_ctl promote won't force a synchronous checkpoint, it will schedule one, and in the test databases that scheduled checkpoint completes almost instantaneously, so the control file is updated immediately.

Hopefully you can wait a couple of days for me to get back to this?

lawrencejones avatar Jun 11 '19 13:06 lawrencejones

it's difficult to simulate a Postgres that takes a while to checkpoint after a promotion

I was able to reach such a situation when doing a manual failover while running pgbench (-c 100 -T 120 -j 20) on a cluster with a pre-generated database of around 80 GB (pgbench -i -s 5000).
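For anyone else trying to reproduce this, that recipe roughly translates to the following; the hosts, ports and the exact way the failover is triggered are deployment-specific assumptions rather than part of the original report:

    # Pre-generate a large database (roughly 80 GB at scale factor 5000)
    pgbench -i -s 5000 -h <proxy-host> -p <proxy-port> -U postgres postgres

    # Generate a heavy, largely un-checkpointed write load
    pgbench -c 100 -j 20 -T 120 -h <proxy-host> -p <proxy-port> -U postgres postgres

    # While the load is running, trigger a failover by your usual means (for
    # example by stopping the current master's keeper process). Without the
    # patch, the old master's pg_rewind can report "no rewind required" even
    # though the timelines have diverged.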

Hopefully you can wait a couple of days for me to get back to this?

Sure! Thank you for your work!

rearden-steel avatar Jun 11 '19 14:06 rearden-steel

@sgotti @rearden-steel this is ready for review now.

Instead of adding this logic into an existing test, I've created another test with just two keepers that verifies we're safe from this specific race. The details of that test can be seen here: https://github.com/sorintlab/stolon/blob/bd61324616e4e501ec511e10f27e2c09fac8e130/tests/integration/ha_test.go#L1174-L1187

I've added some test helpers and I'm going to go through the diff and comment where things may not be totally clear.

Any questions, please do ask!

lawrencejones avatar Jun 12 '19 17:06 lawrencejones

Hey @sgotti! I've been rubbish here: PRs were left hanging while I went on holiday 🏖

I'm going to reply to each bit of your review separately. Hopefully we can reason our way to a conclusion!

Besides the inline comments, one thing that I'd like to avoid at all costs is issuing "write" commands from one keeper to another instance, since it looks quite dirty (it goes against the principle that a postgres should be managed only by its keeper, and, in this case, every resync will issue an unneeded checkpoint).

So, shouldn't it be possible to just issue a checkpoint after a promote from the primary keeper itself?

Postgres will automatically begin a checkpoint whenever a promote happens, but it will do so in a spread fashion (i.e., using all the checkpointing configuration parameters to throttle it). If we were to get the keeper to issue a checkpoint itself after promotion (optionally, via a configuration parameter similar to the one in this PR) then we'd force the master to checkpoint faster, which is good, but...

At the same time as the master checkpoints, our standbys will attempt to resync against it. Until the checkpoint completes, the pg_rewind could fail and we'd fall back to pg_basebackup, which is what we really want to avoid.

If we really want to avoid this, then a possible solution would be to provide this checkpoint-after-promote functionality we've been discussing, but additionally have keepers retry pg_rewind for a specific time threshold. A user could then set the retry timeout to be 5m, for example, and enable checkpoint-after-promote to ensure the rewind will be retried after the checkpoint is expected to be completed.
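Purely as an illustration of that retry idea (this is not what the PR implements; the helper name, the callback and the intervals are made up), the keeper-side loop could look something like:

    package main

    import (
        "fmt"
        "time"
    )

    // tryRewindWithRetry sketches the "retry pg_rewind until a deadline" idea:
    // keep retrying the rewind-based resync while the primary's spread
    // checkpoint makes progress, and only give up (falling back to
    // pg_basebackup) once the deadline has passed.
    func tryRewindWithRetry(runRewind func() error, deadline, interval time.Duration) error {
        start := time.Now()
        for {
            err := runRewind()
            if err == nil {
                return nil // resync via pg_rewind succeeded
            }
            if time.Since(start) > deadline {
                return fmt.Errorf("giving up on pg_rewind after %s: %w", deadline, err)
            }
            time.Sleep(interval) // wait for the primary's checkpoint to make progress
        }
    }

    func main() {
        // Example: in the keeper this callback would shell out to pg_rewind;
        // here it succeeds immediately.
        err := tryRewindWithRetry(func() error { return nil }, 5*time.Minute, 10*time.Second)
        fmt.Println(err)
    }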

The main issue I see here is that there'll be a window between issuing the promote and finishing the checkpoint in which the keeper has already reported its state to the sentinel and the sentinel has already updated the old primary's follow config.

One possible solution to fix this and other related issues would be to report to the sentinel the current checkpoint status after promote, or the checkpoint timeline (a new field in cluster.go's DBStatus), so the sentinel sets the old primary's followConfig only when it's sure it's in a good state for a pg_rewind.

It's a bit sad that we have to do this, if you know what I mean? pg_basebackup has an option to wait for a checkpoint to complete before starting, so it's a shame that pg_rewind doesn't also offer one. Perhaps we could simulate it by creating a backup label, as that appears to be what pg_basebackup does internally? Do you have any thoughts on this?
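To make the comparison concrete: pg_basebackup can ask the source for an immediate checkpoint via a flag, whereas with pg_rewind the checkpoint has to be issued out-of-band, which is essentially what this PR automates. The connection details below are placeholders:

    # pg_basebackup can request an immediate checkpoint on the source itself:
    pg_basebackup -D /tmp/backup --checkpoint=fast -h <primary-host> -U <repl-user>

    # pg_rewind has no equivalent flag, so the checkpoint has to be issued
    # manually against the new primary before rewinding:
    psql "host=<primary-host> user=stolon dbname=postgres" -c 'CHECKPOINT'
    pg_rewind --target-pgdata /stolon-data/postgres \
        --source-server "host=<primary-host> user=stolon dbname=postgres"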

lawrencejones avatar Jul 05 '19 14:07 lawrencejones

@sgotti I know you're probably busy on a lot of other things right now, but would really appreciate your feedback for this change. It matters quite a lot to us that rewinds are working, and this change makes that the case.

If we want to take another path that's great, but we'll need to know what it is before we can get on with implementing it!

lawrencejones avatar Aug 19 '19 11:08 lawrencejones

@lawrencejones Even if, theoretically, I'd prefer a solution that doesn't require one keeper performing operations against another, in the end this is already done when doing a pg_basebackup or a pg_rewind.

So I don't want to hold up this PR trying to find a cleanliness that's not really possible. Let's go ahead with your solution! (In the end we are just doing what pg_rewind should do itself: a checkpoint, like pg_basebackup already does.)

P.S. We should report this upstream if not already done.

sgotti avatar Aug 23 '19 13:08 sgotti

Hey @sgotti!

So I don't want to hold up this PR trying to find a cleanliness that's not really possible. Let's go ahead with your solution! (In the end we are just doing what pg_rewind should do itself: a checkpoint, like pg_basebackup already does.)

Makes sense to me :)

I've adjusted the doc comment to make it clearer we'll be hitting the primary. All your other review comments were questions/have been addressed, I think.

Can I get another look? Thanks.

lawrencejones avatar Aug 27 '19 11:08 lawrencejones

Hi team, I'm wondering whether there are any updates on this? Is there anything I can help with to get this merged?

viggy28 avatar Jun 23 '20 00:06 viggy28

@viggy28 If no one is going to work on this PR (@lawrencejones), feel free to adopt and update it.

sgotti avatar Jun 27 '20 11:06 sgotti

Hey @viggy28, if you want to take this over then please let me know. I’m not working on stolon things right now so if you need this, I’ll be happy to hand it over!

lawrencejones avatar Jun 30 '20 19:06 lawrencejones

Sure @lawrencejones. Looks like you have done most of the work. Thanks much.

I'm trying to understand what is needed:

  1. https://github.com/sorintlab/stolon/pull/644/files#r293416129
  2. https://github.com/sorintlab/stolon/pull/644#pullrequestreview-249368918 is unresolved, but @sgotti also mentioned here https://github.com/sorintlab/stolon/pull/644#issuecomment-524313267 that it shouldn't be a blocker, so does that mean we can resolve it?
  3. Looks like the build is failing.

Anything else I am missing?

viggy28 avatar Jul 01 '20 04:07 viggy28

Dear contributors, we have a stolon cluster that is so big that recovering with a basebackup takes hours. I applied this change to the latest master and ran it in our test environment. The standby keepers keep failing after pg_rewind completes:


pg_rewind: Done!
2024-02-27T09:24:12.615Z	INFO	cmd/keeper.go:928	syncing using pg_rewind	{"masterDB": "1f5bbc57", "keeper": "keeper0"}
2024-02-27T09:24:12.636Z	INFO	postgresql/postgresql.go:921	issuing checkpoint on primary
CHECKPOINT
2024-02-27T09:24:12.993Z	INFO	postgresql/postgresql.go:932	running pg_rewind
2024-02-27T09:24:12.994Z	DEBUG	postgresql/postgresql.go:936	execing cmd	{"cmd": "/usr/lib/postgresql/14/bin/pg_rewind --debug -D /stolon-data/postgres --source-server=application_name=stolon_7805b990 dbname=zabbix host=172.16.250.153 options=-c\\ synchronous_commit=off password=******************** port=5432 sslmode=prefer user=stolon"}
pg_rewind: executing "/usr/lib/postgresql/14/bin/postgres" for target server to complete crash recovery
2024-02-27 09:24:13.068 GMT [1590] LOG:  skipping missing configuration file "/stolon-data/postgres/postgresql.auto.conf"
2024-02-27T09:24:13.173Z	ERROR	cmd/keeper.go:736	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2024-02-27T09:24:15.675Z	ERROR	cmd/keeper.go:736	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2024-02-27 09:24:16.349 UTC [1590] LOG:  database system was interrupted while in recovery at log time 2024-02-27 09:23:11 UTC
2024-02-27 09:24:16.349 UTC [1590] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2024-02-27 09:24:16.455 UTC [1590] LOG:  redo starts at B/280161F8
2024-02-27 09:24:16.556 UTC [1590] LOG:  invalid record length at B/28182E08: wanted 24, got 0
2024-02-27 09:24:16.556 UTC [1590] LOG:  redo done at B/28182DE0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.10 s

PostgreSQL stand-alone backend 14.0 (Debian 14.0-1.pgdg110+1)
2024-02-27T09:24:17.548Z	DEBUG	postgresql/postgresql.go:625	execing cmd	{"cmd": "/usr/lib/postgresql/14/bin/postgres -V"}
2024-02-27T09:24:17.561Z	DEBUG	cmd/keeper.go:815	pgstate dump: (*cluster.PostgresState)(0xc0008f84d0)({
 UID: (string) (len=8) "7805b990",
 Generation: (int64) 0,
 ListenAddress: (string) (len=14) "172.16.115.237",
 Port: (string) (len=4) "5432",
 Healthy: (bool) false,
 SystemID: (string) "",
 TimelineID: (uint64) 0,
 XLogPos: (uint64) 0,
 TimelinesHistory: (cluster.PostgresTimelinesHistory) <nil>,
 PGParameters: (common.Parameters) <nil>,
 SynchronousStandbys: ([]string) <nil>,
 OlderWalFile: (string) ""
})

2024-02-27T09:24:18.177Z	ERROR	cmd/keeper.go:736	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
pg_rewind: fetched file "global/pg_control", length 8192
pg_rewind: source and target cluster are on the same timeline
pg_rewind: no rewind required
2024-02-27T09:24:18.222Z	DEBUG	postgresql/postgresql.go:625	execing cmd	{"cmd": "/usr/lib/postgresql/14/bin/postgres -V"}
2024-02-27T09:24:18.324Z	INFO	postgresql/postgresql.go:844	writing standby signal file
2024-02-27T09:24:18.368Z	INFO	postgresql/postgresql.go:319	starting database
2024-02-27T09:24:18.368Z	DEBUG	postgresql/postgresql.go:323	execing cmd	{"cmd": "/usr/lib/postgresql/14/bin/postgres -D /stolon-data/postgres -c unix_socket_directories=/tmp"}
2024-02-27T09:24:20.682Z	ERROR	cmd/keeper.go:736	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2024-02-27 09:24:21.540 UTC [1593] LOG:  starting PostgreSQL 14.0 (Debian 14.0-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2024-02-27 09:24:21.540 UTC [1593] LOG:  listening on IPv4 address "172.16.115.237", port 5432
2024-02-27 09:24:21.589 UTC [1593] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2024-02-27 09:24:22.064 UTC [1594] LOG:  database system was shut down at 2024-02-27 09:24:17 UTC
2024-02-27 09:24:22.064 UTC [1595] FATAL:  the database system is starting up
2024-02-27 09:24:22.085 UTC [1594] LOG:  entering standby mode
2024-02-27 09:24:22.221 UTC [1594] LOG:  invalid record length at B/28182EF8: wanted 24, got 0
2024-02-27 09:24:22.245 UTC [1598] LOG:  started streaming WAL from primary at B/28000000 on timeline 31
2024-02-27 09:24:22.267 UTC [1599] FATAL:  the database system is not yet accepting connections
2024-02-27 09:24:22.267 UTC [1599] DETAIL:  Consistent recovery state has not been yet reached.
2024-02-27 09:24:22.271 UTC [1594] LOG:  invalid resource manager ID 48 at B/28182EF8
2024-02-27 09:24:22.279 UTC [1598] FATAL:  terminating walreceiver process due to administrator command
2024-02-27 09:24:22.279 UTC [1594] LOG:  invalid resource manager ID 48 at B/28182EF8
2024-02-27 09:24:22.280 UTC [1594] LOG:  invalid resource manager ID 48 at B/28182EF8
2024-02-27 09:24:22.469 UTC [1601] FATAL:  the database system is not yet accepting connections
2024-02-27 09:24:22.469 UTC [1601] DETAIL:  Consistent recovery state has not been yet reached.
2024-02-27T09:24:22.583Z	DEBUG	postgresql/postgresql.go:625	execing cmd	{"cmd": "/usr/lib/postgresql/14/bin/postgres -V"}
2024-02-27T09:24:22.593Z	DEBUG	cmd/keeper.go:815	pgstate dump: (*cluster.PostgresState)(0xc0006ac630)({
 UID: (string) (len=8) "7805b990",
 Generation: (int64) 0,
 ListenAddress: (string) (len=14) "172.16.115.237",
 Port: (string) (len=4) "5432",
 Healthy: (bool) false,
 SystemID: (string) "",
 TimelineID: (uint64) 0,
 XLogPos: (uint64) 0,
 TimelinesHistory: (cluster.PostgresTimelinesHistory) <nil>,
 PGParameters: (common.Parameters) <nil>,
 SynchronousStandbys: ([]string) <nil>,
 OlderWalFile: (string) ""
})

2024-02-27 09:24:22.671 UTC [1603] FATAL:  the database system is not yet accepting connections
2024-02-27 09:24:22.671 UTC [1603] DETAIL:  Consistent recovery state has not been yet reached.
2024-02-27 09:24:22.873 UTC [1604] FATAL:  the database system is not yet accepting connections
2024-02-27 09:24:22.873 UTC [1604] DETAIL:  Consistent recovery state has not been yet reached.
2024-02-27 09:24:23.094 UTC [1605] FATAL:  the database system is not yet accepting connections
2024-02-27 09:24:23.094 UTC [1605] DETAIL:  Consistent recovery state has not been yet reached.
2024-02-27 09:24:23.185 UTC [1606] FATAL:  the database system is not yet accepting connections
2024-02-27 09:24:23.185 UTC [1606] DETAIL:  Consistent recovery state has not been yet reached.
2024-02-27T09:24:23.185Z	ERROR	cmd/keeper.go:736	cannot get configured pg parameters	{"error": "pq: the database system is not yet accepting connections"}
2024-02-27 09:24:23.297 UTC [1607] FATAL:  the database system is not yet accepting connections
2024-02-27 09:24:23.297 UTC [1607] DETAIL:  Consistent recovery state has not been yet reached.
2024-02-27 09:24:23.499 UTC [1608] FATAL:  the database system is not yet accepting connections

Is anybody still working on this issue?

pirateunclejack avatar Feb 27 '24 10:02 pirateunclejack