spec.CheckpointBeforePgrewind
Provide a configuration parameter that enables issuing a CHECKPOINT on the primary prior to starting a pg_rewind, ensuring the control file is up to date with the latest timeline information before the rewind process starts.
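A minimal sketch of how the new spec field might look; the exact json tag, pointer type, and placement in the cluster spec are my assumptions, not something this PR confirms:

type ClusterSpec struct {
    // ... existing fields ...

    // CheckpointBeforePgrewind, when enabled, makes a keeper issue a
    // CHECKPOINT on the primary before running pg_rewind, so that the
    // primary's control file reflects the post-promotion timeline.
    CheckpointBeforePgrewind *bool `json:"checkpointBeforePgrewind,omitempty"`
}

Assuming a field along those lines, it would presumably be enabled with something like stolonctl update --patch '{"checkpointBeforePgrewind": true}'.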
When this setting is disabled, it's possible for a newly promoted master to have a control file with a timeline from before the promotion occurred. If a follower tries to resync using rewind against this database, it will quit without doing anything: it sees no timeline deviation when it reads the control file and reports "no rewind required".
Some users run large databases where the cost of a basebackup is many times that of a rewind. These users would probably trade the performance impact of immediate checkpointing for faster recovery of the standby, especially those using synchronous replication, who may be unable to accept writes until the standby is recovered.
For now, this change will at least enable a "rewind at all costs" option as opposed to the "maybe rewind if possible" that existed before. Future work might support a retry timeout for rewind operations, for those users who don't want the performance hit but do value rewinding over basebackups. We could also use the backup API to trigger a checkpoint with the target Postgres spread configuration, then retry rewinding until the checkpoint_timeout has elapsed.
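To make the mechanism concrete, here is a rough Go sketch of the checkpoint-then-rewind flow under this option. Everything here (function name, arguments, flags) is illustrative rather than the actual keeper code:

package main

import (
    "context"
    "database/sql"
    "os"
    "os/exec"

    _ "github.com/lib/pq" // postgres driver
)

// syncUsingPgRewind is a hypothetical sketch: optionally force a CHECKPOINT
// on the primary we are about to rewind against, then run pg_rewind.
func syncUsingPgRewind(ctx context.Context, primaryConnString, dataDir string, checkpointBeforePgrewind bool) error {
    if checkpointBeforePgrewind {
        db, err := sql.Open("postgres", primaryConnString)
        if err != nil {
            return err
        }
        defer db.Close()
        // An explicit CHECKPOINT is immediate, so the primary's control file
        // picks up the timeline created by the promotion before pg_rewind
        // reads it.
        if _, err := db.ExecContext(ctx, "CHECKPOINT"); err != nil {
            return err
        }
    }
    // pg_rewind compares global/pg_control on source and target; with the
    // checkpoint done it sees the diverged timeline and performs the rewind.
    cmd := exec.CommandContext(ctx, "pg_rewind",
        "-D", dataDir,
        "--source-server="+primaryConnString)
    cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
    return cmd.Run()
}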
I think this is in a good place now. I've been running these tests like so:
$ STCTL_BIN="$(pwd)/bin/stolonctl" STKEEPER_BIN="$(pwd)/bin/stolon-keeper" STSENTINEL_BIN="$(pwd)/bin/stolon-sentinel" STDCTL_BIN="$(pwd)/bin/stolonctl" STPROXY_BIN="$(pwd)/bin/stolon-proxy" STOLON_TEST_STORE_BACKEND=etcdv3 ETCD_BIN=$(which etcd) go test tests/integration/*.go -v -test.run=TestTimelineForkPgRewindCheckpoint -timeout=3m
If you build the stolon binaries from master (which will silently ignore the CheckpointBeforePgrewind configuration parameter), you'll see the following failure:
utils.go:260: [keeper fdc71fc5]: 2019-06-12T17:26:45.669+0100 INFO cmd/keeper.go:824 syncing using pg_rewind {"followedDB": "d6a67b88", "keeper": "e3414a18"}
utils.go:260: [keeper fdc71fc5]: 2019-06-12T17:26:45.669+0100 INFO postgresql/postgresql.go:789 running pg_rewind
ha_test.go:1275: keeper tried rewinding but rewind thought it was not required: fdc71fc5
utils.go:260: [keeper fdc71fc5]: fetched file "global/pg_control", length 8192
utils.go:260: [keeper fdc71fc5]: source and target cluster are on the same timeline
utils.go:260: [keeper fdc71fc5]: no rewind required
Running with binaries built against this branch, the tests pass and the log output looks like this:
ha_test.go:1213: Waiting 2 * sleepInterval for checkpoint configuration to be applied
ha_test.go:1216: Creating lots of un-CHECKPOINT'd data
ha_test.go:1221: Write token 1 to master: f0333453
ha_test.go:1234: SIGSTOPping current standby keeper: 17d6df63
ha_test.go:1238: SIGSTOPping current standby wal receivers: 17d6df63
ha_test.go:1243: Writing token 2 to master, advancing current timeline: f0333453
ha_test.go:1248: Stopping master: f0333453
ha_test.go:1255: SIGKILLing current standby wal receivers: 17d6df63
ha_test.go:1261: SIGCONTing current standby keeper: 17d6df63
ha_test.go:1267: Waiting for new master to assume role: 17d6df63
ha_test.go:1272: Write token 3 to new master, advancing current timeline: 17d6df63
ha_test.go:1279: Starting old master and verifying it successfully rewinds: f0333453
utils.go:260: [keeper f0333453]: 2019-06-12T17:59:57.228+0100 INFO cmd/keeper.go:844 syncing using pg_rewind {"followedDB": "51bca7ec", "keeper": "17d6df63", "forcingCheckpoint": true}
utils.go:260: [keeper f0333453]: 2019-06-12T17:59:57.229+0100 INFO postgresql/postgresql.go:804 issuing checkpoint on primary
utils.go:260: [keeper 17d6df63]: 2019-06-12 17:59:57.293 BST [23830] LOG: checkpoint complete: wrote 1671 buffers (10.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=4.847 s, sync=0.014 s, total=4.863 s; sync files=335, longest=0.000 s, average=0.000 s; distance=34075 kB, estimate=34075 kB
utils.go:260: [keeper 17d6df63]: 2019-06-12 17:59:57.295 BST [23830] LOG: checkpoint starting: immediate force wait
utils.go:260: [keeper 17d6df63]: 2019-06-12 17:59:57.296 BST [23830] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=30668 kB
utils.go:260: [keeper f0333453]: CHECKPOINT
utils.go:260: [keeper f0333453]: 2019-06-12T17:59:57.351+0100 INFO postgresql/postgresql.go:815 running pg_rewind
utils.go:260: [keeper f0333453]: fetched file "global/pg_control", length 8192
utils.go:260: [keeper f0333453]: fetched file "pg_wal/00000002.history", length 41
utils.go:260: [keeper f0333453]: Source timeline history:
utils.go:260: [keeper f0333453]: Target timeline history:
utils.go:260: [keeper f0333453]: 1: 0/0 - 0/0
utils.go:260: [keeper f0333453]: servers diverged at WAL location 0/4146FE0 on timeline 1
utils.go:260: [keeper f0333453]: rewinding from last common checkpoint at 0/2000060 on timeline 1
ha_test.go:1304: Waiting for the old master to receive token 3 from forked timeline: f0333453
This actually addresses #601, though not via any of the remediations recommended in that issue.
I've also experienced this issue. @sgotti, why is wal_keep_segments hard-coded to such a small value as 8, and can I increase it manually in code as a temporary workaround for this problem?
wal_keep_segments
How do I change wal_keep_segments?
Hey @rearden-steel @frank111, I'm not sure how this is related to the wal_keep_segments GUC? This change is about recovering from forked timelines rather than having insufficient WAL to perform a recovery.
can I increase it manually in code
We've done so in our stolon fork as we also needed a higher value than 8. It's safe to do so, but I'd recommend pushing https://github.com/sorintlab/stolon/pull/591 so we can get this solved for everyone.
Hey @lawrencejones! Is there any chance to make the requested changes? We experience issues which should be fixed with this PR.
Hey @rearden-steel, I'm midway through making this work. It needs integration tests, which are what I've been working on, and a configurable variable that enables/disables checkpointing.
I'm out at a conference today/tomorrow but this is right on my to-do list come Thursday. I'm going to push the WIP changes now so you can see them, and caveat this update by saying I'm explicitly ignoring Simone's advice until I've got the tests working (at which point I'll action them).
From where I got to last time, it's difficult to simulate a Postgres that takes a while to checkpoint after a promotion. While pg_ctl promote won't force a synchronous checkpoint, it will schedule one, and in the test databases that scheduled checkpoint is almost instantaneous, leading to the control file being updated immediately.
Hopefully you can wait a couple of days for me to get back to this?
it's difficult to simulate a Postgres that takes a while to checkpoint after a promotion
I was able to reach this situation when doing a manual failover while running pgbench (-c 100 -T 120 -j 20) on a cluster with a pre-generated database of around 80 GB (pgbench -i -s 5000).
Hopefully you can wait a couple of days for me to get back to this?
Sure! Thank you for your work!
@sgotti @rearden-steel this is ready for review now.
Instead of adding this logic into an existing test, I've created another test with just two keepers that verifies we're safe from this specific race. The details of that test can be seen here: https://github.com/sorintlab/stolon/blob/bd61324616e4e501ec511e10f27e2c09fac8e130/tests/integration/ha_test.go#L1174-L1187
I've added some test helpers and I'm going to go through the diff and comment where things may not be totally clear.
Any questions, please do ask!
Hey @sgotti! I've been rubbish here: PRs were left hanging while I went on holiday 🏖
I'm going to reply to each bit of your review separately. Hopefully we can reason our way to a conclusion!
Besides the inline comments, one thing that I'd like to avoid at all costs is issuing "write" commands from one keeper to another instance, since it looks quite dirty (it goes against the principle that a postgres instance should be managed only by its own keeper, and, in this case, every resync will issue an unneeded checkpoint).
So, shouldn't it be possible for the primary keeper to just issue a checkpoint after a promote?
Postgres will automatically begin a checkpoint whenever a promote happens, but it will do so in a spread fashion (i.e. using the checkpoint configuration parameters to throttle the checkpoint's I/O). If we were to get the keeper to issue a checkpoint itself after promotion (optionally, via a configuration parameter similar to this PR's), then we'd force the master to checkpoint faster, which is good, but...
At the same time as the master checkpoints, our slaves will attempt to resync against it. Until the checkpoint completes the pg_rewind could fail, and we'd fall back to pg_basebackup, which is what we really want to avoid.
If we really want to avoid this, then a possible solution would be to provide this checkpoint-after-promote functionality we've been discussing, but additionally have keepers retry pg_rewind for a specific time threshold. A user could then set the retry timeout to be 5m, for example, and enable checkpoint-after-promote to ensure the rewind will be retried after the checkpoint is expected to be completed.
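Purely as an illustration of that retry idea (nothing like this exists in this PR; the names and the fallback decision are hypothetical):

package main

import (
    "context"
    "time"
)

// rewindWithRetry keeps attempting pg_rewind until it succeeds or a
// user-configured deadline expires; after that the caller would fall back
// to pg_basebackup. Hypothetical sketch only.
func rewindWithRetry(ctx context.Context, timeout, interval time.Duration, tryRewind func(context.Context) error) error {
    deadline := time.Now().Add(timeout)
    for {
        err := tryRewind(ctx)
        if err == nil {
            return nil
        }
        if time.Now().After(deadline) {
            return err // give up, fall back to pg_basebackup
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(interval):
        }
    }
}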
The main issue I see here is that there'll be a window between issuing the promote and finishing the checkpoint during which the keeper has already reported its state to the sentinel and the sentinel has updated the old primary's follow config.
One possible solution to fix this, and also other related issues, would be to report to the sentinel the current checkpoint status after the promote, or the checkpoint timeline (a new field in cluster.go's DBStatus), so the sentinel will set the old primary's followConfig only when it's sure it's in a good state for a pg_rewind.
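A sketch of what that new DBStatus field might look like; this is hypothetical, nothing here is implemented:

// Hypothetical addition to DBStatus in cluster.go: report the timeline of
// the last completed checkpoint, so the sentinel only sets the old
// primary's followConfig once the new primary has checkpointed on the new
// timeline.
type DBStatus struct {
    // ... existing fields (Healthy, TimelineID, XLogPos, ...) ...

    CheckpointTimelineID uint64 `json:"checkpointTimelineID,omitempty"`
}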
It's quite sad for me that we have to do this, if you know what I mean? pg_basebackup has an option to wait for a checkpoint to complete before starting, so it's a shame that pg_rewind doesn't also have this option. Perhaps we can simulate it by creating a backup label, as that appears to be what pg_basebackup does internally? Do you have any thoughts on this?
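For reference, the backup API route might look something like the sketch below. On the PostgreSQL versions current at the time of this discussion (9.6 through 14), pg_start_backup performs a checkpoint and only returns once it has completed, and fast=false makes it a spread checkpoint. This is only an illustration, not something the PR does:

package main

import (
    "context"
    "database/sql"

    _ "github.com/lib/pq" // postgres driver
)

// spreadCheckpointViaBackupAPI triggers a spread checkpoint on the primary
// by starting and immediately stopping a non-exclusive backup. Sketch only.
func spreadCheckpointViaBackupAPI(ctx context.Context, db *sql.DB) error {
    // A non-exclusive backup must be started and stopped on the same
    // session, so pin a single connection from the pool.
    conn, err := db.Conn(ctx)
    if err != nil {
        return err
    }
    defer conn.Close()
    // fast=false requests a spread checkpoint; the call blocks until the
    // checkpoint completes.
    if _, err := conn.ExecContext(ctx, "SELECT pg_start_backup('stolon-rewind', false, false)"); err != nil {
        return err
    }
    _, err = conn.ExecContext(ctx, "SELECT * FROM pg_stop_backup(false)")
    return err
}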
@sgotti I know you're probably busy on a lot of other things right now, but would really appreciate your feedback for this change. It matters quite a lot to us that rewinds are working, and this change makes that the case.
If we want to take another path that's great, but we'll need to know what it is before we can get implementing it!
@lawrencejones Also, while in theory I'd prefer a solution that doesn't require one keeper performing operations against another instance, in the end this is already done when doing a pg_basebackup or a pg_rewind.
So I don't want to stop this PR trying to find a cleanliness that's not really possible. Let's go ahead with your solution! (In the end we are just doing what pg_rewind should do itself: a checkpoint inside pg_rewind, like pg_basebackup already does.)
P.S. We should report this upstream if not already done.
Hey @sgotti!
So I don't want to stop this PR trying to find a cleanliness that's not really possible. Let's go ahead with your solution! (In the end we are just doing what pg_rewind should do itself: a checkpoint inside pg_rewind, like pg_basebackup already does.)
Makes sense to me :)
I've adjusted the doc comment to make it clearer we'll be hitting the primary. All your other review comments were questions/have been addressed, I think.
Can I get another look? Thanks.
Hi Team, wondering any updates on this? Anything I can help with to get this merged?
@viggy28 If no one is going to work on this PR (@lawrencejones), feel free to adopt and update it.
Hey @viggy28, if you want to take this over then please let me know. I’m not working on stolon things right now so if you need this, I’ll be happy to hand it over!
Sure @lawrencejones. Looks like you have done most of the work. Thanks much.
Trying to understand what is needed.
- https://github.com/sorintlab/stolon/pull/644/files#r293416129
- https://github.com/sorintlab/stolon/pull/644#pullrequestreview-249368918 is unresolved, but @sgotti also mentioned here https://github.com/sorintlab/stolon/pull/644#issuecomment-524313267 that it shouldn't be a blocker, which means we can probably resolve it?
- Looks like the build is failing.
Anything else I am missing?
Dear contributors, we have a stolon cluster which is so big that it takes hours to recover with a basebackup. I applied this change to the latest master and ran it in our test environment. The standby keepers keep failing after pg_rewind completes:
pg_rewind: Done!
2024-02-27T09:24:12.615Z INFO cmd/keeper.go:928 syncing using pg_rewind {"masterDB": "1f5bbc57", "keeper": "keeper0"}
2024-02-27T09:24:12.636Z INFO postgresql/postgresql.go:921 issuing checkpoint on primary
CHECKPOINT
2024-02-27T09:24:12.993Z INFO postgresql/postgresql.go:932 running pg_rewind
2024-02-27T09:24:12.994Z DEBUG postgresql/postgresql.go:936 execing cmd {"cmd": "/usr/lib/postgresql/14/bin/pg_rewind --debug -D /stolon-data/postgres --source-server=application_name=stolon_7805b990 dbname=zabbix host=172.16.250.153 options=-c\\ synchronous_commit=off password=******************** port=5432 sslmode=prefer user=stolon"}
pg_rewind: executing "/usr/lib/postgresql/14/bin/postgres" for target server to complete crash recovery
2024-02-27 09:24:13.068 GMT [1590] LOG: skipping missing configuration file "/stolon-data/postgres/postgresql.auto.conf"
2024-02-27T09:24:13.173Z ERROR cmd/keeper.go:736 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2024-02-27T09:24:15.675Z ERROR cmd/keeper.go:736 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2024-02-27 09:24:16.349 UTC [1590] LOG: database system was interrupted while in recovery at log time 2024-02-27 09:23:11 UTC
2024-02-27 09:24:16.349 UTC [1590] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2024-02-27 09:24:16.455 UTC [1590] LOG: redo starts at B/280161F8
2024-02-27 09:24:16.556 UTC [1590] LOG: invalid record length at B/28182E08: wanted 24, got 0
2024-02-27 09:24:16.556 UTC [1590] LOG: redo done at B/28182DE0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.10 s
PostgreSQL stand-alone backend 14.0 (Debian 14.0-1.pgdg110+1)
2024-02-27T09:24:17.548Z DEBUG postgresql/postgresql.go:625 execing cmd {"cmd": "/usr/lib/postgresql/14/bin/postgres -V"}
2024-02-27T09:24:17.561Z DEBUG cmd/keeper.go:815 pgstate dump: (*cluster.PostgresState)(0xc0008f84d0)({
UID: (string) (len=8) "7805b990",
Generation: (int64) 0,
ListenAddress: (string) (len=14) "172.16.115.237",
Port: (string) (len=4) "5432",
Healthy: (bool) false,
SystemID: (string) "",
TimelineID: (uint64) 0,
XLogPos: (uint64) 0,
TimelinesHistory: (cluster.PostgresTimelinesHistory) <nil>,
PGParameters: (common.Parameters) <nil>,
SynchronousStandbys: ([]string) <nil>,
OlderWalFile: (string) ""
})
2024-02-27T09:24:18.177Z ERROR cmd/keeper.go:736 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
pg_rewind: fetched file "global/pg_control", length 8192
pg_rewind: source and target cluster are on the same timeline
pg_rewind: no rewind required
2024-02-27T09:24:18.222Z DEBUG postgresql/postgresql.go:625 execing cmd {"cmd": "/usr/lib/postgresql/14/bin/postgres -V"}
2024-02-27T09:24:18.324Z INFO postgresql/postgresql.go:844 writing standby signal file
2024-02-27T09:24:18.368Z INFO postgresql/postgresql.go:319 starting database
2024-02-27T09:24:18.368Z DEBUG postgresql/postgresql.go:323 execing cmd {"cmd": "/usr/lib/postgresql/14/bin/postgres -D /stolon-data/postgres -c unix_socket_directories=/tmp"}
2024-02-27T09:24:20.682Z ERROR cmd/keeper.go:736 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
2024-02-27 09:24:21.540 UTC [1593] LOG: starting PostgreSQL 14.0 (Debian 14.0-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2024-02-27 09:24:21.540 UTC [1593] LOG: listening on IPv4 address "172.16.115.237", port 5432
2024-02-27 09:24:21.589 UTC [1593] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2024-02-27 09:24:22.064 UTC [1594] LOG: database system was shut down at 2024-02-27 09:24:17 UTC
2024-02-27 09:24:22.064 UTC [1595] FATAL: the database system is starting up
2024-02-27 09:24:22.085 UTC [1594] LOG: entering standby mode
2024-02-27 09:24:22.221 UTC [1594] LOG: invalid record length at B/28182EF8: wanted 24, got 0
2024-02-27 09:24:22.245 UTC [1598] LOG: started streaming WAL from primary at B/28000000 on timeline 31
2024-02-27 09:24:22.267 UTC [1599] FATAL: the database system is not yet accepting connections
2024-02-27 09:24:22.267 UTC [1599] DETAIL: Consistent recovery state has not been yet reached.
2024-02-27 09:24:22.271 UTC [1594] LOG: invalid resource manager ID 48 at B/28182EF8
2024-02-27 09:24:22.279 UTC [1598] FATAL: terminating walreceiver process due to administrator command
2024-02-27 09:24:22.279 UTC [1594] LOG: invalid resource manager ID 48 at B/28182EF8
2024-02-27 09:24:22.280 UTC [1594] LOG: invalid resource manager ID 48 at B/28182EF8
2024-02-27 09:24:22.469 UTC [1601] FATAL: the database system is not yet accepting connections
2024-02-27 09:24:22.469 UTC [1601] DETAIL: Consistent recovery state has not been yet reached.
2024-02-27T09:24:22.583Z DEBUG postgresql/postgresql.go:625 execing cmd {"cmd": "/usr/lib/postgresql/14/bin/postgres -V"}
2024-02-27T09:24:22.593Z DEBUG cmd/keeper.go:815 pgstate dump: (*cluster.PostgresState)(0xc0006ac630)({
UID: (string) (len=8) "7805b990",
Generation: (int64) 0,
ListenAddress: (string) (len=14) "172.16.115.237",
Port: (string) (len=4) "5432",
Healthy: (bool) false,
SystemID: (string) "",
TimelineID: (uint64) 0,
XLogPos: (uint64) 0,
TimelinesHistory: (cluster.PostgresTimelinesHistory) <nil>,
PGParameters: (common.Parameters) <nil>,
SynchronousStandbys: ([]string) <nil>,
OlderWalFile: (string) ""
})
2024-02-27 09:24:22.671 UTC [1603] FATAL: the database system is not yet accepting connections
2024-02-27 09:24:22.671 UTC [1603] DETAIL: Consistent recovery state has not been yet reached.
2024-02-27 09:24:22.873 UTC [1604] FATAL: the database system is not yet accepting connections
2024-02-27 09:24:22.873 UTC [1604] DETAIL: Consistent recovery state has not been yet reached.
2024-02-27 09:24:23.094 UTC [1605] FATAL: the database system is not yet accepting connections
2024-02-27 09:24:23.094 UTC [1605] DETAIL: Consistent recovery state has not been yet reached.
2024-02-27 09:24:23.185 UTC [1606] FATAL: the database system is not yet accepting connections
2024-02-27 09:24:23.185 UTC [1606] DETAIL: Consistent recovery state has not been yet reached.
2024-02-27T09:24:23.185Z ERROR cmd/keeper.go:736 cannot get configured pg parameters {"error": "pq: the database system is not yet accepting connections"}
2024-02-27 09:24:23.297 UTC [1607] FATAL: the database system is not yet accepting connections
2024-02-27 09:24:23.297 UTC [1607] DETAIL: Consistent recovery state has not been yet reached.
2024-02-27 09:24:23.499 UTC [1608] FATAL: the database system is not yet accepting connections
Is anybody still working on this issue?