Fix segfault when recovering 2pc transaction
When promoting a mirror segment due to failover we have seen a stacktrace like this:
"FATAL","58P01","requested WAL segment pg_xlog/00000023000071D50000001F has already been removed",,,,,,,0,,"xlogutils.c",580,"Stack
trace:
1 0x557bdb9f09b6 postgres errstart + 0x236
2 0x557bdb5fc6cf postgres <symbol not found> + 0xdb5fc6cf
3 0x557bdb5fd021 postgres read_local_xlog_page + 0x191
4 0x557bdb5fb922 postgres <symbol not found> + 0xdb5fb922
5 0x557bdb5fba11 postgres XLogReadRecord + 0xa1
6 0x557bdb5e7767 postgres RecoverPreparedTransactions + 0xd7
7 0x557bdb5f608b postgres StartupXLOG + 0x2a3b
8 0x557bdb870a89 postgres StartupProcessMain + 0x139
9 0x557bdb62f489 postgres AuxiliaryProcessMain + 0x549
10 0x557bdb86d275 postgres <symbol not found> + 0xdb86d275
11 0x557bdb8704e3 postgres PostmasterMain + 0x1213
12 0x557bdb56a1f7 postgres main + 0x497
13 0x7fded4a61c87 libc.so.6 __libc_start_main + 0xe7
Note: stacktrace is from one of our production GP clusters and might be slightly different from what we will see in cloudberry, but the failure is still present here as well. Testcase proves it.
PG13 and PG14 have a fix for this bug, but it's doesn't have any test case and looks like we didn't cherry-pick that far. The discussion can be found here: https://www.postgresql.org/message-id/flat/743b9b45a2d4013bd90b6a5cba8d6faeb717ee34.camel%40cybertec.at
In a few words, StartupXLOG() renames the last wal segment to .partial but tries to read it by the old name later in RecoverPreparedTransactions().
The fix is mostly borrowed from PG14 postgres/postgres@f663b00 with some cloudberry-related exceptions. Also added a regression test which segfaults without this fix for any version of GP, PG<=12 or Cloudberry.
Fixes #ISSUE_Number
What does this PR do?
Type of Change
- [ ] Bug fix (non-breaking change)
- [ ] New feature (non-breaking change)
- [ ] Breaking change (fix or feature with breaking changes)
- [ ] Documentation update
Breaking Changes
Test Plan
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Passed
make installcheck - [ ] Passed
make -C src/test installcheck-cbdb-parallel
Impact
Performance:
User-facing changes:
Dependencies:
Checklist
- [ ] Followed contribution guide
- [ ] Added/updated documentation
- [ ] Reviewed code for security implications
- [ ] Requested review from cloudberry committers
Additional Context
CI Skip Instructions
Hi team, as this PR somewhat reorders the process of xlog startup, I need reviewers to carefully double-check I didn't mess something up here. Especially I have no knowledge how Startup hook is used, perhaps it cannot be moved up this way. The issue itself shows up quite often in our GP installations and from what I understand it's not fixed yet in cloudberry as well.
I see that my test is failing. Didn't fail locally the last time I tried. Will take a look Added ordering for test stability, but still need to figure out why it crashes - it doesn't on my local build.
Nice catch! @Smyatkin-Maxim it's pax related flake test. So I will retry ci.
Nice catch! @Smyatkin-Maxim it's pax related flake test. So I will retry ci.
on the first CI run there was a crash+coredump on my testcase. I'm gonna be rerunning IC2 tests a couple times, feels like there is yet another rarely reproducible bug, which my testcase (or my fix?) uncovers.
UPD: after a couple reruns of isolation2 suite I still didn't get the core dump I've seen on the first run.
@jiaqizho @gongxun0928
1 | lock_tbl1a
1 | lock_view2
1 | lock_view3
+ 1 | pg_pax_blocks_4[33](https://github.com/apache/cloudberry/actions/runs/14729599874/job/41419204900?pr=1078#step:18:34)56
2 | lock_tbl1
2 | lock_tbl1a
2 | lock_view2
this pax related.
@jiaqizho @gongxun0928
1 | lock_tbl1a 1 | lock_view2 1 | lock_view3 + 1 | pg_pax_blocks_4[33](https://github.com/apache/cloudberry/actions/runs/14729599874/job/41419204900?pr=1078#step:18:34)56 2 | lock_tbl1 2 | lock_tbl1a 2 | lock_view2this pax related.
retry pls. not sure which case will cause this lock.
@jiaqizho @gongxun0928
1 | lock_tbl1a 1 | lock_view2 1 | lock_view3 + 1 | pg_pax_blocks_4[33](https://github.com/apache/cloudberry/actions/runs/14729599874/job/41419204900?pr=1078#step:18:34)56 2 | lock_tbl1 2 | lock_tbl1a 2 | lock_view2this pax related.
The test should be fixed in the latest main branch.
Thanks for your PR, @Smyatkin-Maxim, do you mind to cherry-pick the commit that fixes this issue in pg14? So, the context and commit message are kept.