[Bug] Enabling wal_compression Leads To Coredumps
### Apache Cloudberry version
Cloudberry 1.6.0 (pre apache release)
### What happened
A few days ago we set wal_compression = on in an attempt to reduce I/O in our production cluster. Shortly afterwards, users reported that queries that were part of a large workload were failing. On investigation, we found core dumps being generated on the segments that were throwing errors, and these core dumps are directly related to the WAL compression functionality. The exception appears to have been thrown right after XLogCompressBackupBlock.cold.4 ran, which produced the core dump. Thankfully it didn't crash any segments, so I imagine the WAL work happens in its own thread. We quickly disabled this GUC and haven't seen the issue since. (It happened on multiple segments multiple times, because users were retrying their jobs.)
Client Side Error
DEBUG ERROR: Error on receive from seg25 slice1 10.
Coredump Trace
```
#0 0x00007f665f04e52f in raise () from /lib64/libc.so.6
#1 0x00007f665f021e65 in abort () from /lib64/libc.so.6
#2 0x00007f66600de060 in errfinish () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#3 0x00007f665fa84888 in XLogCompressBackupBlock.cold.4 () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#4 0x00007f665fbfc814 in XLogRecordAssemble () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#5 0x00007f665fbfcbc4 in XLogInsert_Internal () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#6 0x00007f665fb9364f in heap_delete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#7 0x00007f665fb93836 in simple_heap_delete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#8 0x00007f665fb5c81c in toast_delete_datum () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#9 0x00007f665fbd2a5f in toast_delete_external () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#10 0x00007f665fba1070 in heap_toast_delete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#11 0x00007f665fb9343d in heap_delete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#12 0x00007f665fda4298 in ExecDelete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#13 0x00007f665fda62f1 in ExecModifyTable () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#14 0x00007f665fd7877b in ExecProcNodeFirst () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#15 0x00007f665fd6f47a in ExecutePlan.part.1 () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#16 0x00007f665fd6ff28 in standard_ExecutorRun () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#17 0x00007f665fd70135 in ExecutorRun () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#18 0x00007f665ff8af2d in ProcessQuery.isra.3 () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#19 0x00007f665ff8beb2 in PortalRunMulti () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#20 0x00007f665ff8c33d in PortalRun () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#21 0x00007f665ff865df in exec_mpp_query () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#22 0x00007f665ff89ebd in PostgresMain () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#23 0x00007f665fee5ddf in ServerLoop () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#24 0x00007f665fee6f1f in PostmasterMain () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#25 0x00000000004017ae in main ()
```
### What you think should happen instead
Shouldn't core dump :)
### How to reproduce
I haven't tried creating a test case for this yet, but it should be relatively easy to reproduce. All we did was enable the GUC and run `gpstop -u`, and then our users started having issues.
### Operating System
Rocky Linux 8.10 (Green Obsidian)
### Anything else
_No response_
### Are you willing to submit PR?
- [ ] Yes, I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/cloudberry/blob/main/CODE_OF_CONDUCT.md).
Related: https://github.com/apache/cloudberry/issues/806
In PostgreSQL, if elog(ERROR) is called inside a critical section, it will be automatically escalated to a PANIC.
In the WAL write path, PostgreSQL uses a critical section mechanism:

    START_CRIT_SECTION();
      |
      → XLogInsert_Internal()
          → XLogRecordAssemble()
              → XLogCompressBackupBlock()
                  → elog(ERROR, "compression failed")
                      |
                      → errstart(ERROR)
                          → if (CritSectionCount > 0)
                                elevel = PANIC;
                      → errfinish()
                          → abort()
    END_CRIT_SECTION();

This means that even when a function reports a failure with elog(ERROR), if it is executing between START_CRIT_SECTION() and END_CRIT_SECTION(), PostgreSQL automatically treats the error as a PANIC. The escalation calls abort() and forcibly terminates the process, to prevent the database from continuing in a potentially corrupted state. This matches the core dump trace above, where errfinish() calls abort() directly beneath XLogCompressBackupBlock.
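
The escalation rule can be illustrated with a small standalone sketch. This is not PostgreSQL or Cloudberry source: every `sketch_`/`SKETCH_` name is hypothetical, and the severity values are chosen only for illustration. It models a nesting counter that plays the role of CritSectionCount and a helper that promotes any error-level report to PANIC while the counter is nonzero.

```c
#include <assert.h>

/* Illustrative severity levels (hypothetical values, not PostgreSQL's). */
enum {
    SKETCH_WARNING = 1,
    SKETCH_ERROR   = 2,
    SKETCH_FATAL   = 3,
    SKETCH_PANIC   = 4
};

/* Stand-in for CritSectionCount: nesting depth of critical sections. */
static int sketch_crit_section_count = 0;

/* Stand-in for START_CRIT_SECTION(): enter one more nesting level. */
static void sketch_start_crit_section(void)
{
    sketch_crit_section_count++;
}

/* Stand-in for END_CRIT_SECTION(): leave one nesting level. */
static void sketch_end_crit_section(void)
{
    assert(sketch_crit_section_count > 0);
    sketch_crit_section_count--;
}

/*
 * Return the severity that would actually be acted on.
 * Anything at ERROR or above becomes PANIC inside a critical section;
 * lower severities (such as warnings) pass through unchanged.
 */
static int sketch_effective_elevel(int elevel)
{
    if (elevel >= SKETCH_ERROR && sketch_crit_section_count > 0)
        return SKETCH_PANIC;
    return elevel;
}
```

In this model, `sketch_effective_elevel(SKETCH_ERROR)` returns `SKETCH_ERROR` outside a critical section and `SKETCH_PANIC` between `sketch_start_crit_section()` and `sketch_end_crit_section()`, which is the situation a compression failure hits during WAL record assembly.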