
[Bug] compression failed: Destination buffer is too small uncompressed len 32760

Open ryapandt opened this issue 2 months ago • 7 comments

Apache Cloudberry version

2.0

What happened

When performing SELECT, INSERT, or VACUUM operations on some tables, the error "compression failed: Destination buffer is too small uncompressed len 32760" occurs. The problematic tables have no obvious distinguishing features and may be of type AO_ROW or AO_COLUMN.

Error logs:

2025-10-29 10:25:38.823383 CST,"gpadmin","hdb",p1838887,th1162283136,"172.18.1.37","61240",2025-10-29 10:25:35 CST,921113,con85,,seg24,,dx1852329,x921113,sx1,"PANIC","XX000","compression failed: Destination buffer is too small uncompressed len 32760 (xloginsert.c:891)",,,,,," vacuum full dwh.middle_store_item_require_price_history ",0,,"xloginsert.c",891,"Stack trace:
    1  0x7fd13eb623d2 libpostgres.so errstart + 0x202
    2  0x7fd13e42a7aa libpostgres.so <symbol not found> + 0x3e42a7aa
    3  0x7fd13e60b051 libpostgres.so <symbol not found> + 0x3e60b051
    4  0x7fd13e60b43d libpostgres.so XLogInsert + 0xdd
    5  0x7fd13e58688c libpostgres.so heap_delete + 0xabc
    6  0x7fd13e586bc7 libpostgres.so simple_heap_delete + 0x37
    7  0x7fd13e639db4 libpostgres.so AppendOnlyVisimapStore_DeleteSegmentFile + 0x84
    8  0x7fd13e64ce00 libpostgres.so AOCSCompact + 0x660
    9  0x7fd13e78cee6 libpostgres.so ao_vacuum_rel + 0x386
    10 0x7fd13e779ee0 libpostgres.so <symbol not found> + 0x3e779ee0
    11 0x7fd13e77b5f1 libpostgres.so vacuum + 0xaa1
    12 0x7fd13e77beac libpostgres.so ExecVacuum + 0x39c
    13 0x7fd13ea014eb libpostgres.so standard_ProcessUtility + 0x8bb
    14 0x7fd13765148a pax.so <symbol not found> + 0x3765148a
    15 0x7fd13ea01b22 libpostgres.so ProcessUtility + 0xf2
    16 0x7fd13e9fefa5 libpostgres.so <symbol not found> + 0x3e9fefa5
    17 0x7fd13e9ff0f4 libpostgres.so <symbol not found> + 0x3e9ff0f4
    18 0x7fd13e9ff863 libpostgres.so PortalRun + 0x2c3
    19 0x7fd13e9f94c7 libpostgres.so <symbol not found> + 0x3e9f94c7
    20 0x7fd13e9fce60 libpostgres.so PostgresMain + 0x20e0
    21 0x7fd13e942112 libpostgres.so <symbol not found> + 0x3e942112
    22 0x7fd13e9431f1 libpostgres.so PostmasterMain + 0xe61
    23 0x4027eb postgres main (main.c:200)
    24 0x7fd13d9ad7e5 libc.so.6 __libc_start_main + 0xe5
    25 0x40298e postgres _start + 0x2e"
2025-10-29 10:25:41.004201 CST,,,p342006,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","server process (PID 1838887) was terminated by signal 6: Aborted","Failed process was running: vacuum full dwh.middle_store_item_require_price_history ",,,,,,0,,"postmaster.c",4280,
2025-10-29 10:25:41.004246 CST,,,p342006,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","terminating any other active server processes",,,,,,,0,,"postmaster.c",3995,
2025-10-29 10:25:41.004436 CST,,,p1838465,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","ic-proxy: received signal 3",,,,,,,0,,"ic_proxy_main.c",484,
2025-10-29 10:25:41.004451 CST,,,p1838465,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","ic-proxy: server closing",,,,,,,0,,"ic_proxy_main.c",585,
2025-10-29 10:25:41.004466 CST,,,p1838465,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","ic-proxy: server closed with code 1",,,,,,,0,,"ic_proxy_main.c",599,
2025-10-29 10:25:41.005949 CST,,,p342006,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","background worker ""ic proxy process"" (PID 1838465) exited with exit code 1",,,,,,,0,,"postmaster.c",4259,
2025-10-29 10:25:41.006996 CST,,,p342006,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","all server processes terminated; reinitializing",,,,,,,0,,"postmaster.c",4571,
2025-10-29 10:25:41.139775 CST,,,p1838917,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","database system was interrupted; last known up at 2025-10-29 10:22:33 CST",,,,,,,0,,"xlog.c",6822,
2025-10-29 10:25:41.139808 CST,,,p1838917,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","Synchronization of the wal directory starts.",,,,,,,0,,"fd.c",3452,
2025-10-29 10:25:41.139932 CST,"gpadmin",,p1838918,th1162283136,"172.18.1.61","11270",2025-10-29 10:25:41 CST,0,,,seg24,,,,,"FATAL","57P03","the database system is in recovery mode","last replayed record at 0/0",,,,,,0,,"postmaster.c",2789,
2025-10-29 10:25:41.140242 CST,,,p1838917,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","synchronization of the wal directory finishes.",,,,,,,0,,"fd.c",3454,
2025-10-29 10:25:41.140625 CST,,,p1838917,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","database system was not properly shut down; automatic recovery in progress",,,,,,,0,,"xlog.c",7391,
2025-10-29 10:25:41.156145 CST,,,p1838917,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","redo starts at DF/CF1900D0",,,,,,,0,,"xlog.c",7671,
2025-10-29 10:25:41.237241 CST,,,p1838917,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","redo done at DF/DA26BA78 system usage: CPU: user: 0.04 s, system: 0.03 s, elapsed: 0.08 s",,,,,,,0,,"xlog.c",7959,
2025-10-29 10:25:41.263819 CST,,,p1838917,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","end of transaction log location is DF/DA26BAB0",,,,,,,0,,"xlog.c",8048,
2025-10-29 10:25:41.266514 CST,,,p1838917,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","checkpoint starting: end-of-recovery immediate",,,,,,,0,,"xlog.c",9239,
2025-10-29 10:25:41.279791 CST,,,p1838917,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","checkpoint complete: wrote 268 buffers (0.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.005 s, sync=0.003 s, total=0.016 s; sync files=5, longest=0.002 s, average=0.001 s; distance=181120 kB, estimate=181120 kB",,,,,,,0,,"xlog.c",9320,
2025-10-29 10:25:41.279810 CST,,,p1838917,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","latest completed transaction id is 921113 and next transaction id is 921114",,,,,,,0,,"xlog.c",8446,
2025-10-29 10:25:41.280054 CST,,,p1838917,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","database system is ready",,,,,,,0,,"xlog.c",8473,
2025-10-29 10:25:41.286455 CST,,,p342006,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","PostgreSQL 14.4 (Apache Cloudberry 2.0.0-incubating build 1) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9), 64-bit compiled on Sep 17 2025 17:40:32 (with assert checking)",,,,,,,0,,"postmaster.c",3564,
2025-10-29 10:25:41.286473 CST,,,p342006,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","database system is ready to accept connections","PostgreSQL 14.4 (Apache Cloudberry 2.0.0-incubating build 1) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9), 64-bit compiled on Sep 17 2025 17:40:32 (with assert checking)",,,,,,0,,"postmaster.c",3566,
2025-10-29 10:25:41.286649 CST,,,p1838926,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","ic-proxy: server setting up",,,,,,,0,,"ic_proxy_main.c",531,
2025-10-29 10:25:41.286949 CST,,,p1838926,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","ic-proxy: server running",,,,,,,0,,"ic_proxy_main.c",572,
2025-10-29 10:25:46.036465 CST,"gpadmin",,p1838961,th1162283136,"172.18.1.61","11280",2025-10-29 10:25:46 CST,0,,,seg24,,,,,"LOG","00000","standby ""gp_walreceiver"" is now a synchronous standby with priority 1",,,,,,"START_REPLICATION SLOT ""internal_wal_replication_slot"" DF/D8000000 TIMELINE 11",0,,"syncrep.c",673,
2025-10-29 10:26:13.319117 CST,,,p1838927,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","activeWeight underflow!",,,,,,,0,,"backoff.c",912,
2025-10-29 10:26:14.320043 CST,,,p1838927,th1162283136,,,,0,,,seg24,,,,,"LOG","00000","activeWeight underflow!",,,,,,,0,,"backoff.c",912,

What you think should happen instead

No response

How to reproduce

None

Operating System

rocky 8.10

Anything else

No response

Are you willing to submit PR?

  • [ ] Yes, I am willing to submit a PR!

Code of Conduct

ryapandt avatar Oct 29 '25 02:10 ryapandt

Hi @ryapandt, welcome! 🎊 Thanks for taking the time to point this out. 🙌

github-actions[bot] avatar Oct 29 '25 02:10 github-actions[bot]

From the description, the problem is that the buffer is too small when compressing the XLOG. There may be two ways to work around it: 1. disable the wal_compression GUC; 2. make the buffer size bigger. For the root cause, can you give more details, such as SQL to reproduce it?
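A minimal sketch of the first workaround, assuming a superuser session on the coordinator (wal_compression is a superuser-settable GUC in PostgreSQL 14); whether a session-level setting reaches every segment depends on how the cluster dispatches GUCs, so the cluster-wide gpconfig route is noted in the comments as the usual alternative:

```sql
-- Check whether full-page-image compression is currently enabled.
SHOW wal_compression;

-- Workaround 1: turn WAL compression off for this (superuser) session and
-- retry the failing command. For a cluster-wide change, the usual route on a
-- Cloudberry/Greenplum cluster is assumed to be:
--   gpconfig -c wal_compression -v off && gpstop -u
SET wal_compression = off;
VACUUM FULL dwh.middle_store_item_require_price_history;
```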

zhangwenchao-123 avatar Oct 29 '25 03:10 zhangwenchao-123

The issue is not easily reproducible. After keeping the table structure unchanged and re-importing the same data, the problem disappeared.

From the description, the problem is that the buffer is too small when compressing the XLOG. There may be two ways to work around it: 1. disable the wal_compression GUC; 2. make the buffer size bigger. For the root cause, can you give more details, such as SQL to reproduce it?

ryapandt avatar Oct 29 '25 04:10 ryapandt

From the description, the problem is that the buffer is too small when compressing the XLOG. There may be two ways to work around it: 1. disable the wal_compression GUC; 2. make the buffer size bigger. For the root cause, can you give more details, such as SQL to reproduce it?

1. Disabling the wal_compression GUC worked. 2. Making the buffer size bigger: how do I adjust it?

ryapandt avatar Oct 29 '25 04:10 ryapandt

From the description, the problem is that the buffer is too small when compressing the XLOG. There may be two ways to work around it: 1. disable the wal_compression GUC; 2. make the buffer size bigger. For the root cause, can you give more details, such as SQL to reproduce it?

1. Disabling the wal_compression GUC worked. 2. Making the buffer size bigger: how do I adjust it?

For the second way, you can change the wal_block_size GUC.
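For reference, a quick way to see what that knob is currently set to. Note that in stock PostgreSQL 14 wal_block_size is a read-only preset GUC (it reports XLOG_BLCKSZ, fixed at build time via configure --with-wal-blocksize), so changing it is assumed to require a rebuild rather than a reload:

```sql
-- Current WAL block size (read-only preset; reflects the build-time XLOG_BLCKSZ).
SHOW wal_block_size;

-- Data-page block size, for comparison (also a build-time preset).
SHOW block_size;
```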

zhangwenchao-123 avatar Oct 29 '25 06:10 zhangwenchao-123

@ryapandt Could you paste the stack trace of the core file? It seems like zstd fails to compress and the log message is elevated to PANIC.

gfphoenix78 avatar Nov 10 '25 08:11 gfphoenix78

The destination buffer size is hard-coded and is not changed at runtime.

gfphoenix78 avatar Nov 11 '25 01:11 gfphoenix78

@ryapandt Are you using version 2.0.0-incubating-rc1? I think this issue is fixed by this PR: 55d2cacddbcfa1363a948e82fea3ded8fe247a86.
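To confirm which build is actually running (the startup log above reports "Apache Cloudberry 2.0.0-incubating build 1"), a plain version check:

```sql
-- Shows the full Cloudberry/PostgreSQL build string for the running server.
SELECT version();
```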

HuSen8891 avatar Dec 01 '25 03:12 HuSen8891