cloudberry icon indicating copy to clipboard operation
cloudberry copied to clipboard

[Bug] Run sql failed,report server closed the connection unexpectedly

Open cleverxiao001 opened this issue 5 months ago • 10 comments

Apache Cloudberry version

2.0.0-incubating-rc2

What happened

table structure: create table reference_info ( uid text, country text, ref_uid text, ref_country text, ref_year varchar(32) ) with (appendoptimized = true, orientation = column, compresstype = zstd,compresslevel = 5) distributed by (uid); There are 20 million data entries in the database table. I run sql: select t.country, t.ref_year, ref_quantity, ref_frequency from (select country, ref_year, count() as ref_frequency from reference_info group by country, ref_year) t join (select country, ref_year, count() as ref_quantity from (select country, ref_year, ref_uid from reference_info group by country, ref_year, ref_uid) t group by country, ref_year) u on t.country = u.country and t.ref_year = u.ref_year; or select ref_country, ref_year, count(distinct ref_uid) as cited_quantity, count() as cited_frequency from reference_info where country = 'USA' group by ref_country, ref_year; is all worked.But select country, ref_year, count() as ref_frequency, count(distinct ref_uid) as ref_quantity from reference_info group by country, ref_year; is failed ,log is: 2025-07-22 21:37:15.905016 EDT,"gpadmin","db",p4122615,th294045888,"192.168.23.218","33828",2025-07-22 21:27:15 EDT,0,con3154,cmd2,seg-1,,dx1344140,,sx1,"ERROR","58M01","Error on receive from seg7 slice1 192.168.18.164:6007 pid=4123699: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.",,,,,,,0,,"cdbdispatchresult.c",510,"Stack trace: 1 0x7fab13df44a8 libpostgres.so errstart + 0x258 2 0x7fab13dfa3f6 libpostgres.so ThrowErrorData + 0x16 3 0x7fab13a4919f libpostgres.so mppExecutorFinishup + 0xdf 4 0x7fab13a37b2e libpostgres.so standard_ExecutorEnd + 0xee 5 0x7fab139cc629 libpostgres.so PortalCleanup + 0x79 6 0x7fab13e3973b libpostgres.so PortalDrop + 0x3b 7 0x7fab13e39de5 libpostgres.so PreCommit_Portals + 0x125 8 0x7fab13889e5c libpostgres.so + 0x13889e5c 9 0x7fab1388bd45 libpostgres.so CommitTransactionCommand + 0x2d5 10 0x7fab13c8f007 libpostgres.so PostgresMain + 0x1327 11 0x7fab13bd5b62 libpostgres.so + 0x13bd5b62 12 0x7fab13bd6c41 libpostgres.so PostmasterMain + 0xe61 13 0x4027eb postgres main (main.c:200) 14 0x7fab12e295d0 libc.so.6 + 0x12e295d0 15 0x7fab12e29680 libc.so.6 __libc_start_main + 0x80 16 0x402985 postgres _start + 0x25 "

What you think should happen instead

No response

How to reproduce

select country, ref_year, count(*) as ref_frequency, count(distinct ref_uid) as ref_quantity from reference_info group by country, ref_year;

Operating System

rocky 9.6

Anything else

No response

Are you willing to submit PR?

  • [ ] Yes, I am willing to submit a PR!

Code of Conduct

cleverxiao001 avatar Jul 23 '25 02:07 cleverxiao001

Hi, @cleverxiao001 welcome!🎊 Thanks for taking the time to point this out.🙌

github-actions[bot] avatar Jul 23 '25 02:07 github-actions[bot]

is all worked.But select country, ref_year, count() as ref_frequency, count(distinct ref_uid) as ref_quantity from reference_info group by country, ref_year; is failed ,log is:

Hi, I create some random data, but couldn't reproduce that. The log is on master, so there is no much info. Could you try to explain the SQL to see if it will crash? and show the plan if no crash

explain(verbose)

explain(verbose) select country, ref_year, count() as ref_frequency, count(distinct ref_uid) as ref_quantity from reference_info group by country, ref_year;

and explain(analyze)

explain(verbose, analyze) select country, ref_year, count() as ref_frequency, count(distinct ref_uid) as ref_quantity from reference_info group by country, ref_year;

avamingli avatar Jul 24 '25 14:07 avamingli

explain(verbose) output: QUERY PLAN Gather Motion 36:1 (slice1; segments: 36) (cost=0.00..431.00 rows=1 width=32) " Output: country, ref_year, (count()), (count(DISTINCT ref_uid))" -> GroupAggregate (cost=0.00..431.00 rows=1 width=32) " Output: country, ref_year, count(), count(DISTINCT ref_uid)" " Group Key: reference_info.country, reference_info.ref_year" -> Sort (cost=0.00..431.00 rows=1 width=24) " Output: country, ref_uid, ref_year" " Sort Key: reference_info.country, reference_info.ref_year" -> Redistribute Motion 36:36 (slice2; segments: 36) (cost=0.00..431.00 rows=1 width=24) " Output: country, ref_uid, ref_year" " Hash Key: country, ref_year" -> Seq Scan on reference_info (cost=0.00..431.00 rows=1 width=24) " Output: country, ref_uid, ref_year" Optimizer: GPORCA

explain(verbose,analyze) output: QUERY PLAN Gather Motion 36:1 (slice1; segments: 36) (cost=0.00..431.00 rows=1 width=32) (actual time=72085.217..115653.745 rows=2480 loops=1) " Output: country, ref_year, (count()), (count(DISTINCT ref_uid))" -> GroupAggregate (cost=0.00..431.00 rows=1 width=32) (actual time=69972.797..88910.012 rows=92 loops=1) " Output: country, ref_year, count(), count(DISTINCT ref_uid)" " Group Key: reference_info.country, reference_info.ref_year" -> Sort (cost=0.00..431.00 rows=1 width=24) (actual time=69832.795..71977.819 rows=8223755 loops=1) " Output: country, ref_uid, ref_year" " Sort Key: reference_info.country, reference_info.ref_year" Sort Method: quicksort Memory: 122481kB Max Memory: 62527kB Avg Memory: 61240kB (2 segments) Sort Method: external merge Disk: 3588512kB Max Memory: 377088kB Avg Memory: 105544kB (34 segments) work_mem: 2236160kB Segments: 36 Max: 63854kB (segment 1) Workfile: (34 spilling) -> Redistribute Motion 36:36 (slice2; segments: 36) (cost=0.00..431.00 rows=1 width=24) (actual time=3.000..66710.760 rows=8223755 loops=1) " Output: country, ref_uid, ref_year" " Hash Key: country, ref_year" -> Seq Scan on reference_info (cost=0.00..431.00 rows=1 width=24) (actual time=5.000..2059.006 rows=2290316 loops=1) " Output: country, ref_uid, ref_year" Planning Time: 13.835 ms (slice0) Executor memory: 117K bytes. " (slice1) Executor memory: 106412K bytes avg x 36x(0) workers, 141357K bytes max (seg23). Work_mem: 63854K bytes max." " (slice2) Executor memory: 530K bytes avg x 36x(0) workers, 530K bytes max (seg0)." Memory used: 128000kB Optimizer: GPORCA Execution Time: 115751.813 ms

Both of them is worked

cleverxiao001 avatar Jul 25 '25 02:07 cleverxiao001

explain(verbose) output: QUERY PLAN Gather Motion 36:1 (slice1; segments: 36) (cost=0.00..431.00 rows=1 width=32) " Output: country, ref_year, (count()), (count(DISTINCT ref_uid))" -> GroupAggregate (cost=0.00..431.00 rows=1 width=32) " Output: country, ref_year, count(), count(DISTINCT ref_uid)" " Group Key: reference_info.country, reference_info.ref_year" -> Sort (cost=0.00..431.00 rows=1 width=24) " Output: country, ref_uid, ref_year" " Sort Key: reference_info.country, reference_info.ref_year" -> Redistribute Motion 36:36 (slice2; segments: 36) (cost=0.00..431.00 rows=1 width=24) " Output: country, ref_uid, ref_year" " Hash Key: country, ref_year" -> Seq Scan on reference_info (cost=0.00..431.00 rows=1 width=24) " Output: country, ref_uid, ref_year" Optimizer: GPORCA

explain(verbose,analyze) output: QUERY PLAN Gather Motion 36:1 (slice1; segments: 36) (cost=0.00..431.00 rows=1 width=32) (actual time=72085.217..115653.745 rows=2480 loops=1) " Output: country, ref_year, (count()), (count(DISTINCT ref_uid))" -> GroupAggregate (cost=0.00..431.00 rows=1 width=32) (actual time=69972.797..88910.012 rows=92 loops=1) " Output: country, ref_year, count(), count(DISTINCT ref_uid)" " Group Key: reference_info.country, reference_info.ref_year" -> Sort (cost=0.00..431.00 rows=1 width=24) (actual time=69832.795..71977.819 rows=8223755 loops=1) " Output: country, ref_uid, ref_year" " Sort Key: reference_info.country, reference_info.ref_year" Sort Method: quicksort Memory: 122481kB Max Memory: 62527kB Avg Memory: 61240kB (2 segments) Sort Method: external merge Disk: 3588512kB Max Memory: 377088kB Avg Memory: 105544kB (34 segments) work_mem: 2236160kB Segments: 36 Max: 63854kB (segment 1) Workfile: (34 spilling) -> Redistribute Motion 36:36 (slice2; segments: 36) (cost=0.00..431.00 rows=1 width=24) (actual time=3.000..66710.760 rows=8223755 loops=1) " Output: country, ref_uid, ref_year" " Hash Key: country, ref_year" -> Seq Scan on reference_info (cost=0.00..431.00 rows=1 width=24) (actual time=5.000..2059.006 rows=2290316 loops=1) " Output: country, ref_uid, ref_year" Planning Time: 13.835 ms (slice0) Executor memory: 117K bytes. " (slice1) Executor memory: 106412K bytes avg x 36x(0) workers, 141357K bytes max (seg23). Work_mem: 63854K bytes max." " (slice2) Executor memory: 530K bytes avg x 36x(0) workers, 530K bytes max (seg0)." Memory used: 128000kB Optimizer: GPORCA Execution Time: 115751.813 ms

Both of them is worked

Thanks!

@my-ship-it

avamingli avatar Jul 25 '25 02:07 avamingli

@avamingli, is it a ORCA issue? Could we have please help investigate, thanks!

my-ship-it avatar Jul 25 '25 03:07 my-ship-it

@cleverxiao001 Could you search any error messages in logs for seg7?

my-ship-it avatar Jul 25 '25 03:07 my-ship-it

error log in seg7: 2025-07-25 00:53:59.841934 EDT,"gpadmin","postgres",p268864,th1103079616,"168.160.18.164","59040",2025-07-25 00:53:59 EDT,0,con3,,seg7,,,,sx1,"LOG","00000","statement: select gid from pg_prepared_xacts where prepared < now() - interval'120 seconds'",,,,,,,0,,"postgres.c",1713, 2025-07-25 00:54:11.632398 EDT,,,p268891,th1103079616,"168.160.18.164","23646",2025-07-25 00:54:11 EDT,0,,,seg7,,,,,"LOG","00000","query finish request to process 268743",,,,,,,0,,"postmaster.c",2932, 2025-07-25 00:54:12.491000 EDT,,,p268743,th123456,,,2025-07-25 00:52:58 EDT,0,con3202,cmd2,seg7,slice1,,,,"PANIC","XX000","Unexpected internal error: Segment process received signal SIGSEGV",,,,,,,0,,,,"1 0x7fc543ffa92c libpostgres.so gp_backtrace + 0x3c 2 0x7fc543ffabd3 libpostgres.so StandardHandlerForSigillSigsegvSigbus_OnMainThread + 0x153 3 0x7fc54303ebf0 libc.so.6 + 0x4303ebf0 4 0x7fc543ffe090 libpostgres.so pg_detoast_datum_packed + 0x0 5 0x7fc543fa7636 libpostgres.so + 0x43fa7636 6 0x7fc54405aa69 libpostgres.so + 0x4405aa69 7 0x7fc54405ae8b libpostgres.so + 0x4405ae8b 8 0x7fc54405f86e libpostgres.so tuplesort_performsort + 0x5e 9 0x7fc543c4e9c1 libpostgres.so + 0x43c4e9c1 10 0x7fc543c50e66 libpostgres.so + 0x43c50e66 11 0x7fc543c3e8cd libpostgres.so + 0x43c3e8cd 12 0x7fc543c8e3c2 libpostgres.so + 0x43c8e3c2 13 0x7fc543c3eb0d libpostgres.so + 0x43c3eb0d 14 0x7fc543c343da libpostgres.so + 0x43c343da 15 0x7fc543c351a1 libpostgres.so standard_ExecutorRun + 0x811 16 0x7fc543c353a5 libpostgres.so ExecutorRun + 0xc5 17 0x7fc543e908df libpostgres.so + 0x43e908df 18 0x7fc543e926af libpostgres.so PortalRun + 0x1af 19 0x7fc543e8c427 libpostgres.so + 0x43e8c427 20 0x7fc543e8fdc4 libpostgres.so PostgresMain + 0x20e4 21 0x7fc543dd5b62 libpostgres.so + 0x43dd5b62 22 0x7fc543dd6c41 libpostgres.so PostmasterMain + 0xe61 23 0x4027eb postgres main (main.c:200) 24 0x7fc5430295d0 libc.so.6 + 0x430295d0 25 0x7fc543029680 libc.so.6 __libc_start_main + 0x80 26 0x402985 postgres _start + 0x25 " 2025-07-25 00:54:13.510607 EDT,,,p4005057,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","server process (PID 268743) was terminated by signal 11: Segmentation fault","Failed process was running: select country, ref_year, count() as ref_frequency, count(distinct ref_uid) as ref_quantity from reference_info group by country, ref_year",,,,,,0,,"postmaster.c",4280, 2025-07-25 00:54:13.510647 EDT,,,p4005057,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","terminating any other active server processes",,,,,,,0,,"postmaster.c",3995, 2025-07-25 00:54:13.510746 EDT,,,p268442,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","ic-proxy: received signal 3",,,,,,,0,,"ic_proxy_main.c",484, 2025-07-25 00:54:13.510756 EDT,,,p268442,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","ic-proxy: server closing",,,,,,,0,,"ic_proxy_main.c",585, 2025-07-25 00:54:13.510779 EDT,,,p268442,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","ic-proxy: server closed with code 1",,,,,,,0,,"ic_proxy_main.c",599, 2025-07-25 00:54:13.513116 EDT,,,p4005057,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","background worker ""ic proxy process"" (PID 268442) exited with exit code 1",,,,,,,0,,"postmaster.c",4259, 2025-07-25 00:54:13.513301 EDT,,,p4005057,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","all server processes terminated; reinitializing",,,,,,,0,,"postmaster.c",4571, 2025-07-25 00:54:13.589507 EDT,,,p268902,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","database system was interrupted; last known up at 2025-07-25 00:51:50 EDT",,,,,,,0,,"xlog.c",6822, 2025-07-25 00:54:13.589534 EDT,,,p268902,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","Synchronization of the wal directory starts.",,,,,,,0,,"fd.c",3452, 2025-07-25 00:54:13.589586 EDT,,,p268902,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","synchronization of the wal directory finishes.",,,,,,,0,,"fd.c",3454, 2025-07-25 00:54:13.589883 EDT,,,p268902,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","database system was not properly shut down; automatic recovery in progress",,,,,,,0,,"xlog.c",7391, 2025-07-25 00:54:13.592357 EDT,,,p268902,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","redo starts at 5/3B21C628",,,,,,,0,,"xlog.c",7671, 2025-07-25 00:54:13.592373 EDT,,,p268902,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","invalid record length at 5/3B21C660: wanted 24, got 0",,,,,,,0,,"xlog.c",4488, 2025-07-25 00:54:13.592379 EDT,,,p268902,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","redo done at 5/3B21C628 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s",,,,,,,0,,"xlog.c",7959, 2025-07-25 00:54:13.595957 EDT,,,p268902,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","end of transaction log location is 5/3B21C660",,,,,,,0,,"xlog.c",8048, 2025-07-25 00:54:13.596305 EDT,,,p268902,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","checkpoint starting: end-of-recovery immediate",,,,,,,0,,"xlog.c",9239, 2025-07-25 00:54:13.597124 EDT,,,p268902,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB",,,,,,,0,,"xlog.c",9320, 2025-07-25 00:54:13.597140 EDT,,,p268902,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","latest completed transaction id is 492382 and next transaction id is 492383",,,,,,,0,,"xlog.c",8446, 2025-07-25 00:54:13.597377 EDT,,,p268902,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","database system is ready",,,,,,,0,,"xlog.c",8473, 2025-07-25 00:54:13.602248 EDT,,,p4005057,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","PostgreSQL 14.4 (Apache Cloudberry 2.0.0-incubating-rc2 build dev) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), 64-bit compiled on Jul 8 2025 03:25:08 (with assert checking)",,,,,,,0,,"postmaster.c",3564, 2025-07-25 00:54:13.602271 EDT,,,p4005057,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","database system is ready to accept connections","PostgreSQL 14.4 (Apache Cloudberry 2.0.0-incubating-rc2 build dev) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), 64-bit compiled on Jul 8 2025 03:25:08 (with assert checking)",,,,,,0,,"postmaster.c",3566, 2025-07-25 00:54:13.602863 EDT,,,p268910,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","ic-proxy: server setting up",,,,,,,0,,"ic_proxy_main.c",531, 2025-07-25 00:54:13.603244 EDT,,,p268910,th1103079616,,,,0,,,seg7,,,,,"LOG","00000","ic-proxy: server running",,,,,,,0,,"ic_proxy_main.c",572, 2025-07-25 00:54:59.968285 EDT,"gpadmin","postgres",p268993,th1103079616,"168.160.18.164","50728",2025-07-25 00:54:59 EDT,0,con3,,seg7,,,,sx1,"LOG","00000","statement: select gid from pg_prepared_xacts where prepared < now() - interval'120 seconds'",,,,,,,0,,"postgres.c",1713,

cleverxiao001 avatar Jul 25 '25 04:07 cleverxiao001

It seems that the error is related to your data. It's abnormal that the detoast occurred in the sort slice.

my-ship-it avatar Jul 25 '25 09:07 my-ship-it

All data is imported by using Spark to call the method: copy tableName from stdin with (format csv, NULL '\N'). I wonder if it has anything to do with the format of the CSV.

cleverxiao001 avatar Jul 28 '25 06:07 cleverxiao001

Maybe this is a layout issue. Are all segments in a machine ? Could you provide all segment system info, moreover system info for compile postgres bin. @cleverxiao001

hw118118 avatar Oct 14 '25 08:10 hw118118

The cluster consists of two servers in total. The segments are evenly distributed across the two servers. All hard drives are installed on one of the servers, managed via LVM and divided into multiple LV. One of these LV is mounted on the other server through NFS

cleverxiao001 avatar Dec 17 '25 09:12 cleverxiao001