cloudberry [Bug] long time sql error occurred 'interconnect may encountered a network error, please check your network' when all segments are working normal

Cloudberry Database version

1.6.0

What happened

Why the same server and database configuration, the same table, the same sql, with all segments are working normal,gpdb1.6 execution for a period of time error occurred, greenplum7 will always execute the result with no error. This sql is the result((2301rows)) of union all three tables ,then left join another table (22917310rows 12G).

What you think should happen instead

No response

How to reproduce

May some timeout parameter? but i'm not find .

Operating System

Red Hat Enterprise Linux release 8.9 (Ootpa)

Anything else

No response

Are you willing to submit PR?

[X] Yes, I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct.

Sep 20 '24 14:09 fsmiledream

After set enable_parallel= off,this error not occurred; May the parameter enable_parallel = on have some problem?

Sep 23 '24 06:09 fsmiledream

If enable_parallel, it means more processes will be created, which would cause more connections and consume more resources.

Sep 23 '24 07:09 my-ship-it

But it's a test environment ,only 1 test connect . connections and resqueue is enough ,see pictures below . cbdb-queque cbdb-connect

you can provide info by command: ulimit -a

Sep 23 '24 09:09 fsmiledream

@avamingli Could help verify the issue. IIUC, it's not libpq connection but network connections of OS, more processes could consume more interconnect connections. You could set lower parallel degree by max_parallel_workers_per_gather.

Sep 25 '24 01:09 my-ship-it

@avamingli Could help verify the issue. IIUC, it's not libpq connection but network connections of OS, more processes could consume more interconnect connections. You could set lower parallel degree by max_parallel_workers_per_gather.

Yes, The connection num is decided by slice and gang and parallel workers.

@fsmiledream Could you provide the reproduce SQL and plan?

Sep 25 '24 02:09 avamingli

max_parallel_workers_per_gather=64 in my environment. The sql : SELECT A.serial_no, a.visit_start_time, A.visit_end_time, bpi.* FROM (SELECT ov.org_code, ov.person_info_id, ov.serial_no, '1' AS visit_type_code, ov.visit_start_time, ov.visit_end_time
FROM visit ov
WHERE ov.update_datetime BETWEEN '2024-09-13 13:43:33.180' AND '2024-09-14 13:43:33.180'
UNION ALL SELECT ir.org_code, ir.person_info_id, ir.serial_no, '2' AS visit_type_code, adm_time as visit_start_time, discharge_time as visit_end_time
FROM record ir
WHERE ir.update_datetime BETWEEN '2024-09-13 13:43:33.180' AND '2024-09-14 13:43:33.180'
UNION ALL
SELECT per.org_code, per.person_info_id, per.serial_no, '3' AS visit_type_code, start_time as visit_start_time, end_time as visit_end_time FROM phy_exam_register per WHERE per.update_datetime BETWEEN '2024-09-13 13:43:33.180' AND '2024-09-14 13:43:33.180' ) A
left JOIN basic_info bpi ON A.org_code = bpi.org_code AND A.person_info_id = bpi.person_info_id AND A.visit_type_code = bpi.visit_type_code;

Sep 27 '24 07:09 fsmiledream

max_parallel_workers_per_gather=64 too large , 1~8 is recommended.

Sep 27 '24 07:09 yjhjstz

max_parallel_workers_per_gather=64 is also in greenplum .Because the server configuration is enough. There are four physical machines, each with 256GB of memory and 64-core cpu,and ssd disks.

Sep 27 '24 08:09 fsmiledream

max_parallel_workers_per_gather=64 is also in greenplum .

Greenplum doesn't have parallel plan feature, it's a dead GUC for GPDB. It doesn't matter what you set it in GPDB.

CBDB parallel plan provide better performance for query on large tables/joins, and of course, it consumes more resources.

Running non-parallel plan in GPDB, it doesn't report warnings, if you run non-parallel CBDB plan, it doesn't neither.

As said: max_parallel_workers_per_gather=64 is too large, in production, we don't recommend it bigger than 8. We follow Postgres, the default value is 2, considering slices and motions across slices, there could be a large number of connections for your query. Also, I see your config max_connections = 800, it surely bad thing will happen if there are really so many connections of long query in MPP.

Sep 27 '24 09:09 avamingli

No updates coming after suggestion given. Thanks all for helping on it.

Oct 09 '24 11:10 my-ship-it