cloudberry icon indicating copy to clipboard operation
cloudberry copied to clipboard

[Bug] long time sql error occurred 'interconnect may encountered a network error, please check your network' when all segments are working normal

Open fsmiledream opened this issue 1 year ago • 9 comments

Cloudberry Database version

1.6.0

What happened

Why the same server and database configuration, the same table, the same sql, with all segments are working normal,gpdb1.6 execution for a period of time error occurred, greenplum7 will always execute the result with no error. This sql is the result((2301rows)) of union all three tables ,then left join another table (22917310rows 12G).

cbdb-sqlresult gp7-sqlresult cbdb-segment-allwork

What you think should happen instead

No response

How to reproduce

May some timeout parameter? but i'm not find .

Operating System

Red Hat Enterprise Linux release 8.9 (Ootpa)

Anything else

No response

Are you willing to submit PR?

  • [X] Yes, I am willing to submit a PR!

Code of Conduct

fsmiledream avatar Sep 20 '24 14:09 fsmiledream

After set enable_parallel= off,this error not occurred; May the parameter enable_parallel = on have some problem?

fsmiledream avatar Sep 23 '24 06:09 fsmiledream

If enable_parallel, it means more processes will be created, which would cause more connections and consume more resources.

my-ship-it avatar Sep 23 '24 07:09 my-ship-it

But it's a test environment ,only 1 test connect . connections and resqueue is enough ,see pictures below . cbdb-queque cbdb-connect

you can provide info by command: ulimit -a

fsmiledream avatar Sep 23 '24 09:09 fsmiledream

@avamingli Could help verify the issue. IIUC, it's not libpq connection but network connections of OS, more processes could consume more interconnect connections. You could set lower parallel degree by max_parallel_workers_per_gather.

my-ship-it avatar Sep 25 '24 01:09 my-ship-it

@avamingli Could help verify the issue. IIUC, it's not libpq connection but network connections of OS, more processes could consume more interconnect connections. You could set lower parallel degree by max_parallel_workers_per_gather.

Yes, The connection num is decided by slice and gang and parallel workers.

@fsmiledream Could you provide the reproduce SQL and plan?

avamingli avatar Sep 25 '24 02:09 avamingli

max_parallel_workers_per_gather=64 in my environment. The sql : SELECT A.serial_no, a.visit_start_time, A.visit_end_time, bpi.* FROM (SELECT ov.org_code, ov.person_info_id, ov.serial_no, '1' AS visit_type_code, ov.visit_start_time, ov.visit_end_time
FROM visit ov
WHERE ov.update_datetime BETWEEN '2024-09-13 13:43:33.180' AND '2024-09-14 13:43:33.180'
UNION ALL SELECT ir.org_code, ir.person_info_id, ir.serial_no, '2' AS visit_type_code, adm_time as visit_start_time, discharge_time as visit_end_time
FROM record ir
WHERE ir.update_datetime BETWEEN '2024-09-13 13:43:33.180' AND '2024-09-14 13:43:33.180'
UNION ALL
SELECT per.org_code, per.person_info_id, per.serial_no, '3' AS visit_type_code, start_time as visit_start_time, end_time as visit_end_time FROM phy_exam_register per WHERE per.update_datetime BETWEEN '2024-09-13 13:43:33.180' AND '2024-09-14 13:43:33.180' ) A
left JOIN basic_info bpi ON A.org_code = bpi.org_code AND A.person_info_id = bpi.person_info_id AND A.visit_type_code = bpi.visit_type_code;

fsmiledream avatar Sep 27 '24 07:09 fsmiledream

max_parallel_workers_per_gather=64 too large , 1~8 is recommended.

yjhjstz avatar Sep 27 '24 07:09 yjhjstz

max_parallel_workers_per_gather=64 is also in greenplum .Because the server configuration is enough. There are four physical machines, each with 256GB of memory and 64-core cpu,and ssd disks.

fsmiledream avatar Sep 27 '24 08:09 fsmiledream

max_parallel_workers_per_gather=64 is also in greenplum .

Greenplum doesn't have parallel plan feature, it's a dead GUC for GPDB. It doesn't matter what you set it in GPDB.

CBDB parallel plan provide better performance for query on large tables/joins, and of course, it consumes more resources.

Running non-parallel plan in GPDB, it doesn't report warnings, if you run non-parallel CBDB plan, it doesn't neither.

As said: max_parallel_workers_per_gather=64 is too large, in production, we don't recommend it bigger than 8. We follow Postgres, the default value is 2, considering slices and motions across slices, there could be a large number of connections for your query. Also, I see your config max_connections = 800, it surely bad thing will happen if there are really so many connections of long query in MPP.

avamingli avatar Sep 27 '24 09:09 avamingli

No updates coming after suggestion given. Thanks all for helping on it.

my-ship-it avatar Oct 09 '24 11:10 my-ship-it