[Bug] long time sql error occurred 'interconnect may encountered a network error, please check your network' when all segments are working normal
Cloudberry Database version
1.6.0
What happened
Why the same server and database configuration, the same table, the same sql, with all segments are working normal,gpdb1.6 execution for a period of time error occurred, greenplum7 will always execute the result with no error. This sql is the result((2301rows)) of union all three tables ,then left join another table (22917310rows 12G).
What you think should happen instead
No response
How to reproduce
May some timeout parameter? but i'm not find .
Operating System
Red Hat Enterprise Linux release 8.9 (Ootpa)
Anything else
No response
Are you willing to submit PR?
- [X] Yes, I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct.
After set enable_parallel= off,this error not occurred; May the parameter enable_parallel = on have some problem?
If enable_parallel, it means more processes will be created, which would cause more connections and consume more resources.
But it's a test environment ,only 1 test connect . connections and resqueue is enough ,see pictures below .
you can provide info by command: ulimit -a
@avamingli Could help verify the issue. IIUC, it's not libpq connection but network connections of OS, more processes could consume more interconnect connections. You could set lower parallel degree by max_parallel_workers_per_gather.
@avamingli Could help verify the issue. IIUC, it's not libpq connection but network connections of OS, more processes could consume more interconnect connections. You could set lower parallel degree by max_parallel_workers_per_gather.
Yes, The connection num is decided by slice and gang and parallel workers.
@fsmiledream Could you provide the reproduce SQL and plan?
max_parallel_workers_per_gather=64 in my environment.
The sql :
SELECT A.serial_no, a.visit_start_time, A.visit_end_time, bpi.* FROM (SELECT ov.org_code, ov.person_info_id, ov.serial_no, '1' AS visit_type_code, ov.visit_start_time, ov.visit_end_time
FROM visit ov
WHERE ov.update_datetime BETWEEN '2024-09-13 13:43:33.180' AND '2024-09-14 13:43:33.180'
UNION ALL SELECT ir.org_code, ir.person_info_id, ir.serial_no, '2' AS visit_type_code, adm_time as visit_start_time, discharge_time as visit_end_time
FROM record ir
WHERE ir.update_datetime BETWEEN '2024-09-13 13:43:33.180' AND '2024-09-14 13:43:33.180'
UNION ALL
SELECT per.org_code, per.person_info_id, per.serial_no, '3' AS visit_type_code, start_time as visit_start_time, end_time as visit_end_time FROM phy_exam_register per WHERE per.update_datetime BETWEEN '2024-09-13 13:43:33.180' AND '2024-09-14 13:43:33.180' ) A
left JOIN basic_info bpi ON A.org_code = bpi.org_code AND A.person_info_id = bpi.person_info_id AND A.visit_type_code = bpi.visit_type_code;
max_parallel_workers_per_gather=64 too large , 1~8 is recommended.
max_parallel_workers_per_gather=64 is also in greenplum .Because the server configuration is enough. There are four physical machines, each with 256GB of memory and 64-core cpu,and ssd disks.
max_parallel_workers_per_gather=64 is also in greenplum .
Greenplum doesn't have parallel plan feature, it's a dead GUC for GPDB. It doesn't matter what you set it in GPDB.
CBDB parallel plan provide better performance for query on large tables/joins, and of course, it consumes more resources.
Running non-parallel plan in GPDB, it doesn't report warnings, if you run non-parallel CBDB plan, it doesn't neither.
As said:
max_parallel_workers_per_gather=64 is too large, in production, we don't recommend it bigger than 8.
We follow Postgres, the default value is 2, considering slices and motions across slices, there could be a large number of connections for your query.
Also, I see your config max_connections = 800, it surely bad thing will happen if there are really so many connections of long query in MPP.
No updates coming after suggestion given. Thanks all for helping on it.