[Bug] Interconnect High Retry Counts During TPC-DS 100GB Execution with 4 Concurrent Users

Open congxuebin opened this issue 8 months ago • 2 comments

Apache Cloudberry version

main branch, commit:1cbab9b3

What happened

psql:q67.sql:46: WARNING: interconnect may encountered a network error, please check your network (seg72 slice6 10.13.8.103:9000 pid=2229356)
DETAIL: Failing to send packet (seq 1) to 10.13.8.102:55905 (pid 1668328 cid 70) after 100 retries.
psql:q67.sql:46: WARNING: interconnect may encountered a network error, please check your network (seg74 slice6 10.13.8.103:9002 pid=2229359)
DETAIL: Failing to send packet (seq 1) to 10.13.8.100:24440 (pid 3218409 cid 17) after 100 retries.
psql:q67.sql:46: WARNING: interconnect may encountered a network error, please check your network (seg73 slice7 10.13.8.103:9001 pid=2229548)
DETAIL: Failing to send packet (seq 1) to 10.13.8.102:24287 (pid 1668320 cid 62) after 100 retries.
psql:q67.sql:46: WARNING: interconnect may encountered a network error, please check your network (seg64 slice5 10.13.8.102:9016 pid=1668898)
DETAIL: Failing to send packet (seq 1) to 10.13.8.101:17582 (pid 1698993 cid 28) after 100 retries.
psql:q67.sql:46: WARNING: interconnect may encountered a network error, please check your network (seg31 slice4 10.13.8.101:9007 pid=1699404)
DETAIL: Failing to send packet (seq 1) to 10.13.8.102:62775 (pid 1668295 cid 50) after 100 retries.
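
For triage, it can help to tally these warnings by sending segment and by destination host, to see whether the retries cluster on one node. The following is a minimal sketch, not part of the attached scripts: the regular expression is derived only from the warning text quoted above and will not match other message variants, and the names `parse_warnings` and `count_by` are hypothetical helpers.

```python
import re
from collections import Counter

# Pattern derived from the warning text quoted in this issue.
# Assumption: other interconnect messages may use a different layout.
WARNING_RE = re.compile(
    r"\(seg(?P<seg>\d+) slice(?P<slice>\d+) "
    r"(?P<src_host>[\d.]+):(?P<src_port>\d+) pid=(?P<src_pid>\d+)\)"
    r"\s*DETAIL:\s+Failing to send packet \(seq (?P<seq>\d+)\) to "
    r"(?P<dst_host>[\d.]+):(?P<dst_port>\d+) "
    r"\(pid (?P<dst_pid>\d+) cid (?P<cid>\d+)\) after (?P<retries>\d+) retries"
)


def parse_warnings(text):
    """Return one dict of string fields per retry warning found in text."""
    return [m.groupdict() for m in WARNING_RE.finditer(text)]


def count_by(records, key):
    """Count parsed warnings grouped by one field, e.g. 'dst_host'."""
    return Counter(r[key] for r in records)


# Two of the warnings from this report, used as sample input.
SAMPLE = (
    "psql:q67.sql:46: WARNING: interconnect may encountered a network error, "
    "please check your network (seg72 slice6 10.13.8.103:9000 pid=2229356)\n"
    "DETAIL: Failing to send packet (seq 1) to 10.13.8.102:55905 "
    "(pid 1668328 cid 70) after 100 retries.\n"
    "psql:q67.sql:46: WARNING: interconnect may encountered a network error, "
    "please check your network (seg74 slice6 10.13.8.103:9002 pid=2229359)\n"
    "DETAIL: Failing to send packet (seq 1) to 10.13.8.100:24440 "
    "(pid 3218409 cid 17) after 100 retries.\n"
)

records = parse_warnings(SAMPLE)
by_destination = count_by(records, "dst_host")
by_source = count_by(records, "src_host")
```

Feeding the full psql output through `parse_warnings` and grouping on `dst_host` or `src_host` gives a quick view of which hosts are involved most often.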

What you think should happen instead

No response

How to reproduce

Generate 100GB of data by setting GEN_DATA_SCALE="100" in the TPC-DS variables.

Download the attached scripts (test.zip). Ensure that the role and the schema in the search_path set in q67.sql match the settings used when generating the TPC-DS data.

Run sh test.sh to initiate the test.
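
The contents of test.sh are not reproduced in this thread. As a rough stand-in, the sketch below shows one way to run q67.sql from four concurrent psql sessions, matching the "4 concurrent users" in the title. Everything here is an assumption rather than the attached script: the function names are hypothetical, the database name is a placeholder, and connection settings are expected to come from the usual `PG*` environment variables.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_psql_command(database, sql_file):
    """Build a psql invocation; -X skips psqlrc, -f runs the SQL file."""
    return ["psql", "-X", "-d", database, "-f", sql_file]


def run_concurrently(database, sql_file, users=4):
    """Run the same SQL file from `users` parallel psql sessions.

    Returns a list of (user_id, returncode, stderr) tuples. psql prints
    WARNING messages to stderr, so the retry warnings end up there.
    """
    cmd = build_psql_command(database, sql_file)

    def run_one(user_id):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        return user_id, proc.returncode, proc.stderr

    with ThreadPoolExecutor(max_workers=users) as pool:
        return list(pool.map(run_one, range(users)))
```

For example, `run_concurrently("tpcds", "q67.sql")` would start four sessions against a database named `tpcds` (a placeholder name) and collect each session's stderr for inspection.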

Operating System

Oracle Linux 9.5

Anything else

No response

Are you willing to submit PR?

  • [ ] Yes, I am willing to submit a PR!

congxuebin avatar Apr 25 '25 07:04 congxuebin

Please collect diagnostic information when the error occurs.

collect.txt

Rename it to collect.sh and verify the PRIMARY_INTERFACE variable before running it.

yjhjstz avatar Apr 28 '25 15:04 yjhjstz

@jiaqizho can you take a look at the Q67 ORCA plan? It has up to 15 slices.

yjhjstz avatar Apr 28 '25 15:04 yjhjstz

@oracleloyall Hi Xi, this is the issue about Interconnect UDP flow control not working in certain cases.

congxuebin avatar May 23 '25 06:05 congxuebin

> @jiaqizho can you take a look at the Q67 ORCA plan? It has up to 15 slices.

Is there any context on this topic?

jiaqizho avatar Jun 19 '25 02:06 jiaqizho

@jiaqizho This issue relates to Interconnect UDP flow control, which is not functioning properly in certain cases. Xi has been working on resolving this problem.

congxuebin avatar Jun 20 '25 12:06 congxuebin

> @jiaqizho This issue relates to Interconnect UDP flow control, which is not functioning properly in certain cases. Xi has been working on resolving this problem.

Hi @congxuebin, is there a related issue or PR? I might be interested in this as well, because we also had a bunch of problems with IC retries in GP6.

Smyatkin-Maxim avatar Jun 23 '25 08:06 Smyatkin-Maxim

@Smyatkin-Maxim I believe there is none at the moment. @oracleloyall is the developer working on resolving this issue.

congxuebin avatar Jun 24 '25 03:06 congxuebin