[Bug] Interconnect High Retry Counts During TPC-DS 100GB Execution with 4 Concurrent Users
Apache Cloudberry version
main branch, commit:1cbab9b3
What happened
psql:q67.sql:46: WARNING: interconnect may encountered a network error, please check your network (seg72 slice6 10.13.8.103:9000 pid=2229356)
DETAIL: Failing to send packet (seq 1) to 10.13.8.102:55905 (pid 1668328 cid 70) after 100 retries.
psql:q67.sql:46: WARNING: interconnect may encountered a network error, please check your network (seg74 slice6 10.13.8.103:9002 pid=2229359)
DETAIL: Failing to send packet (seq 1) to 10.13.8.100:24440 (pid 3218409 cid 17) after 100 retries.
psql:q67.sql:46: WARNING: interconnect may encountered a network error, please check your network (seg73 slice7 10.13.8.103:9001 pid=2229548)
DETAIL: Failing to send packet (seq 1) to 10.13.8.102:24287 (pid 1668320 cid 62) after 100 retries.
psql:q67.sql:46: WARNING: interconnect may encountered a network error, please check your network (seg64 slice5 10.13.8.102:9016 pid=1668898)
DETAIL: Failing to send packet (seq 1) to 10.13.8.101:17582 (pid 1698993 cid 28) after 100 retries.
psql:q67.sql:46: WARNING: interconnect may encountered a network error, please check your network (seg31 slice4 10.13.8.101:9007 pid=1699404)
DETAIL: Failing to send packet (seq 1) to 10.13.8.102:62775 (pid 1668295 cid 50) after 100 retries.
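To see which segments and host pairs are hit, warnings like the ones above can be tallied from the saved psql output. The following is only a minimal sketch: it assumes the message layout stays exactly as in this report, and the names `parse_retry_warnings` and `count_by_host_pair` are hypothetical helpers, not part of Cloudberry.

```python
import re
from collections import Counter

# Pattern for the WARNING/DETAIL pair as it appears in this report.
# Assumption: the message layout is stable; adjust if your log differs.
WARNING_RE = re.compile(
    r"\((?P<seg>seg\d+) (?P<slice>slice\d+) "
    r"(?P<src_ip>[\d.]+):(?P<src_port>\d+) pid=(?P<src_pid>\d+)\)\s+"
    r"DETAIL:\s+Failing to send packet \(seq (?P<seq>\d+)\) "
    r"to (?P<dst_ip>[\d.]+):(?P<dst_port>\d+) "
    r"\(pid (?P<dst_pid>\d+) cid (?P<cid>\d+)\) after (?P<retries>\d+) retries"
)

# Two entries copied from the report, used as sample input.
SAMPLE_LOG = (
    "psql:q67.sql:46: WARNING: interconnect may encountered a network error, "
    "please check your network (seg72 slice6 10.13.8.103:9000 pid=2229356) "
    "DETAIL: Failing to send packet (seq 1) to 10.13.8.102:55905 "
    "(pid 1668328 cid 70) after 100 retries.\n"
    "psql:q67.sql:46: WARNING: interconnect may encountered a network error, "
    "please check your network (seg64 slice5 10.13.8.102:9016 pid=1668898) "
    "DETAIL: Failing to send packet (seq 1) to 10.13.8.101:17582 "
    "(pid 1698993 cid 28) after 100 retries.\n"
)


def parse_retry_warnings(text):
    """Return one dict per interconnect retry warning found in text."""
    records = []
    for match in WARNING_RE.finditer(text):
        record = match.groupdict()
        record["retries"] = int(record["retries"])
        records.append(record)
    return records


def count_by_host_pair(records):
    """Count warnings per (sender IP, receiver IP) pair."""
    return Counter((r["src_ip"], r["dst_ip"]) for r in records)


if __name__ == "__main__":
    parsed = parse_retry_warnings(SAMPLE_LOG)
    for (src, dst), n in count_by_host_pair(parsed).items():
        print(f"{src} -> {dst}: {n} warning(s)")
```

Grouping by host pair rather than by segment is a deliberate choice here: it makes it easier to tell whether the retries cluster on one link or are spread across the cluster.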
What you think should happen instead
No response
How to reproduce
1. Generate 100GB of data by setting GEN_DATA_SCALE="100" in the TPC-DS variables.
2. Download the scripts provided below. Ensure that the role and the schema in the search_path in q67.sql match the settings used when generating the TPC-DS data. test.zip
3. Run sh test.sh to initiate the test.
Operating System
Oracle Linux 9.5
Anything else
No response
Are you willing to submit PR?
- [ ] Yes, I am willing to submit a PR!
Code of Conduct
- [x] I agree to follow this project's Code of Conduct.
Please collect diagnostic info when the error occurs.
Rename it to collect.sh and verify the PRIMARY_INTERFACE variable.
@jiaqizho can you take a look at the Q67 ORCA plan? It has up to 15 slices.
@oracleloyall Hi Xi, this is the issue about Interconnect UDP flow control not working in certain cases.
@jiaqizho This issue relates to Interconnect UDP flow control, which is not functioning properly in certain cases. Xi has been working on resolving this problem.
Hi @congxuebin , is there a related issue/PR? I might be interested in this as well, cuz we also had a bunch of problems with IC retries in GP6.
@Smyatkin-Maxim I believe there is none at the moment. @oracleloyall is the developer working on resolving this issue.