Improvement: reuse the tcp connection when plan files
Considering I am trying to read an iceberg table from S3. Currently, plan_files() seems unable to reuse the TCP connection for HTTP requests. It will lead to a relatively high latency. I am not sure whether it is a good practice to enable Connection: keep-alive by default, what do you think?
Considering I am trying to read an iceberg table from S3. Currently,
plan_files()seems unable to reuse the TCP connection for HTTP requests. It will lead to a relatively high latency. I am not sure whether it is a good practice to enableConnection: keep-aliveby default, what do you think?
I think the buffering is handles by opendal? cc @Xuanwo
Hi, OpenDAL handles those connection-related tasks (by reqwest).
Currently, FileIO builds new operators every time:
https://github.com/apache/iceberg-rust/blob/4083f8109b1059c6e3cb7df7ae50a2c8cdc8a613/crates/iceberg/src/io/storage.rs#L104-L119
I think we can improve this by using the same HTTP client instead.
Great, Looking forward to it!
I have tested duckdb and clickhouse to scan an iceberg table in MinIO as well. They can reuse a TCP connection to send HTTP requests. BTW, clickhouse can reuse connections aggressively.
DuckDB:
Clickhouse:
However, risingwave the database use iceberg-rust seems can't reuse any TCP connection. You can see that every HTTP request is in its own TCP connection.
Thanks for the sharing. I believe this can be addressed by reusing the same http client.
We will need https://github.com/apache/opendal/issues/4967 for this. I'm working on it now.
We will need apache/opendal#4967 for this. I'm working on it now.
Which opendal's version do we need to bump into?
Which opendal's version do we need to bump into?
I'm guessing it will be included in our next release 0.49.
I have already verified that using a global HTTP client works in this issue and the performance improvement is impressive (about 10 times faster: from 1500+ms to 150+ms). https://github.com/risingwavelabs/iceberg-rust/blob/24bd5869f2779a8b9786b5a6e1f9723844f5a82c/crates/iceberg/src/io.rs#L438-L440
Also, cc @sdd, who is focusing on the iceberg benchmark now.
I also have some local code that reuses the same OpenDAL operator rather than creating a new one each time. I'd not submitted it yet as I wasn't sure of the validity of doing that in every possible scenario but it has been working well for me
I also have some local code that reuses the same OpenDAL operator rather than creating a new one each time. I'd not submitted it yet as I wasn't sure of the validity of doing that in every possible scenario but it has been working well for me
Great! I will address https://github.com/apache/opendal/issues/4967 first, then consider the best approach for iceberg-rust.