Investigate OpenDAL features
What is the problem the feature request solves?
We would like to learn more about OpenDAL for supporting HDFS, S3, and other file stores.
Some questions to answer:
- Do we still need to pursue the fs-hdfs approach, or can we replace it with OpenDAL?
- How does OpenDAL support S3? Does it use a Rust-native solution using the AWS Rust SDK, or does it use the same approach as HDFS support, where it uses hadoop-aws in the JVM?
- Can OpenDAL support all of our requirements around custom authentication?
- Does OpenDAL perform well?
Describe the potential solution
No response
Additional context
No response
I started learning about OpenDAL. Some initial notes:
- There is an object_store_opendal crate that wraps OpenDAL in the object_store API, making it easy to integrate into Comet/DataFusion (we already support OpenDAL HDFS this way); see the sketch after this list
- OpenDAL's S3 support is native Rust code and does not use the AWS Rust SDK
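To make the first note concrete, here is a minimal sketch of wrapping an OpenDAL S3 operator in the object_store API via object_store_opendal. The bucket and region are hypothetical placeholders, and exact builder signatures vary across opendal versions:

```rust
use object_store_opendal::OpendalStore;
use opendal::{services::S3, Operator};

// Build an OpenDAL S3 operator and wrap it so it can be handed to anything
// that expects an object_store::ObjectStore (e.g. DataFusion's runtime).
// Bucket and region are hypothetical placeholders.
fn build_store() -> opendal::Result<OpendalStore> {
    let builder = S3::default()
        .bucket("example-bucket")
        .region("us-east-1");
    let op = Operator::new(builder)?.finish();
    Ok(OpendalStore::new(op))
}
```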
Thank you for starting this!
Do we still need to pursue the fs-hdfs approach or can we replace it with OpenDAL?
I 100% support replacing it with OpenDAL.
How does OpenDAL support S3? Does it use a Rust-native solution using the AWS Rust SDK, or does it use the same approach as HDFS support, where it uses hadoop-aws in the JVM?
We built it by talking directly to the S3 HTTP API from scratch, without the AWS Rust SDK. To ensure its correctness, we test against different kinds of S3 services.
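One practical consequence of speaking the S3 HTTP protocol directly is that pointing OpenDAL at any S3-compatible service is just configuration. A minimal sketch, with a hypothetical local MinIO endpoint and credentials:

```rust
use opendal::{services::S3, Operator};

// Target an S3-compatible service by overriding the endpoint; the bucket,
// endpoint, and credentials below are hypothetical local-MinIO values.
let builder = S3::default()
    .bucket("test-bucket")
    .endpoint("http://127.0.0.1:9000")
    .region("us-east-1")
    .access_key_id("minioadmin")
    .secret_access_key("minioadmin");
let op = Operator::new(builder)?.finish();
```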
Can OpenDAL support all of our requirements around custom authentication?
Yes. OpenDAL is powered by reqsign and allows users to implement their own authentication.
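As a rough illustration of what plugging in custom authentication could look like: the loader below is hypothetical, and the reqsign trait shape is an assumption that has shifted between versions, so treat this as a sketch rather than a working recipe:

```rust
use async_trait::async_trait;
use opendal::{services::S3, Operator};
use reqsign::{AwsCredential, AwsCredentialLoad};

// Hypothetical credential loader; in Comet this is where a call into a
// custom authentication provider would go. The trait and struct shapes
// follow reqsign's API as an assumption and may differ across versions.
struct CustomAuthLoader;

#[async_trait]
impl AwsCredentialLoad for CustomAuthLoader {
    async fn load_credential(
        &self,
        _client: reqwest::Client,
    ) -> anyhow::Result<Option<AwsCredential>> {
        Ok(Some(AwsCredential {
            access_key_id: "example-key".into(),
            secret_access_key: "example-secret".into(),
            session_token: None,
            expires_in: None,
        }))
    }
}

let builder = S3::default()
    .bucket("example-bucket")
    .customized_credential_load(Box::new(CustomAuthLoader));
let op = Operator::new(builder)?.finish();
```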
Does OpenDAL perform well?
Yes. It has already been adopted by many databases. It's built on zero-cost principles, so users don't pay extra for things they don't need, and OpenDAL also provides a native concurrent read/write API, for example:
```rust
// Download a file with 16 concurrent requests, using 4 MiB chunks.
let data = op.read_with(path).chunk(4 * 1024 * 1024).concurrent(16).await?;

// Upload a file with 4 concurrent requests, using 16 MiB chunks.
let _ = op.write_with(path, data).chunk(16 * 1024 * 1024).concurrent(4).await?;
```
We'll probably need a JNI layer to be able to support custom authentication providers written in Java.
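For the JNI direction, a rough sketch of the kind of bridge that might be needed; the Java-side provider class and getToken method are hypothetical placeholders, not an existing Comet API:

```rust
use jni::objects::{JObject, JString};
use jni::JNIEnv;

// Hypothetical bridge: invoke a Java-side credential provider's
// getToken() method and bring the result across as a Rust String.
fn fetch_token(env: &mut JNIEnv, provider: &JObject) -> jni::errors::Result<String> {
    let token_obj = env
        .call_method(provider, "getToken", "()Ljava/lang/String;", &[])?
        .l()?;
    let token: String = env.get_string(&JString::from(token_obj))?.into();
    Ok(token)
}
```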
Thanks @Xuanwo. I have tested OpenDAL with a local 3-node distributed HDFS cluster, and it worked even when rerouting datanode addresses.
We are also planning to run a series of tests on a remote cluster. Do you by any chance have OpenDAL performance metrics documented?
Checked a remote HDFS cluster with Kerberos (KRB) protection; it worked fine.
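For reference, the kind of OpenDAL HDFS configuration these tests exercise might look like the following minimal sketch (the NameNode address and root are hypothetical, and the hdfs service needs the corresponding opendal feature enabled):

```rust
use opendal::{services::Hdfs, Operator};

// Connect to an HDFS NameNode; the address and root are hypothetical.
// On a Kerberos-protected cluster, authentication is assumed to come
// from the host's Hadoop/Kerberos client configuration via libhdfs.
let builder = Hdfs::default()
    .name_node("hdfs://namenode:8020")
    .root("/tmp/comet-test");
let op = Operator::new(builder)?.finish();
```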
Hi! I'm also supportive of OpenDAL, which we use a lot to access Hugging Face Datasets; it would be exciting to have Comet able to read/write HF Datasets.