Investigate OpenDAL features

Open andygrove opened this issue 3 months ago • 1 comments

What is the problem the feature request solves?

We would like to learn more about OpenDAL for supporting HDFS, S3, and other file stores.

Some questions to answer:

  • Do we still need to pursue the fs-hdfs approach, or can we replace it with OpenDAL?
  • How does OpenDAL support S3? Does it use a Rust-native solution built on the AWS Rust SDK, or does it take the same approach as its HDFS support and use hadoop-aws in the JVM?
  • Can OpenDAL support all of our requirements around custom authentication?
  • Does OpenDAL perform well?

Describe the potential solution

No response

Additional context

No response

andygrove avatar Sep 10 '25 14:09 andygrove

I started learning about OpenDAL. Some initial notes:

  • There is an object_store_opendal crate that wraps OpenDAL in the object-store-rs API, making it easy to integrate into Comet/DataFusion (we already support OpenDAL HDFS this way)
  • OpenDAL's S3 support is native Rust code and does not use the AWS Rust SDK

andygrove avatar Oct 06 '25 22:10 andygrove

Thank you for starting this!

Do we still need to pursue the fs-hdfs approach or can we replace it with OpenDAL

I 100% support replacing with OpenDAL

How does OpenDAL support s3?

Yes

Does it use a Rust-native solution using the AWS Rust SDK, or does it use the same approach as HDFS support where it uses hadoop-aws in JVM?

We built it from scratch by talking directly to the HTTP API. To ensure its correctness, we test against different kinds of S3 services.

Can OpenDAL support all of our requirements around custom authentication?

Yes. OpenDAL is powered by reqsign and allows users to implement their own authentication.

Does OpenDAL perform well?

Yes. It has already been adopted by many databases.

It's built on zero-cost principles, so users don't pay for features they don't need. OpenDAL also provides a native concurrent read/write API, for example:

// Download a file using 16 concurrent requests with 4 MiB chunks.
let data = op.read_with(path).chunk(4 * 1024 * 1024).concurrent(16).await?;

// Upload a file using 4 concurrent requests with 16 MiB chunks.
let _ = op.write_with(path, data).chunk(16 * 1024 * 1024).concurrent(4).await?;
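
To make the `chunk` and `concurrent` knobs above more concrete, here is a minimal std-only sketch of what a chunked, bounded-concurrency read amounts to conceptually. It is not OpenDAL's implementation: `chunk_ranges`, `read_chunked`, and the `fetch` callback (standing in for one ranged request) are hypothetical names introduced for illustration, and the sketch uses simple batching rather than whatever scheduling OpenDAL does internally.

```rust
use std::ops::Range;
use std::thread;

/// Splits `0..total_len` into consecutive ranges of at most `chunk_size` bytes.
/// (Hypothetical helper, not part of OpenDAL.)
pub fn chunk_ranges(total_len: u64, chunk_size: u64) -> Vec<Range<u64>> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < total_len {
        let end = (start + chunk_size).min(total_len);
        ranges.push(start..end);
        start = end;
    }
    ranges
}

/// Reads `total_len` bytes by issuing ranged fetches, with at most
/// `concurrent` fetches in flight at a time, and returns the bytes in order.
/// `fetch` is an assumed stand-in for a single ranged request.
pub fn read_chunked<F>(total_len: u64, chunk_size: u64, concurrent: usize, fetch: F) -> Vec<u8>
where
    F: Fn(Range<u64>) -> Vec<u8> + Sync,
{
    assert!(concurrent > 0, "concurrent must be non-zero");
    let ranges = chunk_ranges(total_len, chunk_size);
    let mut out = Vec::with_capacity(total_len as usize);

    // Process the ranges in batches of `concurrent`; each batch runs in parallel.
    for batch in ranges.chunks(concurrent) {
        let parts: Vec<Vec<u8>> = thread::scope(|s| {
            // Spawn every fetch in the batch before joining any of them.
            let handles: Vec<_> = batch
                .iter()
                .map(|r| {
                    let r = r.clone();
                    let fetch = &fetch;
                    s.spawn(move || fetch(r))
                })
                .collect();
            // Join in spawn order so the chunks stay in byte order.
            handles
                .into_iter()
                .map(|h| h.join().expect("fetch thread panicked"))
                .collect()
        });
        for part in parts {
            out.extend_from_slice(&part);
        }
    }
    out
}
```

With these definitions, a 10-byte object read in 4-byte chunks is split into the ranges `0..4`, `4..8`, and `8..10`. Larger chunks mean fewer requests, while higher concurrency trades memory and connections for throughput, which is why the two settings are usually tuned together.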

Xuanwo avatar Dec 08 '25 08:12 Xuanwo

Does it use a Rust-native solution using the AWS Rust SDK, or does it use the same approach as HDFS support where it uses hadoop-aws in JVM?

We built it from scratch by talking directly to the HTTP API. To ensure its correctness, we test against different kinds of S3 services.

We'll probably need a JNI layer to be able to support custom authentication providers written in Java.
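
One way to picture such a layer is as a small trait on the native side that any credential source can implement, with a cache in front so the JVM boundary is not crossed on every request. The sketch below is std-only and purely illustrative: `Credentials`, `CredentialProvider`, `CallbackProvider`, and `CachingProvider` are hypothetical names, not reqsign or OpenDAL APIs, and the actual JNI call is replaced by a plain closure.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical credential type; field names are illustrative only.
#[derive(Clone, Debug, PartialEq)]
pub struct Credentials {
    pub access_key: String,
    pub secret_key: String,
    pub session_token: Option<String>,
}

/// Hypothetical seam between native code and an authentication source.
pub trait CredentialProvider: Send + Sync {
    fn load(&self) -> Result<Credentials, String>;
}

/// Stand-in for a JVM-backed provider: in a real JNI layer the closure body
/// would call into the Java authentication provider (not shown here).
pub struct CallbackProvider<F>(pub F);

impl<F> CredentialProvider for CallbackProvider<F>
where
    F: Fn() -> Result<Credentials, String> + Send + Sync,
{
    fn load(&self) -> Result<Credentials, String> {
        (self.0)()
    }
}

/// Caches credentials for `ttl` so the potentially expensive inner call
/// is not repeated for every request.
pub struct CachingProvider<P> {
    inner: P,
    ttl: Duration,
    cached: Mutex<Option<(Instant, Credentials)>>,
}

impl<P: CredentialProvider> CachingProvider<P> {
    pub fn new(inner: P, ttl: Duration) -> Self {
        Self { inner, ttl, cached: Mutex::new(None) }
    }
}

impl<P: CredentialProvider> CredentialProvider for CachingProvider<P> {
    fn load(&self) -> Result<Credentials, String> {
        let mut slot = self.cached.lock().map_err(|e| e.to_string())?;
        // Serve from the cache while the entry is younger than `ttl`.
        if let Some((fetched_at, creds)) = slot.as_ref() {
            if fetched_at.elapsed() < self.ttl {
                return Ok(creds.clone());
            }
        }
        // Otherwise ask the inner provider and remember the result.
        let fresh = self.inner.load()?;
        *slot = Some((Instant::now(), fresh.clone()));
        Ok(fresh)
    }
}
```

Whether a fixed TTL is appropriate depends on the Java provider; a real integration would more likely honor the expiry time the provider reports, which this sketch does not model.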

parthchandra avatar Dec 08 '25 17:12 parthchandra

Thanks @Xuanwo. I have tested OpenDAL with a local 3-node distributed HDFS cluster, and it worked even when rerouting datanode addresses.

We are also planning to run a series of tests on a remote cluster. Do you by any chance have OpenDAL performance metrics documented?

comphead avatar Dec 09 '25 21:12 comphead

Checked a remote HDFS cluster with Kerberos (KRB) protection; it worked fine.

comphead avatar Dec 11 '25 23:12 comphead

Hi! I'm also supportive of OpenDAL, which we use a lot to access Hugging Face Datasets. It would be exciting to have Comet able to read from and write to HF Datasets.

lhoestq avatar Dec 16 '25 18:12 lhoestq