delta-rs
hdfs support
Description
HDFS storage support.
Use Case
A significant portion of companies dealing with big data use HDFS as the backend storage solution of choice for long-term persistent data storage and processing. Having this would be very beneficial for the place I currently work at.
@houqp I did some search on HDFS libraries for Rust and found this crate: https://crates.io/crates/fs-hdfs . But it seems to have a lot of dependencies to run. Do you have any suggestions on this?
@yjshen wrote a wrapper for libhdfs3: https://github.com/datafusion-contrib/datafusion-hdfs-native. This one is a lot leaner since it only has a c++ dependency. Perhaps you can work with him to convert that binding into its own crate? Right now it's coupled with the datafusion hdfs object store implementation.
Good to know that. @yjshen do you have time to convert your work into a new crate? It would be very helpful for other Rust projects too.
@houqp Do you think datafusion-contrib is the right place to hold this hdfs rust repo? or should I make it under my account?
@yjshen up to you, since you are the author :)
Hey @yjshen, any update on this? Currently we are using MinIO on HDFS as a workaround, but that does not seem to be sustainable: https://github.com/minio/minio/issues/13927 . We are all counting on you now :)
@zijie0 let's cooperate here : https://github.com/datafusion-contrib/hdfs-native
Cool! @yjshen
Sorry everyone, I realized that this feature might not be trivial to support. HDFS and the Apache stack can be very complicated to support. Ideally, users connecting to HDFS should use the client version intended by the cluster maintainers. This often means using the version of HDFS provided on the system they are running on. So to support this feature I think the options are as follows:
- Do dynamic linking to existing system libraries.
- Include hdfs dependencies and publish multiple versions of delta-rs based on hdfs version.
- Add dependency to wrapper library like hdfs-native by @yjshen.
But this will open up a whole new can of worms that I don't think is good for any project to experience. Not to mention that all of these approaches will increase the installation complexity for end users, most of whom probably would not be very experienced with this kind of setup.
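To make the dynamic-linking option a little more concrete, here is a minimal sketch of probing for an HDFS client library that is already installed on the machine, instead of bundling one. It is written in Python with `ctypes` purely for illustration and is not how delta-rs does anything today. The function name `load_system_libhdfs` and the candidate library names `hdfs` and `hdfs3` are assumptions; a custom cluster build may install its client under a different name.

```python
import ctypes
import ctypes.util


def load_system_libhdfs(candidates=("hdfs", "hdfs3")):
    """Try to load an HDFS client library that the system already provides.

    `candidates` are short library names (assumed, adjust for your cluster).
    Returns a ctypes.CDLL handle, or None when nothing could be loaded.
    """
    for name in candidates:
        # Resolve a short name such as "hdfs" to a platform-specific file name.
        path = ctypes.util.find_library(name)
        if path is None:
            continue
        try:
            return ctypes.CDLL(path)
        except OSError:
            # Found on disk but not loadable (for example, a missing dependency).
            continue
    return None


lib = load_system_libhdfs()
print("system HDFS client found" if lib else "no system HDFS client found")
```

The point of the sketch is the trade-off described above: the client version always matches whatever the cluster maintainers installed, but the feature silently depends on the machine's setup.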
A workaround is to mount HDFS, access it like a regular filesystem, and let delta-rs read and write the table through that mount. This is just a suggestion, though.
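As a rough illustration of why a mount could work: once HDFS appears as a local directory, a Delta table is just files, and its transaction log can be read with ordinary filesystem calls. The sketch below uses only the Python standard library and a temporary directory standing in for the mount point. The helper names `active_files` and `write_commit` are hypothetical, the actions are stripped down to a `path` field, and checkpoints are ignored, so treat it as a toy and not as a replacement for delta-rs.

```python
import json
import tempfile
from pathlib import Path


def active_files(table_path):
    """Replay the JSON commits in _delta_log and return the live data files."""
    live = set()
    log_dir = Path(table_path) / "_delta_log"
    # Commit file names are zero-padded, so lexical order is commit order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)


def write_commit(table_path, version, actions):
    """Write one simplified commit file (hypothetical helper for the demo)."""
    log_dir = Path(table_path) / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    commit = log_dir / f"{version:020d}.json"
    commit.write_text("\n".join(json.dumps(a) for a in actions) + "\n")


# A temporary directory stands in for the mounted HDFS path.
with tempfile.TemporaryDirectory() as mount_point:
    table = Path(mount_point) / "events"
    write_commit(table, 0, [{"add": {"path": "part-0.parquet"}},
                            {"add": {"path": "part-1.parquet"}}])
    write_commit(table, 1, [{"remove": {"path": "part-0.parquet"}},
                            {"add": {"path": "part-2.parquet"}}])
    demo_files = active_files(table)

print(demo_files)  # ['part-1.parquet', 'part-2.parquet']
```

With a real mount, the same local path would simply be handed to delta-rs; whether the mount gives the guarantees needed for safe concurrent writes is a separate question that this sketch does not address.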
I find the solution by @yjshen to be a really sound one, but installation will likely differ across systems. E.g. the company I work at has a custom hdfs version, so I'll have to make sure to build hdfs-native correctly (and not follow the instructions in the repo). The barrier to adoption might be high.
Hi @mingruimingrui , I've met with a similar problem, a customized HDFS version similar to yours. To make it worse, we even use HDFS with federation that isn't supported by native CPP implementations.
My motivation for implementing hdfs-native as well as datafusion-hdfs is to call DataFusion through JNI in Spark executors to boost performance. Because our customized HDFS only maintains its Java client, we currently adopt another approach as a workaround: create an HDFS client in the JVM and share it through JNI.
> Eg. the company I work at has a custom hdfs version. So I'll have to make sure to build hdfs-native correctly (and not follow the instructions in the repo). The barrier of adoption might be high.
Yeah, unfortunately, a custom setup will need a custom build for native applications. I am guessing ClickHouse has the same problem as well.
Yes, that is true for ClickHouse. For now, our hosted ClickHouse cluster can only use a single HDFS NameNode. It lacks the capability to use federated HDFS.
DataFusion has implemented https://github.com/datafusion-contrib/datafusion-objectstore-hdfs. Does that help delta-rs support HDFS?
Yes. It looks like they have complete read support, but write support is incomplete.
Someone could integrate that into this package.