delta-rs icon indicating copy to clipboard operation
delta-rs copied to clipboard

Consider adding Polars as possible data processing framework.

Open ritchie46 opened this issue 3 years ago • 11 comments

I see you already support DataFusion and Rust DataFrame as well, which both use apache arrow. It would be nice if we could read/write from/to Polars as well.

If you'd be willing to, I could initialize a PR that sets that up.

ritchie46 avatar Jul 10 '21 18:07 ritchie46

Yes, please do, we will be more than happy to merge the PR. Make sure you update the readme to mention Polars as well :)

houqp avatar Jul 10 '21 18:07 houqp

Sorry for the delay, I will work on is. Is it ok to also add an implementation based on arrow2 (under a feature flag)?

ritchie46 avatar Sep 12 '21 18:09 ritchie46

@ritchie46 yes, we want to move to arrow2 in the long run as well :) but it would be good if we can manage it with a feature flag because it's going to take some amount of time before we can migrate kafka-delta-ingest to arrow2.

houqp avatar Sep 12 '21 18:09 houqp

If arrow2 feature flag is not possible, we can also maintain an official arrow2 branch

houqp avatar Sep 12 '21 18:09 houqp

@ritchie46 yes, we want to move to arrow2 in the long run as well :)

Great! Polars already runs great on arrow2, and this way, I hope to improve third party support as well. :)

If arrow2 feature flag is not possible, we can also maintain an official arrow2 branch

Yes, I don't know how deep embedded arrow is in delta-rs?

ritchie46 avatar Sep 12 '21 19:09 ritchie46

Yes, I don't know how deep embedded arrow is in delta-rs?

It's not deep, I think the migration should be pretty straight forward. The hard part is isolating it into a feature flag so we can support both arrow and arrow2 in the short term.

houqp avatar Sep 12 '21 19:09 houqp

Would love to see this (specifically an arrow2 feature flag), could consolidate a lot of my custom code around parquet partitioning and writing to object storage + gain the benefits of delta lake on top of that.

wseaton avatar Sep 13 '21 13:09 wseaton

I got bounced on the arrow2 migration. It was not so trivial to feature gate indeed. So that would probably need a separate branch?

ritchie46 avatar Oct 16 '21 07:10 ritchie46

@ritchie46 I actually got something mostly working 2 weeks ago, but got distracted with datafusion and roapi so wasn't able to complete the PoC. Let me try to get something pushed out this weekend so we can take a look at it together to see if that's the right path forward.

houqp avatar Oct 16 '21 07:10 houqp

@ritchie46 I sent my PoC to https://github.com/delta-io/delta-rs/pull/465.

houqp avatar Oct 17 '21 23:10 houqp

@ritchie46 I sent my PoC to #465.

Wow, you were even able to feature gate it. kudos! :raised_hands:

ritchie46 avatar Oct 18 '21 06:10 ritchie46

@ritchie46 and @houqp thanks a lot for pushing the PoC. I was wondering if this is feature complete and any example I could refer to get started?

mohitreddy1996 avatar Jan 20 '23 02:01 mohitreddy1996

@mohitreddy1996 - if it is about reading delta tables with polars, the methods read_delta and scan_delta were recently added.

Beyond that we are also looking into making internal data-processing more flexible in terms of the processing engine being used. To make this work with polars the arrow2/parquet2 creates need to be used internally, which is possible today. But fully integrating polars internally still requires some major work and we unfortunately cannot give a concrete date if and when that will be supported.

roeap avatar Jan 23 '23 07:01 roeap

@mohitreddy1996 - if it is about reading delta tables with polars, the methods read_delta and scan_delta were recently added.

Beyond that we are also looking into making internal data-processing more flexible in terms of the processing engine being used. To make this work with polars the arrow2/parquet2 creates need to be used internally, which is possible today. But fully integrating polars internally still requires some major work and we unfortunately cannot give a concrete date if and when that will be supported.

Are there still considerations to move to using polars for internal operations that require a query execution engine?

ion-elgreco avatar Jul 28 '23 18:07 ion-elgreco

Are there still considerations to move to using polars for internal operations that require a query execution engine?

I don't think we have any plans to more forwards on this.

I don't think we have too much to gain by moving from using DataFusion to Polars. DataFusion's primary user base is engineers building systems, whereas IMO Polars is much more focused on providing interactive APIs for end users; IMO DataFusion will serve us better in the long term. Plus Polars supports a more narrow range of Arrow data types than DataFusion, so it might present some challenges when integrating with the rest of the Arrow ecosystem.

If no one objects, I'd like to close this issue.

wjones127 avatar Jul 28 '23 18:07 wjones127

Are there still considerations to move to using polars for internal operations that require a query execution engine?

I don't think we have any plans to more forwards on this.

I don't think we have too much to gain by moving from using DataFusion to Polars. DataFusion's primary user base is engineers building systems, whereas IMO Polars is much more focused on providing interactive APIs for end users; IMO DataFusion will serve us better in the long term. Plus Polars supports a more narrow range of Arrow data types than DataFusion, so it might present some challenges when integrating with the rest of the Arrow ecosystem.

If no one objects, I'd like to close this issue.

How far were you with the ADBC implementation? I remember that in the design document this will allow the executions to be passed to other ADBC compatible libraries. If that's still the plan, then you can probably close this issue.

ion-elgreco avatar Jul 29 '23 06:07 ion-elgreco

+1 for closing this for now as per @wjones127's comments.

roeap avatar Aug 10 '23 18:08 roeap