Are there plans to support delta reader/writer?

Open francisco-ltech opened this issue 3 years ago • 16 comments

Is any integration with Delta Lake on the horizon, by any chance? https://delta.io/

There is a native delta lake implementation in Rust https://github.com/delta-io/delta-rs/tree/main/rust

francisco-ltech avatar Mar 08 '22 15:03 francisco-ltech

I think it is very unlikely. As far as I can see, Delta Lake does not use the Arrow format and requires Spark.

You can use a Python package to read the data into a pyarrow table, which you can then convert to a Polars DataFrame.

There seem to be two Python packages for this (both apparently imported under the same name):

https://databricks.com/blog/2020/12/22/natively-query-your-delta-lake-with-scala-java-and-python.html

https://github.com/delta-io/delta-rs/tree/main/python

import polars as pl

# Import Delta Table
from deltalake import DeltaTable

# Read the Delta Table using the Rust API
dt = DeltaTable("../rust/tests/data/simple_table")

# Create a Polars Dataframe by initially converting the Delta Lake
# table into a PyArrow table
df = pl.DataFrame(dt.to_pyarrow_table())

https://pypi.org/project/delta-lake-reader/

import polars as pl

from deltalake import DeltaTable

# native file path. Can be relative or absolute
table_path = "somepath/mytable"

# Create a Polars Dataframe by initially converting the Delta Lake
# table into a PyArrow table.
df = pl.DataFrame(DeltaTable(table_path).to_table())

ghuls avatar Mar 08 '22 16:03 ghuls

There were plans to do so. But delta-rs is based on arrow-rs and Polars uses arrow2, so there are some difficulties.

ritchie46 avatar Mar 09 '22 07:03 ritchie46

cc @houqp

jorgecarleitao avatar Mar 09 '22 07:03 jorgecarleitao

There is ongoing work to migrate delta-rs to arrow2 and parquet2, see: https://github.com/delta-io/delta-rs/pull/465. The current branch is mostly complete except for map and list type support. We also need to update to the latest arrow2/parquet2 version :D Once the port is completed, plugging it into Polars should be pretty trivial.

houqp avatar Mar 09 '22 07:03 houqp

I'm also looking forward to Delta Lake support!

andrei-ionescu avatar Jul 01 '22 15:07 andrei-ionescu

@ritchie46 - a new version of delta-rs was recently released with parquet2 support, see here. Thanks for adding this @houqp! Will you be able to add delta-rs support now?

MrPowers avatar Sep 02 '22 11:09 MrPowers

I also want to say this would be great. I bet your implementation will be close to as fast as the Photon compute engine that Databricks charges way too much for.

esadler-hbo avatar Sep 20 '22 03:09 esadler-hbo

Is anyone willing to take on this work? There are a lot of delta-rs developers that are willing to help with code reviews and any issues you might come across. Feel free to ping me directly or here if you're interested.

MrPowers avatar Sep 21 '22 01:09 MrPowers

I didn't realize that my playpen pushes would link back here 😂 @MrPowers I am not really sure what my next steps are, let me know if you are still available for some guidance on how to implement this feature.

winding-lines avatar Nov 26 '22 15:11 winding-lines

I'm working on this feature and will raise a PR soon.

chitralverma avatar Dec 09 '22 12:12 chitralverma

@chitralverma - that's awesome. Let me know if you need any help. We can jump on a call with the core delta-rs devs anytime. Really excited about this feature. I'll blog / promote it as soon as it is live 🚀

MrPowers avatar Dec 09 '22 13:12 MrPowers

Hi @MrPowers, thanks for the encouragement. :) I have raised #5761 as a draft PR. Since this is my first contribution to Polars, I'm expecting quite a few review comments.

If the initial idea is reviewed and considered the way to go, then I plan on adding the following as well:

  • Reading delta tables from supported catalogs
  • More documentation and examples
  • Unit test cases (of course)

@ritchie46, I need your guidance on this as well, whenever you have some time.

Thanks.

chitralverma avatar Dec 09 '22 19:12 chitralverma

Update: PR ready for review.

chitralverma avatar Dec 10 '22 18:12 chitralverma

Update: The read_delta and scan_delta functionalities are merged via #5761! 🎊

https://pola-rs.github.io/polars/py-polars/html/reference/io.html#delta-lake

chitralverma avatar Dec 11 '22 19:12 chitralverma

Hi! Thanks for this amazing feature. What about the writer function? I would love to avoid Spark and only use Rust.

dridk avatar Dec 18 '22 12:12 dridk

I was working on it, but the plan is to implement it on the Python side.

For now, though, it is blocked by https://github.com/delta-io/delta-rs/issues/1024

delta-rs doesn't support the large_string type.

chitralverma avatar Dec 18 '22 17:12 chitralverma

Any update on this? Waiting for this feature.

lordirah avatar Feb 18 '23 17:02 lordirah

The lazy/eager reader is already in place.

For the writer, #7574 is now open.

chitralverma avatar Mar 17 '23 13:03 chitralverma

I'll close this in favor of https://github.com/pola-rs/polars/issues/7574, as it's more specific.

Read/scan functionality has been implemented; write functionality is being worked on.

stinodego avatar Mar 26 '23 04:03 stinodego

@stinodego do you know if there is any plan to support streaming?

abiratsis avatar Sep 16 '23 10:09 abiratsis

Yes, it's planned, but it won't be there anytime soon: https://github.com/pola-rs/polars/issues/11039

stinodego avatar Sep 16 '23 12:09 stinodego