polars
Are there plans to support delta reader/writer?
Is any integration with Delta Lake on the horizon, by any chance? https://delta.io/
There is a native delta lake implementation in Rust https://github.com/delta-io/delta-rs/tree/main/rust
I think it is very unlikely. As far as I can see, Delta Lake does not use the Arrow format and requires Spark.
You can read the data into a pyarrow table, which you can then convert to a polars dataframe.
Seems like there are 2 python packages to do it (which seem to have the same name):
https://databricks.com/blog/2020/12/22/natively-query-your-delta-lake-with-scala-java-and-python.html https://github.com/delta-io/delta-rs/tree/main/python
import polars as pl
# Import Delta Table
from deltalake import DeltaTable
# Read the Delta Table using the Rust API
dt = DeltaTable("../rust/tests/data/simple_table")
# Create a Polars Dataframe by initially converting the Delta Lake
# table into a PyArrow table
df = pl.DataFrame(dt.to_pyarrow_table())
https://pypi.org/project/delta-lake-reader/
import polars as pl
from deltalake import DeltaTable
# native file path. Can be relative or absolute
table_path = "somepath/mytable"
# Create a Polars Dataframe by initially converting the Delta Lake
# table into a PyArrow table.
df = pl.DataFrame(DeltaTable(table_path).to_table())
There were plans to do so. But delta-rs is based on arrow-rs and polars uses arrow2, so there are some difficulties.
cc @houqp
There is ongoing work to migrate delta-rs to arrow2 and parquet2, see: https://github.com/delta-io/delta-rs/pull/465. The current branch is mostly complete except for map and list type support. We also need to update to the latest arrow2/parquet2 version :D Once the port is completed, plugging it into polars should be pretty trivial.
I'm also looking forward to the Delta Lake support!
@ritchie46 - a new version of delta-rs was recently released with parquet2 support, see here. Thanks for adding this @houqp! Will you be able to add delta-rs support now?
I also want to say this would be great. I bet your implementation will be close to as fast as the Photon compute engine that Databricks charges way too much for.
Is anyone willing to take on this work? There are a lot of delta-rs developers that are willing to help with code reviews and any issues you might come across. Feel free to ping me directly or here if you're interested.
I didn't realize that my playpen pushes would link back here 😂 @MrPowers I am not really sure what my next steps are, let me know if you are still available for some guidance on how to implement this feature.
I'm working on this feature and will raise a PR soon.
@chitralverma - that's awesome. Let me know if you need any help. We can jump on a call with the core delta-rs devs anytime. Really excited about this feature. I'll blog / promote it as soon as it is live 🚀
Hi @MrPowers, thanks for the encouragement. :) I have raised #5761 as a draft PR. Since this is my first contribution to Polars, I'm expecting quite a few review comments.
If the initial idea is reviewed and considered the way to go, I plan on adding the following as well:
- Reading delta tables from supported catalogs
- More documentation and examples
- Unit test cases (of course)
@ritchie46 I need your guidance on this as well, whenever you have some time.
Thanks.
Update: PR ready for review.
Update: The read_delta and scan_delta functionalities are merged via #5761! 🎊
https://pola-rs.github.io/polars/py-polars/html/reference/io.html#delta-lake
Hi! Thanks for this amazing feature. What about the writer function? I would love to avoid Spark and only use Rust.
I was working on it, but the plan is to put it on the Python side.
For now, though, it's blocked by https://github.com/delta-io/delta-rs/issues/1024:
delta-rs doesn't support the large_string type.
Any update on this ? waiting for this feature
The lazy/eager reader is already in place.
For the writer, #7574 is now open.
I'll close this in favor of https://github.com/pola-rs/polars/issues/7574 , as it's more specific.
Read/scan functionality has been implemented; write functionality is being worked on.
@stinodego do you know if there is any plan to support streaming?
Yes, it's planned but won't be there anytime soon: https://github.com/pola-rs/polars/issues/11039