polars
Generalize object_store integration to all supported cloud providers
Problem description
The current integration of the object_store crate enables parquet download from S3. Generalize the integration so that parquet files can be downloaded from any of the supported object_store cloud providers.
Change the Python API to defer cloud download operations to the Rust side for the supported cloud providers. The rationale is that the Polars planner can optimize the downloads in some important use cases:
- for parquet files it can download only the columns of interest
- it can leverage per column stats to further reduce downloads
- it can leverage the directory structure when the hive format is used (/field=value/)
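To make the last two points concrete, here is a minimal pure-Python sketch of the kind of pruning a planner could do before downloading anything. It is an illustration only, not how Polars or object_store implement it: the helper names (`parse_hive_partitions`, `prune_by_partition`, `row_group_may_match`), the `ColumnStats` type, and the example object keys are all hypothetical.

```python
from dataclasses import dataclass


def parse_hive_partitions(key: str) -> dict[str, str]:
    """Extract field=value directory segments from an object key."""
    parts = {}
    for segment in key.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            field, _, value = segment.partition("=")
            parts[field] = value
    return parts


def prune_by_partition(keys: list[str], field: str, wanted: str) -> list[str]:
    """Keep only the keys whose hive partition `field` equals `wanted`."""
    return [k for k in keys if parse_hive_partitions(k).get(field) == wanted]


@dataclass
class ColumnStats:
    """Hypothetical stand-in for a row group's per-column min/max statistics."""
    min_value: int
    max_value: int


def row_group_may_match(stats: ColumnStats, lower: int) -> bool:
    """Return True if the row group could contain a value >= lower."""
    return stats.max_value >= lower


# Hypothetical object keys laid out in hive format.
keys = [
    "data/year=2022/part-0.parquet",
    "data/year=2023/part-0.parquet",
    "data/year=2023/part-1.parquet",
]

# Only files under year=2023 would need to be fetched for `year == 2023`.
selected = prune_by_partition(keys, "year", "2023")
```

In this sketch a filter on `year` removes files using only their paths, and a filter such as `value >= 10` would skip any row group whose recorded maximum is below 10, so neither case requires downloading the excluded data.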
This is running into some small issues; see https://github.com/apache/arrow-rs/issues/3419
Hi, awesome feature! I see that it has made its way into the Rust codebase already. Is this the issue to watch for Python API support, or is there a separate ticket for that?
In progress, no ticket. There were some limitations in the GCP interface on the object_store side, so I got sidetracked for the last 2 weekends.
Gotcha, thanks for the update. This is a super powerful/useful feature. Looking forward to trying it out on the Python side.
Made a bit more progress on this, see https://github.com/pola-rs/polars/pull/6426
This has been implemented.