
Generalize object_store integration to all supported cloud providers

Open winding-lines opened this issue 2 years ago • 5 comments

Problem description

The current integration of the object_store crate enables parquet download from S3. Generalize the integration so that parquet files can be downloaded from any of the supported object_store cloud providers.

Change the Python API to defer cloud download operations to the Rust side for the supported cloud providers. The rationale is that the Polars planner can optimize the downloads in some important use cases:

  1. for parquet files it can download only the columns of interest
  2. it can leverage per-column stats to further reduce downloads
  3. it can leverage the directory structure when the hive format is used (/field=value/)
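
To illustrate the third point, here is a minimal sketch of hive-style partition pruning, written in plain Python rather than against Polars or object_store. The helper names (`parse_hive_partitions`, `prune_keys`) and the object keys are hypothetical and only show the idea: a planner can read the `field=value` segments from the listed keys and skip files that cannot match a filter, before downloading any bytes.

```python
from typing import Dict, Iterable, List


def parse_hive_partitions(key: str) -> Dict[str, str]:
    """Extract field=value pairs from the directory part of an object key."""
    parts: Dict[str, str] = {}
    # Skip the last segment, which is the file name rather than a directory.
    for segment in key.split("/")[:-1]:
        if "=" in segment:
            field, _, value = segment.partition("=")
            parts[field] = value
    return parts


def prune_keys(keys: Iterable[str], wanted: Dict[str, str]) -> List[str]:
    """Keep only the keys whose partition values match every wanted field."""
    kept: List[str] = []
    for key in keys:
        parts = parse_hive_partitions(key)
        if all(parts.get(field) == value for field, value in wanted.items()):
            kept.append(key)
    return kept


# Hypothetical object keys, as a bucket listing might return them.
keys = [
    "sales/year=2022/month=11/part-0.parquet",
    "sales/year=2022/month=12/part-0.parquet",
    "sales/year=2023/month=01/part-0.parquet",
]

# Only files under year=2022/month=12 would need to be fetched.
selected = prune_keys(keys, {"year": "2022", "month": "12"})
```

The same principle applies to the first two points: the less the reader has to fetch to answer the query, the cheaper the scan, which is why the decision is best made on the side that knows the query plan.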

winding-lines, Dec 30 '22 15:12

This is running into some small issues, see https://github.com/apache/arrow-rs/issues/3419

winding-lines, Dec 31 '22 15:12

Hi, awesome feature! I see that it has made its way into the Rust codebase already. Is this the issue to watch for Python API support, or is there a separate ticket for that?

talawahtech, Jan 23 '23 02:01

In progress, no ticket. There were some limitations in the GCP interface on the object_store side, so I got sidetracked for the last 2 weekends.

winding-lines, Jan 23 '23 02:01

Gotcha, thanks for the update. This is a super powerful/useful feature. Looking forward to trying it out on the Python side.

talawahtech, Jan 23 '23 02:01

Made a bit more progress on this, see https://github.com/pola-rs/polars/pull/6426

winding-lines, Jan 25 '23 05:01

This has been implemented.

stinodego, Mar 29 '24 11:03