python-bigquery-dataframes

Polars Support

Open firmai opened this issue 1 year ago • 5 comments

It would be great to offer Polars support. Polars is currently half as popular as pandas and generally works better for large datasets. It is bound to replace most data scientists' day-to-day operations within the next five years.

Thanks for developing bigframes, it is very useful.

firmai, May 30 '24 15:05

What kind of polars support would you find useful? Would you want BigQuery DataFrames to have a polars-like DataFrame API (as an alternative to the current pandas-like one), or simply to interop with polars objects more easily?

TrevorBergeron, May 30 '24 19:05

I would like automatic schema supply. This is currently the limiting step in automatically uploading Polars DataFrames: write_ndjson seems to be the only way I can upload list dtypes (Parquet does not seem to be viable, see this issue), but NDJSON requires the schema to be passed. I'm really looking for something that will just let me put my Polars DataFrame in a BQ table without fiddling with schemas: there should already be enough info in the DataFrame to do that for me.

lmmx, Jun 13 '24 16:06

For going from BigQuery DataFrames to polars, I'm adding a to_arrow method in https://github.com/googleapis/python-bigquery-dataframes/pull/807, as well as an example of how to create a polars DataFrame from the results.

tswast, Jun 26 '24 21:06

For uploading to BigQuery, I have updated the polars docs to indicate how to get BigQuery to correctly handle list types: https://github.com/pola-rs/polars/pull/20292

I think that read_polars and to_polars methods would be reasonable requests for bigframes. I have done some refactoring recently to our I/O that might make it a bit easier, but would probably require a little more refactoring to have pyarrow tables/recordbatches as the intermediate format instead of pandas dataframes. The other thing to be careful about is that polars would be an optional "extra" dependency in setup.py to avoid a hard dependency on the polars package.

Edit: Or at the very least, a read_arrow(...) method to correspond to the to_arrow() I implemented in #807. There are fewer concerns with depending on pyarrow in bigframes because we already have that as a required dependency.
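
To illustrate the optional "extra" dependency idea, here is a rough sketch of a lazy import guard. The helper `import_optional` and the extra name `polars` are hypothetical, not something bigframes is confirmed to ship:

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or fail with an install hint.

    Hypothetical helper: `extra` is the assumed name of the extras group.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install 'bigframes[{extra}]'"
        ) from exc
```

A `to_polars`-style method could call `import_optional("polars", "polars")` at the top of its body, so that importing bigframes itself never needs polars installed.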

tswast, Jan 03 '25 16:01

Amazing! Should this issue be closed now?

lmmx, Feb 13 '25 16:02

I just mailed https://github.com/googleapis/python-bigquery-dataframes/pull/1855 with bpd.read_arrow(pyarrow.Table) to round out the other side of this conversion.

Technically I think this was possible before by going through the DataFrame constructor, but that ended up translating to pandas as an in-between layer. Now we can just go from polars -> Arrow -> BigFrames without pandas in the middle.

tswast, Jun 26 '25 15:06