Polars DataFrame Support

th3ed opened this issue 3 years ago · 8 comments

Polars is a single-machine columnar dataframe library built on the Apache Arrow memory format, with both an eager and a lazy API. While the default pickle serialization works fine for eagerly-created dataframes, performance could be improved by leveraging the same Arrow IPC serialization that pyarrow Tables currently have. Additionally, LazyFrames are not currently serializable, but they contain an underlying JSON plan that could be passed over the wire.
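
As an illustration, a minimal sketch of how an Arrow-IPC-based serializer could be registered with distributed's serialization dispatch; the handler names and the empty header are assumptions, not a settled design:

```python
import io

import polars as pl
from distributed.protocol import dask_serialize, dask_deserialize


@dask_serialize.register(pl.DataFrame)
def serialize_polars(df: pl.DataFrame):
    # Write the frame with Arrow IPC, which keeps the columnar layout
    # intact instead of round-tripping through pickle.
    buf = io.BytesIO()
    df.write_ipc(buf)
    return {}, [buf.getvalue()]


@dask_deserialize.register(pl.DataFrame)
def deserialize_polars(header, frames):
    # Reconstruct the frame from the raw IPC bytes.
    return pl.read_ipc(io.BytesIO(frames[0]))
```

A LazyFrame handler could follow the same pattern, shipping the JSON plan instead of data frames.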

th3ed avatar Oct 30 '22 21:10 th3ed

@ritchie46 and I spent a couple days and put together a very brief POC on this over at https://github.com/pola-rs/dask-polars

We paused due to lack of users asking for it. It's nice to see users ask for it.

Can I ask your motivation here? Is it mostly data access / serialization pain, or are you mostly looking for performance?

mrocklin avatar Oct 31 '22 13:10 mrocklin

Performance is the main motivation:

  • most of the data we work with is in parquet format where conversion from columnar to row-major pandas format tends to be a major time and memory bottleneck both in reading and writing these files
  • native support for the nullable spark dtypes present in our parquet files
  • better compression for columnar data means less network transfer in the cluster
  • Polars has a nice lazy API that includes query-plan optimization and robust column and row pushdown filtering. While dask currently supports column pushdowns, there might be an opportunity to leverage polars optimized query plans here (kind of like how dask-sql uses datafusion for the same purpose)

What polars lacks is distributed execution, out-of-core/out-of-memory support, and the ability to work with other Python objects, which are areas where dask excels.

th3ed avatar Oct 31 '22 14:10 th3ed

> most of the data we work with is in parquet format where conversion from columnar to row-major pandas format tends to be a major time and memory bottleneck both in reading and writing these files

I would be surprised by this. Pandas is also column-major. Also, in-memory flips like this are rarely performance bottlenecks. Reading from disk/S3 is likely to be slower.

> native support for the nullable spark dtypes present in our parquet files

Makes sense. This should end up working in pandas + dask soon regardless.

> better compression for columnar data means less network transfer in the cluster

Dask compresses all data sent across the network with a fast compressor when one is available.
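
For reference, a minimal sketch of pinning the wire compressor explicitly, assuming the standard `distributed.comm.compression` configuration key:

```python
import dask

# By default distributed picks a fast compressor (e.g. lz4) when one is
# installed; the choice can also be pinned explicitly in configuration.
dask.config.set({"distributed.comm.compression": "lz4"})
```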

> Polars has a nice lazy API that includes query-plan optimization and robust column and row pushdown filtering. While dask currently supports column pushdowns, there might be an opportunity to leverage polars optimized query plans here (kind of like how dask-sql uses datafusion for the same purpose)

Yeah, this would be good. Dask should also get something like this sometime next year, but it'll never do the hardcore multiple-operations-in-single-pass-over-memory stuff that I suspect Polars does.

Dask+Polars is doable and would be fun. I'm game to support something like this. My guess, though, is that it will be faster to get Dask+Pandas to a point that meets most users' needs. The things that you bring up are valid, but they are also mostly handled or actively being handled (with the exception of query planning).

mrocklin avatar Oct 31 '22 14:10 mrocklin

> What polars lacks is distributed execution, out-of-core/out-of-memory support, and the ability to work with other Python objects, which are areas where dask excels.

Out-of-core execution of most queries is only a month or two away 🙂: https://github.com/pola-rs/polars/pull/5339, https://github.com/pola-rs/polars/pull/5139.
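
For context, roughly what the out-of-core path looks like on the polars side, assuming the streaming engine from those PRs is exposed through `collect` (the exact argument spelling and the `group_by`/`groupby` naming have varied across polars versions):

```python
import polars as pl

# Lazily scan a hypothetical set of parquet files; nothing is read yet,
# and the filter is pushed down into the scan.
lazy = (
    pl.scan_parquet("data/*.parquet")
    .filter(pl.col("value") > 0)
    .group_by("key")  # spelled `groupby` in 2022-era polars
    .agg(pl.col("value").sum())
)

# Execute in streaming mode so the full dataset never has to fit in RAM.
df = lazy.collect(streaming=True)
```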

Distributed, not so much. I think Dask+Polars has the potential to really save cloud costs, as Polars really is able to utilize a node's potential with regard to query time, with low overhead.

ritchie46 avatar Oct 31 '22 15:10 ritchie46

> Polars has a nice lazy API that includes query-plan optimization and robust column and row pushdown filtering. While dask currently supports column pushdowns, there might be an opportunity to leverage polars optimized query plans here (kind of like how dask-sql uses datafusion for the same purpose)

Several of these concepts are implemented in Dask-SQL, but plan optimization will be a work in progress for some time. I wonder if being able to swap Pandas for Polars in Dask would yield some improvements for Dask-SQL queries.

randerzander avatar Nov 01 '22 14:11 randerzander

I am working a lot with fairly large Delta Tables (from 50GB to 40TB). I recently tried to read and print a sample table (90GB) with dask, polars and delta-rs on a single machine (32 cores).

These were my findings:

| Version | Time in s | Max Mem. in GiB |
| --- | --- | --- |
| dask | 234.23 | ~105 |
| dask (scheduler="processes") | 134.3 | ~102 |
| polars | 21.23 | ~95 |
| polars (+to_pandas) | 40.11 | ~110 |
| delta-rs (to_pandas) | 235.28 | ~100 |

I tried many different variations, but polars seems to be doing something better here. At least on single machines, these results convinced me to switch a number of my processes from pandas to polars. I would love to see how this would perform in a distributed setup, which would be much more beneficial for my use cases.

glxplz avatar Nov 03 '22 18:11 glxplz

Just throwing my use case out there: I am processing a lot of modest-sized parquet files (5-100MB) in parallel because of IO blocking, and would love to distribute that work across a Dask cluster whilst benefiting from polars' robust performance and memory characteristics.
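
A minimal sketch of that pattern with `dask.delayed`; the paths and the transformation are hypothetical placeholders:

```python
import dask
import polars as pl


@dask.delayed
def process(path: str) -> pl.DataFrame:
    # Each task reads and transforms one modest-sized parquet file;
    # dask schedules the tasks across the cluster's workers.
    return pl.read_parquet(path).filter(pl.col("value") > 0)


paths = [f"data/part-{i}.parquet" for i in range(100)]
frames = dask.compute(*[process(p) for p in paths])
combined = pl.concat(list(frames))
```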

eddie-atkinson avatar Apr 29 '23 06:04 eddie-atkinson

I see this thread has been open for quite a few years without capturing much attention, which surprises me; to me, dask and polars are complementary and do not compete. If you ask me, the performance bump that Dask could theoretically gain is huge, as Polars is so much faster than pandas. Given that Polars is aiming to monetise a distributed offering of its own, it seems to me it would be wildly competitive for Dask to embrace Polars instead. Coiled has been in the works for some time now and it looks really capable.

I wonder, then: why doesn't Dask / Coiled embrace Polars as the underlying dataframe instead of pandas? From a performance point of view it seems like a no-brainer, and likewise from a market-competition point of view. Is it the license? Are there too many underlying differences that would break the architecture of Dask? One thing I can imagine is that Polars' out-of-core logic might compete with that of Dask.

Either way, all the times in the past (5y ago) that I've worked with dask and fiddled with the things under the hood, it always gave me the feeling that it's a rather elegant architecture that could leverage this, as shown by @mrocklin's POC.

Curious to hear your thoughts!

Also, my use case would be to try to run it on Databricks clusters to see how much faster it can be if it uses the whole cluster instead of a single node. Polars has already proven able to best PySpark.

hetspookjee avatar Nov 20 '25 07:11 hetspookjee