ray icon indicating copy to clipboard operation
ray copied to clipboard

[Datasets] Add Polars backend for tabular aggregations

Open clarkzinzow opened this issue 2 years ago • 9 comments

This PR adds prototype support for using Polars for tabular aggregations, which should not only provide vectorized execution of column-wise aggregations (currently implemented) but should also provide column-wise parallelism for multi-aggregations and group-wise parallelism for each aggregation, among other optimizations.

Brief performance testing on a single aggregation over 10M rows shows an over 20x speedup:

In [8]: ctx.use_polars = False

In [9]: %time agg_ds = ds.groupby("A").sum("B")
CPU times: user 603 ms, sys: 88.3 ms, total: 691 ms
Wall time: 20.3 s

In [10]: ctx.use_polars = True

In [11]: %time agg_ds = ds.groupby("A").sum("B")
CPU times: user 303 ms, sys: 21.1 ms, total: 324 ms
Wall time: 742 ms

Related issue number

Closes #26131

TODOs

  • [x] Support ignore_nulls=False.
  • [x] Port remaining aggregations.
  • [x] Provide a better (user-facing) API for custom aggregations.
  • [x] Update Polars to include fix for allowing null propagation in partial aggregations involving struct types.
  • [ ] Add test including tensor extensions to test Arrow2's extension type handling.

clarkzinzow avatar Apr 28 '22 00:04 clarkzinzow

@clarkzinzow https://github.com/pola-rs/polars/pull/3248 has been released in polars==0.13.28. This unblocks the ignore_nulls=False branch.

ritchie46 avatar May 02 '22 07:05 ritchie46

@ritchie46 Awesome! 🙌

clarkzinzow avatar May 02 '22 13:05 clarkzinzow

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot] avatar Jun 10 '22 18:06 stale[bot]

Will be coming back to this in the next week or so.

clarkzinzow avatar Jun 10 '22 19:06 clarkzinzow

Could we reorganize the code so that polars specific code is isolated to _internal/arrow_ops/aggregate_polars.py? We can follow the a similar convention as used for sort, where there is an arrow_ops impl in one file and an equivalent polars_ops impl in the other file.

@ericl Yep I thought about doing that, we will still need to expose some things such as PolarsAggregation to the user in order to support custom Polars aggregations, but we can isolate the rest of it.

clarkzinzow avatar Jun 22 '22 06:06 clarkzinzow

@ericl Hmm upon a second look, unlike the sort implementations, the Arrow groupby implementation makes heavy use of the ArrowBlockAccessor APIs (.iter_rows(), ._munge_conflict(), .slice()) as well as ArrowBlockBuilder; the former would require either a weird upwards dependency on ArrowBlockAccessor or another implementation of .iter_rows() and .slice(), and the latter would require another upwards dependency on ArrowBlockBuilder. It's not a clean Arrow compute vs. Polars operations like sort, unfortunately.

I'll try to think of a good way to isolate the Polars vs. Arrow implementation without weird bidirectional dependencies or code duplication.

clarkzinzow avatar Jun 23 '22 00:06 clarkzinzow

Yeah, the other alternative I was thinking is to have two separate types of Block accessors entirely.

On Wed, Jun 22, 2022, 5:39 PM Clark Zinzow @.***> wrote:

@ericl https://github.com/ericl Hmm upon a second look, unlike the sort implementations, the Arrow groupby implementation makes heavy use of the ArrowBlockAccessor APIs (.iter_rows(), ._munge_conflict(), .slice()) as well as ArrowBlockBuilder; the former would require either a weird upwards dependency on ArrowBlockAccessor or another implementation of .iter_rows() and .slice(), and the latter would require another upwards dependency on ArrowBlockBuilder. It's not a clean Arrow compute vs. Polars operations like sort, unfortunately.

I'll try to think of a good way to isolate the Polars vs. Arrow implementation without weird bidirectional dependencies or code duplication.

— Reply to this email directly, view it on GitHub https://github.com/ray-project/ray/pull/24282#issuecomment-1163802172, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAADUSSUWVXNB6EQHBAIQSDVQOW4JANCNFSM5UQYZXPQ . You are receiving this because you were mentioned.Message ID: @.***>

ericl avatar Jun 23 '22 00:06 ericl

The underlying data representation is the same (Arrow), Polars is just offering a suite of operations on top of Arrow data, so I don't think that having another block accessor would make sense.

clarkzinzow avatar Jun 23 '22 01:06 clarkzinzow

I see. I think this is pointing to a larger code structure issue that our data format (ArrowBlock) should not also define execution methods (e.g., combine, etc.).

What if we moved the execution impls into a separate class, such as ArrowBlockOps? The we can have a NativeArrowBlockOpsImpl(ArrowBlockOps), and PolarsArrowBlockOpsImpl(ArrowBlockOps), which would cleanly separate the high level aggregation code into two modules.

For the aggregation definitions, I'm not sure I have a better idea than the current as_polars() strategy, maybe that's fine.

ericl avatar Jun 24 '22 21:06 ericl

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot] avatar Oct 22 '22 18:10 stale[bot]

Hi, @clarkzinzow thanks a lot for this PR. I'd like to ask if you still work on this and when it will be released ?

qooba avatar Oct 30 '22 22:10 qooba

Hi @qooba, we were momentarily blocked by a few Polars bugs but I believe that we're now nearly unblocked! All that remains (AFAIK) is fixing the multi-block global standard deviation implementation, which appears to be buggy. I just rebased on master, I'll be looking at this again early next week.

My hope is to have it land in master in the next week or so, so that it will land in Ray 2.2.

clarkzinzow avatar Nov 04 '22 23:11 clarkzinzow

Hi @qooba, we were momentarily blocked by a few Polars bugs but I believe that we're now nearly unblocked! All that remains (AFAIK) is fixing the multi-block global standard deviation implementation, which appears to be buggy. I just rebased on master, I'll be looking at this again early next week.

My hope is to have it land in master in the next week or so, so that it will land in Ray 2.2.

Awesome, thanks a lot :)

qooba avatar Nov 05 '22 09:11 qooba

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot] avatar Dec 11 '22 08:12 stale[bot]

@clarkzinzow did you try updating the polars version to see if they solve the bugs? If there is something specific that you want to be fixed you can also raise issues for this on polars side.

I see from your commit that this PR is still using 0.14.x but current latest version is 0.15.x

chitralverma avatar Dec 20 '22 10:12 chitralverma

Hi @chitralverma! I've been in contact with @ritchie46 throughout this prototyping effort, and all outstanding Polars-side bugs are resolved. The only remaining TODO for this PR is to decompose it into more reviewable stacked PRs, although we currently plan to merge this feature-flagged and off by default until we have some comprehensive groupby + aggregation benchmarks, so we can clearly demonstrate the performance benefits of the Polars backend.

Getting this across the line is currently a low-priority background task since (1) we're blocked from enabling it by default by having the benchmarks, (2) tabular data is less of a focus for us right now than NLP and CV, and (3) groupby + aggregations is less of a focus for us right now than narrow parallel transformations.

clarkzinzow avatar Dec 20 '22 17:12 clarkzinzow

Hi @chitralverma! I've been in contact with @ritchie46 throughout this prototyping effort, and all outstanding Polars-side bugs are resolved. The only remaining TODO for this PR is to decompose it into more reviewable stacked PRs, although we currently plan to merge this feature-flagged and off by default until we have some comprehensive groupby + aggregation benchmarks, so we can clearly demonstrate the performance benefits of the Polars backend.

Getting this across the line is currently a low-priority background task since (1) we're blocked from enabling it by default by having the benchmarks, (2) tabular data is less of a focus for us right now than NLP and CV, and (3) groupby + aggregations is less of a focus for us right now than narrow parallel transformations.

Got it , thanks!

chitralverma avatar Dec 20 '22 18:12 chitralverma

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot] avatar Jan 20 '23 23:01 stale[bot]

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

stale[bot] avatar Feb 04 '23 20:02 stale[bot]

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

stale[bot] avatar Mar 11 '23 13:03 stale[bot]

Great work here, what's the status of this regarding the next release?

omidb avatar Mar 27 '23 14:03 omidb

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot] avatar May 01 '23 13:05 stale[bot]

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

stale[bot] avatar May 18 '23 23:05 stale[bot]

Hi, Any updates on this?

aeroaks avatar Jul 13 '23 14:07 aeroaks