ray
ray copied to clipboard
[Datasets] Add Polars backend for tabular aggregations
This PR adds prototype support for using Polars for tabular aggregations, which should not only provide vectorized execution of column-wise aggregations (currently implemented) but should also provide column-wise parallelism for multi-aggregations and group-wise parallelism for each aggregation, among other optimizations.
Brief performance testing on a single aggregation over 10M rows shows an over 20x speedup:
In [8]: ctx.use_polars = False
In [9]: %time agg_ds = ds.groupby("A").sum("B")
CPU times: user 603 ms, sys: 88.3 ms, total: 691 ms
Wall time: 20.3 s
In [10]: ctx.use_polars = True
In [11]: %time agg_ds = ds.groupby("A").sum("B")
CPU times: user 303 ms, sys: 21.1 ms, total: 324 ms
Wall time: 742 ms
Related issue number
Closes #26131
TODOs
- [x] Support
ignore_nulls=False
. - [x] Port remaining aggregations.
- [x] Provide a better (user-facing) API for custom aggregations.
- [x] Update Polars to include fix for allowing null propagation in partial aggregations involving struct types.
- [ ] Add test including tensor extensions to test Arrow2's extension type handling.
@clarkzinzow https://github.com/pola-rs/polars/pull/3248 has been released in polars==0.13.28
. This unblocks the ignore_nulls=False
branch.
@ritchie46 Awesome! 🙌
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
- If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Will be coming back to this in the next week or so.
Could we reorganize the code so that polars specific code is isolated to _internal/arrow_ops/aggregate_polars.py? We can follow the a similar convention as used for sort, where there is an arrow_ops impl in one file and an equivalent polars_ops impl in the other file.
@ericl Yep I thought about doing that, we will still need to expose some things such as PolarsAggregation
to the user in order to support custom Polars aggregations, but we can isolate the rest of it.
@ericl Hmm upon a second look, unlike the sort implementations, the Arrow groupby implementation makes heavy use of the ArrowBlockAccessor
APIs (.iter_rows()
, ._munge_conflict()
, .slice()
) as well as ArrowBlockBuilder
; the former would require either a weird upwards dependency on ArrowBlockAccessor
or another implementation of .iter_rows()
and .slice()
, and the latter would require another upwards dependency on ArrowBlockBuilder
. It's not a clean Arrow compute vs. Polars operations like sort, unfortunately.
I'll try to think of a good way to isolate the Polars vs. Arrow implementation without weird bidirectional dependencies or code duplication.
Yeah, the other alternative I was thinking is to have two separate types of Block accessors entirely.
On Wed, Jun 22, 2022, 5:39 PM Clark Zinzow @.***> wrote:
@ericl https://github.com/ericl Hmm upon a second look, unlike the sort implementations, the Arrow groupby implementation makes heavy use of the ArrowBlockAccessor APIs (.iter_rows(), ._munge_conflict(), .slice()) as well as ArrowBlockBuilder; the former would require either a weird upwards dependency on ArrowBlockAccessor or another implementation of .iter_rows() and .slice(), and the latter would require another upwards dependency on ArrowBlockBuilder. It's not a clean Arrow compute vs. Polars operations like sort, unfortunately.
I'll try to think of a good way to isolate the Polars vs. Arrow implementation without weird bidirectional dependencies or code duplication.
— Reply to this email directly, view it on GitHub https://github.com/ray-project/ray/pull/24282#issuecomment-1163802172, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAADUSSUWVXNB6EQHBAIQSDVQOW4JANCNFSM5UQYZXPQ . You are receiving this because you were mentioned.Message ID: @.***>
The underlying data representation is the same (Arrow), Polars is just offering a suite of operations on top of Arrow data, so I don't think that having another block accessor would make sense.
I see. I think this is pointing to a larger code structure issue that our data format (ArrowBlock) should not also define execution methods (e.g., combine, etc.).
What if we moved the execution impls into a separate class, such as ArrowBlockOps
? The we can have a NativeArrowBlockOpsImpl(ArrowBlockOps)
, and PolarsArrowBlockOpsImpl(ArrowBlockOps)
, which would cleanly separate the high level aggregation code into two modules.
For the aggregation definitions, I'm not sure I have a better idea than the current as_polars() strategy, maybe that's fine.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
- If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Hi, @clarkzinzow thanks a lot for this PR. I'd like to ask if you still work on this and when it will be released ?
Hi @qooba, we were momentarily blocked by a few Polars bugs but I believe that we're now nearly unblocked! All that remains (AFAIK) is fixing the multi-block global standard deviation implementation, which appears to be buggy. I just rebased on master, I'll be looking at this again early next week.
My hope is to have it land in master in the next week or so, so that it will land in Ray 2.2.
Hi @qooba, we were momentarily blocked by a few Polars bugs but I believe that we're now nearly unblocked! All that remains (AFAIK) is fixing the multi-block global standard deviation implementation, which appears to be buggy. I just rebased on master, I'll be looking at this again early next week.
My hope is to have it land in master in the next week or so, so that it will land in Ray 2.2.
Awesome, thanks a lot :)
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
- If you'd like to keep this open, just leave any comment, and the stale label will be removed.
@clarkzinzow did you try updating the polars version to see if they solve the bugs? If there is something specific that you want to be fixed you can also raise issues for this on polars side.
I see from your commit that this PR is still using 0.14.x but current latest version is 0.15.x
Hi @chitralverma! I've been in contact with @ritchie46 throughout this prototyping effort, and all outstanding Polars-side bugs are resolved. The only remaining TODO for this PR is to decompose it into more reviewable stacked PRs, although we currently plan to merge this feature-flagged and off by default until we have some comprehensive groupby + aggregation benchmarks, so we can clearly demonstrate the performance benefits of the Polars backend.
Getting this across the line is currently a low-priority background task since (1) we're blocked from enabling it by default by having the benchmarks, (2) tabular data is less of a focus for us right now than NLP and CV, and (3) groupby + aggregations is less of a focus for us right now than narrow parallel transformations.
Hi @chitralverma! I've been in contact with @ritchie46 throughout this prototyping effort, and all outstanding Polars-side bugs are resolved. The only remaining TODO for this PR is to decompose it into more reviewable stacked PRs, although we currently plan to merge this feature-flagged and off by default until we have some comprehensive groupby + aggregation benchmarks, so we can clearly demonstrate the performance benefits of the Polars backend.
Getting this across the line is currently a low-priority background task since (1) we're blocked from enabling it by default by having the benchmarks, (2) tabular data is less of a focus for us right now than NLP and CV, and (3) groupby + aggregations is less of a focus for us right now than narrow parallel transformations.
Got it , thanks!
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
- If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.
Please feel free to reopen or open a new issue if you'd still like it to be addressed.
Again, you can always ask for help on our discussion forum or Ray's public slack channel.
Thanks again for opening the issue!
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.
Please feel free to reopen or open a new issue if you'd still like it to be addressed.
Again, you can always ask for help on our discussion forum or Ray's public slack channel.
Thanks again for opening the issue!
Great work here, what's the status of this regarding the next release?
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
- If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.
Please feel free to reopen or open a new issue if you'd still like it to be addressed.
Again, you can always ask for help on our discussion forum or Ray's public slack channel.
Thanks again for opening the issue!
Hi, Any updates on this?