cudf
cudf copied to clipboard
Added 'crosstab' and 'pivot_table' features
Resolves https://github.com/rapidsai/cudf/issues/5944, resolves https://github.com/rapidsai/cudf/issues/1214
This PR adds support for basic use cases of crosstab
and pivot_table
functions.
Hi @shaswat-indian - thanks for taking this on! The approach taken here could be improved, particularly by relying less on Pandas internals and relying instead on cuDF's internals more. I'd be happy to go over the code with you offline before coming back here and doing another review!
@shaswat-indian @shwina I haven't looked at this code at all, but just FYI you may also benefit from looking at #9153
@vyasr I took a look at that PR but mostly relied on the Pandas implementation for reference. I have made some changes as suggested by @shwina during our offline discussion. Please review if any more changes are required.
Great work so far! Could you also please add a benchmark for each of the functions being introduced here? The guidelines for writing benchmarks are still being written up here.
In the meantime, I would be very curious even if you posted informal benchmarks, using something like %timeit
— how much faster are we compared to Pandas (and how does that depend on the size of the data)? 😄
Great work so far! Could you also please add a benchmark for each of the functions being introduced here? The guidelines for writing benchmarks are still being written up here.
Added benchmarks for the two functions based on the guidelines in the PR above. Based on the mean time taken for 1000000 rows, pivot_table
performed ~ 5x faster and crosstab
performed ~ 2x faster than their respective Pandas counterparts.
@shaswat-indian Thanks for working on this! We're entering code freeze for the 22.08 release tomorrow, so I bumped this PR to target branch-22.10
. 👍
rerun tests
rerun tests
Codecov Report
:exclamation: No coverage uploaded for pull request base (
branch-22.10@039622f
). Click here to learn what that means. The diff coverage isn/a
.
@@ Coverage Diff @@
## branch-22.10 #11314 +/- ##
===============================================
Coverage ? 86.36%
===============================================
Files ? 145
Lines ? 22945
Branches ? 0
===============================================
Hits ? 19816
Misses ? 3129
Partials ? 0
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
I think this is ready to merge! Thanks @shaswat-indian!
@gpucibot merge