cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Added 'crosstab' and 'pivot_table' features

Open shaswat-indian opened this issue 2 years ago • 10 comments

Resolves https://github.com/rapidsai/cudf/issues/5944, resolves https://github.com/rapidsai/cudf/issues/1214

This PR adds support for basic use cases of crosstab and pivot_table functions.

shaswat-indian avatar Jul 20 '22 22:07 shaswat-indian

Hi @shaswat-indian - thanks for taking this on! The approach taken here could be improved, particularly by relying less on Pandas internals and relying instead on cuDF's internals more. I'd be happy to go over the code with you offline before coming back here and doing another review!

shwina avatar Jul 21 '22 13:07 shwina

@shaswat-indian @shwina I haven't looked at this code at all, but just FYI you may also benefit from looking at #9153

vyasr avatar Jul 21 '22 20:07 vyasr

@vyasr I took a look at that PR but mostly relied on the Pandas implementation for reference. I have made some changes as suggested by @shwina during our offline discussion. Please review if any more changes are required.

shaswat-indian avatar Jul 21 '22 22:07 shaswat-indian

Great work so far! Could you also please add a benchmark for each of the functions being introduced here? The guidelines for writing benchmarks are still being written up here.

shwina avatar Jul 25 '22 19:07 shwina

In the meantime, I would be very curious even if you posted informal benchmarks, using something like %timeit — how much faster are we compared to Pandas (and how does that depend on the size of the data)? 😄

shwina avatar Jul 25 '22 19:07 shwina

Great work so far! Could you also please add a benchmark for each of the functions being introduced here? The guidelines for writing benchmarks are still being written up here.

Added benchmarks for the two functions based on the guidelines in the PR above. Based on the mean time taken for 1000000 rows, pivot_table performed ~ 5x faster and crosstab performed ~ 2x faster than their respective Pandas counterparts.

shaswat-indian avatar Jul 27 '22 02:07 shaswat-indian

@shaswat-indian Thanks for working on this! We're entering code freeze for the 22.08 release tomorrow, so I bumped this PR to target branch-22.10. 👍

bdice avatar Jul 28 '22 03:07 bdice

rerun tests

shaswat-indian avatar Aug 02 '22 19:08 shaswat-indian

rerun tests

shaswat-indian avatar Aug 08 '22 19:08 shaswat-indian

Codecov Report

:exclamation: No coverage uploaded for pull request base (branch-22.10@039622f). Click here to learn what that means. The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.10   #11314   +/-   ##
===============================================
  Coverage                ?   86.36%           
===============================================
  Files                   ?      145           
  Lines                   ?    22945           
  Branches                ?        0           
===============================================
  Hits                    ?    19816           
  Misses                  ?     3129           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov[bot] avatar Aug 08 '22 21:08 codecov[bot]

I think this is ready to merge! Thanks @shaswat-indian!

bdice avatar Aug 11 '22 23:08 bdice

@gpucibot merge

bdice avatar Aug 11 '22 23:08 bdice