cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Use cuco::static_set in the hash-based groupby

Open PointKernel opened this issue 1 year ago • 1 comments

Description

Depends on https://github.com/rapidsai/cudf/pull/14849

Contributes to #12261

This PR migrates hash groupby to use the new cuco::static_set data structure.

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

PointKernel avatar Jan 19 '24 23:01 PointKernel

Question to reviewers: is it worth doing runtime dispatching based on whether nested types are involved or not?

Benchmark results are shown below, TLDR:

  • The hash table size is about half as before
  • The new set groupby brings from 0 to 30% speedups for flat types
  • The improvement for nested type is noticeable but not as good as flat types

Based on our past tuning experience, e.g.: https://github.com/rapidsai/cudf/blob/3ba63c3c3cb72950adc4c9699fcfa1a72796a041/cpp/src/search/contains_table.cu#L158-L165 using a larger CG size (like 2 or 4) for nested types can bring significant speedups comapred to the current CGSize == 1 for all types.

groupby_max

[0] Quadro RTX 8000

T num_rows null_probability Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I32 2^12 0 83.292 us 7.14% 83.962 us 7.56% 0.670 us 0.80% PASS
I32 2^18 0 123.161 us 21.91% 103.673 us 3.47% -19.489 us -15.82% FAIL
I32 2^24 0 2.702 ms 5.78% 1.886 ms 4.61% -815.711 us -30.19% FAIL
I32 2^12 0.1 106.195 us 4.47% 105.438 us 5.46% -0.757 us -0.71% PASS
I32 2^18 0.1 144.568 us 4.03% 130.310 us 4.62% -14.258 us -9.86% FAIL
I32 2^24 0.1 2.713 ms 2.97% 1.956 ms 3.43% -756.833 us -27.90% FAIL
I32 2^12 0.9 106.164 us 5.38% 105.230 us 5.81% -0.933 us -0.88% PASS
I32 2^18 0.9 134.932 us 3.71% 126.974 us 4.86% -7.958 us -5.90% FAIL
I32 2^24 0.9 2.396 ms 0.49% 1.700 ms 0.94% -695.351 us -29.03% FAIL
I64 2^12 0 79.217 us 4.74% 81.464 us 5.33% 2.247 us 2.84% PASS
I64 2^18 0 124.065 us 3.35% 110.290 us 3.71% -13.776 us -11.10% FAIL
I64 2^24 0 2.921 ms 5.13% 2.080 ms 3.67% -841.509 us -28.81% FAIL
I64 2^12 0.1 107.544 us 5.30% 106.271 us 5.33% -1.273 us -1.18% PASS
I64 2^18 0.1 158.161 us 12.67% 138.162 us 3.74% -19.999 us -12.64% FAIL
I64 2^24 0.1 2.935 ms 3.03% 2.149 ms 2.33% -786.224 us -26.79% FAIL
I64 2^12 0.9 106.698 us 5.45% 105.586 us 5.56% -1.112 us -1.04% PASS
I64 2^18 0.9 141.798 us 3.56% 131.287 us 5.12% -10.512 us -7.41% FAIL
I64 2^24 0.9 2.528 ms 0.41% 1.843 ms 0.67% -684.869 us -27.09% FAIL
F32 2^12 0 87.432 us 4.34% 87.644 us 5.17% 0.212 us 0.24% PASS
F32 2^18 0 139.932 us 5.17% 120.252 us 3.90% -19.680 us -14.06% FAIL
F32 2^24 0 3.034 ms 8.05% 2.159 ms 8.12% -875.115 us -28.85% FAIL
F32 2^12 0.1 118.647 us 4.18% 117.394 us 4.97% -1.254 us -1.06% PASS
F32 2^18 0.1 167.872 us 5.11% 145.176 us 3.74% -22.696 us -13.52% FAIL
F32 2^24 0.1 2.997 ms 5.81% 2.172 ms 6.50% -825.041 us -27.53% FAIL
F32 2^12 0.9 121.641 us 6.73% 111.989 us 5.16% -9.652 us -7.93% FAIL
F32 2^18 0.9 143.159 us 4.11% 131.017 us 3.83% -12.142 us -8.48% FAIL
F32 2^24 0.9 2.489 ms 0.43% 1.787 ms 0.84% -702.675 us -28.23% FAIL
F64 2^12 0 87.555 us 4.14% 88.347 us 4.57% 0.792 us 0.90% PASS
F64 2^18 0 156.124 us 17.01% 127.645 us 4.27% -28.480 us -18.24% FAIL
F64 2^24 0 3.300 ms 8.27% 2.337 ms 8.59% -963.100 us -29.19% FAIL
F64 2^12 0.1 119.458 us 4.38% 117.437 us 5.24% -2.021 us -1.69% PASS
F64 2^18 0.1 173.348 us 4.67% 152.983 us 4.03% -20.365 us -11.75% FAIL
F64 2^24 0.1 3.249 ms 6.93% 2.349 ms 5.91% -900.036 us -27.70% FAIL
F64 2^12 0.9 112.170 us 4.52% 113.566 us 5.73% 1.396 us 1.24% PASS
F64 2^18 0.9 148.631 us 3.42% 135.485 us 3.78% -13.146 us -8.84% FAIL
F64 2^24 0.9 2.612 ms 0.49% 1.906 ms 0.64% -705.332 us -27.01% FAIL

groupby_struct_keys

[0] Quadro RTX 8000

NumRows Depth Nulls Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
2^10 0 0 71.830 us 4.79% 68.804 us 6.57% -3.026 us -4.21% PASS
2^16 0 0 78.070 us 5.15% 81.142 us 5.82% 3.072 us 3.94% PASS
2^20 0 0 267.730 us 8.76% 228.207 us 12.14% -39.523 us -14.76% FAIL
2^10 1 0 192.670 us 4.20% 195.295 us 4.21% 2.625 us 1.36% PASS
2^16 1 0 199.809 us 4.59% 199.813 us 4.16% 0.004 us 0.00% PASS
2^20 1 0 403.536 us 4.08% 400.537 us 7.91% -2.999 us -0.74% PASS
2^10 8 0 631.825 us 2.70% 628.752 us 3.03% -3.073 us -0.49% PASS
2^16 8 0 651.027 us 2.83% 645.891 us 2.99% -5.136 us -0.79% PASS
2^20 8 0 1.049 ms 2.41% 990.536 us 2.78% -58.560 us -5.58% FAIL
2^10 0 1 127.173 us 5.66% 127.897 us 3.85% 0.724 us 0.57% PASS
2^16 0 1 128.940 us 4.33% 132.152 us 4.29% 3.212 us 2.49% PASS
2^20 0 1 304.425 us 5.95% 268.729 us 7.05% -35.696 us -11.73% FAIL
2^10 1 1 193.054 us 4.19% 195.841 us 4.19% 2.787 us 1.44% PASS
2^16 1 1 199.877 us 4.14% 201.166 us 4.07% 1.289 us 0.64% PASS
2^20 1 1 429.069 us 5.37% 431.467 us 7.65% 2.398 us 0.56% PASS
2^10 8 1 627.959 us 2.97% 625.391 us 2.84% -2.568 us -0.41% PASS
2^16 8 1 654.510 us 3.21% 648.749 us 2.94% -5.761 us -0.88% PASS
2^20 8 1 1.055 ms 2.13% 1.006 ms 2.87% -49.153 us -4.66% FAIL

PointKernel avatar Feb 16 '24 22:02 PointKernel

@vyasr Thanks for the review, have you reviewed both cpp and python or do I need another cpp/python approval to merge the PR?

PointKernel avatar Feb 27 '24 21:02 PointKernel

I reviewed both. I think the main question on the Python side would be whether we need to update the docs in https://github.com/rapidsai/cudf/blob/branch-24.04/docs/cudf/source/user_guide/pandas-comparison.md#result-ordering to also mention value_counts. @shwina WDYT? It's not immediately obvious that this would be implemented as a groupby under the hood.

vyasr avatar Feb 27 '24 22:02 vyasr

I think we should do that, yes (this PR is then a breaking change)

shwina avatar Feb 28 '24 18:02 shwina

Also CC @mroeschke @galipremsagar we may have to introduce an extra sorting in pandas compatibility mode due to the ordering change in this PR for cudf.pandas to behave as expected. No need to make any changes yet, just something to keep on your radars.

vyasr avatar Feb 28 '24 20:02 vyasr

@vyasr @shwina I've updated the doc as suggested and marked this work as a breaking change.

Thanks for your comments.

PointKernel avatar Feb 29 '24 01:02 PointKernel

/merge

PointKernel avatar Feb 29 '24 21:02 PointKernel