cudf
cudf copied to clipboard
Use cuco::static_set in the hash-based groupby
Description
Depends on https://github.com/rapidsai/cudf/pull/14849
Contributes to #12261
This PR migrates hash groupby to use the new cuco::static_set
data structure.
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Question to reviewers: is it worth doing runtime dispatching based on whether nested types are involved or not?
Benchmark results are shown below, TLDR:
- The hash table size is about half as before
- The new set groupby brings from 0 to 30% speedups for flat types
- The improvement for nested type is noticeable but not as good as flat types
Based on our past tuning experience, e.g.: https://github.com/rapidsai/cudf/blob/3ba63c3c3cb72950adc4c9699fcfa1a72796a041/cpp/src/search/contains_table.cu#L158-L165 using a larger CG size (like 2 or 4) for nested types can bring significant speedups comapred to the current CGSize == 1
for all types.
groupby_max
[0] Quadro RTX 8000
T | num_rows | null_probability | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
---|---|---|---|---|---|---|---|---|---|
I32 | 2^12 | 0 | 83.292 us | 7.14% | 83.962 us | 7.56% | 0.670 us | 0.80% | PASS |
I32 | 2^18 | 0 | 123.161 us | 21.91% | 103.673 us | 3.47% | -19.489 us | -15.82% | FAIL |
I32 | 2^24 | 0 | 2.702 ms | 5.78% | 1.886 ms | 4.61% | -815.711 us | -30.19% | FAIL |
I32 | 2^12 | 0.1 | 106.195 us | 4.47% | 105.438 us | 5.46% | -0.757 us | -0.71% | PASS |
I32 | 2^18 | 0.1 | 144.568 us | 4.03% | 130.310 us | 4.62% | -14.258 us | -9.86% | FAIL |
I32 | 2^24 | 0.1 | 2.713 ms | 2.97% | 1.956 ms | 3.43% | -756.833 us | -27.90% | FAIL |
I32 | 2^12 | 0.9 | 106.164 us | 5.38% | 105.230 us | 5.81% | -0.933 us | -0.88% | PASS |
I32 | 2^18 | 0.9 | 134.932 us | 3.71% | 126.974 us | 4.86% | -7.958 us | -5.90% | FAIL |
I32 | 2^24 | 0.9 | 2.396 ms | 0.49% | 1.700 ms | 0.94% | -695.351 us | -29.03% | FAIL |
I64 | 2^12 | 0 | 79.217 us | 4.74% | 81.464 us | 5.33% | 2.247 us | 2.84% | PASS |
I64 | 2^18 | 0 | 124.065 us | 3.35% | 110.290 us | 3.71% | -13.776 us | -11.10% | FAIL |
I64 | 2^24 | 0 | 2.921 ms | 5.13% | 2.080 ms | 3.67% | -841.509 us | -28.81% | FAIL |
I64 | 2^12 | 0.1 | 107.544 us | 5.30% | 106.271 us | 5.33% | -1.273 us | -1.18% | PASS |
I64 | 2^18 | 0.1 | 158.161 us | 12.67% | 138.162 us | 3.74% | -19.999 us | -12.64% | FAIL |
I64 | 2^24 | 0.1 | 2.935 ms | 3.03% | 2.149 ms | 2.33% | -786.224 us | -26.79% | FAIL |
I64 | 2^12 | 0.9 | 106.698 us | 5.45% | 105.586 us | 5.56% | -1.112 us | -1.04% | PASS |
I64 | 2^18 | 0.9 | 141.798 us | 3.56% | 131.287 us | 5.12% | -10.512 us | -7.41% | FAIL |
I64 | 2^24 | 0.9 | 2.528 ms | 0.41% | 1.843 ms | 0.67% | -684.869 us | -27.09% | FAIL |
F32 | 2^12 | 0 | 87.432 us | 4.34% | 87.644 us | 5.17% | 0.212 us | 0.24% | PASS |
F32 | 2^18 | 0 | 139.932 us | 5.17% | 120.252 us | 3.90% | -19.680 us | -14.06% | FAIL |
F32 | 2^24 | 0 | 3.034 ms | 8.05% | 2.159 ms | 8.12% | -875.115 us | -28.85% | FAIL |
F32 | 2^12 | 0.1 | 118.647 us | 4.18% | 117.394 us | 4.97% | -1.254 us | -1.06% | PASS |
F32 | 2^18 | 0.1 | 167.872 us | 5.11% | 145.176 us | 3.74% | -22.696 us | -13.52% | FAIL |
F32 | 2^24 | 0.1 | 2.997 ms | 5.81% | 2.172 ms | 6.50% | -825.041 us | -27.53% | FAIL |
F32 | 2^12 | 0.9 | 121.641 us | 6.73% | 111.989 us | 5.16% | -9.652 us | -7.93% | FAIL |
F32 | 2^18 | 0.9 | 143.159 us | 4.11% | 131.017 us | 3.83% | -12.142 us | -8.48% | FAIL |
F32 | 2^24 | 0.9 | 2.489 ms | 0.43% | 1.787 ms | 0.84% | -702.675 us | -28.23% | FAIL |
F64 | 2^12 | 0 | 87.555 us | 4.14% | 88.347 us | 4.57% | 0.792 us | 0.90% | PASS |
F64 | 2^18 | 0 | 156.124 us | 17.01% | 127.645 us | 4.27% | -28.480 us | -18.24% | FAIL |
F64 | 2^24 | 0 | 3.300 ms | 8.27% | 2.337 ms | 8.59% | -963.100 us | -29.19% | FAIL |
F64 | 2^12 | 0.1 | 119.458 us | 4.38% | 117.437 us | 5.24% | -2.021 us | -1.69% | PASS |
F64 | 2^18 | 0.1 | 173.348 us | 4.67% | 152.983 us | 4.03% | -20.365 us | -11.75% | FAIL |
F64 | 2^24 | 0.1 | 3.249 ms | 6.93% | 2.349 ms | 5.91% | -900.036 us | -27.70% | FAIL |
F64 | 2^12 | 0.9 | 112.170 us | 4.52% | 113.566 us | 5.73% | 1.396 us | 1.24% | PASS |
F64 | 2^18 | 0.9 | 148.631 us | 3.42% | 135.485 us | 3.78% | -13.146 us | -8.84% | FAIL |
F64 | 2^24 | 0.9 | 2.612 ms | 0.49% | 1.906 ms | 0.64% | -705.332 us | -27.01% | FAIL |
groupby_struct_keys
[0] Quadro RTX 8000
NumRows | Depth | Nulls | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
---|---|---|---|---|---|---|---|---|---|
2^10 | 0 | 0 | 71.830 us | 4.79% | 68.804 us | 6.57% | -3.026 us | -4.21% | PASS |
2^16 | 0 | 0 | 78.070 us | 5.15% | 81.142 us | 5.82% | 3.072 us | 3.94% | PASS |
2^20 | 0 | 0 | 267.730 us | 8.76% | 228.207 us | 12.14% | -39.523 us | -14.76% | FAIL |
2^10 | 1 | 0 | 192.670 us | 4.20% | 195.295 us | 4.21% | 2.625 us | 1.36% | PASS |
2^16 | 1 | 0 | 199.809 us | 4.59% | 199.813 us | 4.16% | 0.004 us | 0.00% | PASS |
2^20 | 1 | 0 | 403.536 us | 4.08% | 400.537 us | 7.91% | -2.999 us | -0.74% | PASS |
2^10 | 8 | 0 | 631.825 us | 2.70% | 628.752 us | 3.03% | -3.073 us | -0.49% | PASS |
2^16 | 8 | 0 | 651.027 us | 2.83% | 645.891 us | 2.99% | -5.136 us | -0.79% | PASS |
2^20 | 8 | 0 | 1.049 ms | 2.41% | 990.536 us | 2.78% | -58.560 us | -5.58% | FAIL |
2^10 | 0 | 1 | 127.173 us | 5.66% | 127.897 us | 3.85% | 0.724 us | 0.57% | PASS |
2^16 | 0 | 1 | 128.940 us | 4.33% | 132.152 us | 4.29% | 3.212 us | 2.49% | PASS |
2^20 | 0 | 1 | 304.425 us | 5.95% | 268.729 us | 7.05% | -35.696 us | -11.73% | FAIL |
2^10 | 1 | 1 | 193.054 us | 4.19% | 195.841 us | 4.19% | 2.787 us | 1.44% | PASS |
2^16 | 1 | 1 | 199.877 us | 4.14% | 201.166 us | 4.07% | 1.289 us | 0.64% | PASS |
2^20 | 1 | 1 | 429.069 us | 5.37% | 431.467 us | 7.65% | 2.398 us | 0.56% | PASS |
2^10 | 8 | 1 | 627.959 us | 2.97% | 625.391 us | 2.84% | -2.568 us | -0.41% | PASS |
2^16 | 8 | 1 | 654.510 us | 3.21% | 648.749 us | 2.94% | -5.761 us | -0.88% | PASS |
2^20 | 8 | 1 | 1.055 ms | 2.13% | 1.006 ms | 2.87% | -49.153 us | -4.66% | FAIL |
@vyasr Thanks for the review, have you reviewed both cpp and python or do I need another cpp/python approval to merge the PR?
I reviewed both. I think the main question on the Python side would be whether we need to update the docs in https://github.com/rapidsai/cudf/blob/branch-24.04/docs/cudf/source/user_guide/pandas-comparison.md#result-ordering to also mention value_counts
. @shwina WDYT? It's not immediately obvious that this would be implemented as a groupby under the hood.
I think we should do that, yes (this PR is then a breaking change)
Also CC @mroeschke @galipremsagar we may have to introduce an extra sorting in pandas compatibility mode due to the ordering change in this PR for cudf.pandas to behave as expected. No need to make any changes yet, just something to keep on your radars.
@vyasr @shwina I've updated the doc as suggested and marked this work as a breaking change.
Thanks for your comments.
/merge