cudf
cudf copied to clipboard
Add distinct key inner join
Description
Contributes to #14948
This PR adds a public cudf::distinct_hash_join
class that provides a fast code path for joins with distinct keys.
Only distinct inner join is tackled in the current PR.
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
🔥🔥🔥
@jlowe The prototype should be ready for testing. Please let me know if you find any bugs/issues.
I'm testing/tuning several alternative algorithms locally for the probe kernel so no need to worry about the performance for now.
TODO
:
- [x] Add benchmarks
- [x] Add more unit tests
- [x] Add scalar-based probe kernel with scan
- [x] Evaluate scalar-based kernel and CG-based kernel
- [x] Per-block flush v.s. per-CG flush
- [x] Tune hash table probing scheme, cg_size, window_size, hasher, buffer size and block size
@gaohao95 @bdice Requested your reviews since the prototype is up for testing.
On the performance side, I tested 67c683b18a46686867f3f0b5e65bcff79e4d6f74 on an A100 with 1 billion probe table rows with various build table rows and 0.01 selectivity. The speedup is good, a solid 2x against cudf::hash_join
as a baseline, but perhaps the probe performance can be even better if we improve occupancy (currently at 44%).
Build row | Build time baseline | Probe time baseline | Build time | Probe time | Probe speedup | Overall speedup |
---|---|---|---|---|---|---|
16 | 0 | 153 | 0 | 64 | 2.39 | 2.39 |
64 | 0 | 203 | 0 | 71 | 2.86 | 2.86 |
256 | 0 | 199 | 0 | 81 | 2.46 | 2.46 |
1000 | 0 | 210 | 0 | 81 | 2.59 | 2.59 |
4000 | 0 | 217 | 0 | 83 | 2.61 | 2.61 |
16000 | 0 | 236 | 0 | 95 | 2.48 | 2.48 |
64000 | 0 | 236 | 0 | 97 | 2.43 | 2.43 |
256000 | 0 | 240 | 0 | 98 | 2.45 | 2.45 |
1000000 | 0 | 249 | 0 | 103 | 2.42 | 2.42 |
4000000 | 0 | 288 | 0 | 128 | 2.25 | 2.25 |
16000000 | 3 | 302 | 3 | 135 | 2.24 | 2.21 |
64000000 | 12 | 305 | 13 | 137 | 2.23 | 2.11 |
256000000 | 52 | 302 | 52 | 135 | 2.24 | 1.89 |
1000000000 | 203 | 282 | 206 | 128 | 2.20 | 1.45 |
Are you referring to GPU occupancy or hash map occupancy (or load factor)?
The occupancy of the kernel unique_join_probe_kernel
.
Benchmark results between hash_join::inner_join
and distinct_hash_join::inner_join
on my local RTX8000:
inner_join_32bit
Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Samples | CPU Time | Noise | GPU Time | Noise |
---|---|---|---|---|---|---|---|---|---|
I32 | I32 | 0 | 100000 | 100000 | 4432x | 116.702 us | 14.87% | 112.959 us | 14.27% |
I32 | I32 | 0 | 100000 | 400000 | 1936x | 261.727 us | 1.84% | 258.278 us | 1.24% |
I32 | I32 | 0 | 10000000 | 10000000 | 31x | 16.593 ms | 0.08% | 16.589 ms | 0.07% |
I32 | I32 | 0 | 10000000 | 40000000 | 11x | 47.024 ms | 0.03% | 47.022 ms | 0.03% |
I32 | I32 | 0 | 10000000 | 100000000 | 11x | 108.720 ms | 0.02% | 108.718 ms | 0.02% |
I32 | I32 | 0 | 80000000 | 100000000 | 11x | 157.173 ms | 0.02% | 157.171 ms | 0.02% |
I32 | I32 | 0 | 100000000 | 100000000 | 11x | 171.494 ms | 0.06% | 171.493 ms | 0.06% |
I32 | I32 | 0 | 10000000 | 240000000 | 11x | 259.605 ms | 0.01% | 259.605 ms | 0.01% |
I32 | I32 | 0 | 80000000 | 240000000 | 11x | 297.980 ms | 0.03% | 297.980 ms | 0.03% |
I32 | I32 | 0 | 100000000 | 240000000 | 11x | 312.012 ms | 0.02% | 312.011 ms | 0.02% |
distinct_inner_join_32bit
Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Samples | CPU Time | Noise | GPU Time | Noise |
---|---|---|---|---|---|---|---|---|---|
I32 | I32 | 0 | 100000 | 100000 | 7920x | 66.934 us | 11.01% | 63.133 us | 8.77% |
I32 | I32 | 0 | 100000 | 400000 | 2944x | 173.634 us | 2.93% | 170.163 us | 2.07% |
I32 | I32 | 0 | 10000000 | 10000000 | 41x | 12.212 ms | 0.12% | 12.208 ms | 0.11% |
I32 | I32 | 0 | 10000000 | 40000000 | 13x | 38.596 ms | 0.03% | 38.593 ms | 0.03% |
I32 | I32 | 0 | 10000000 | 100000000 | 11x | 88.920 ms | 0.14% | 88.918 ms | 0.14% |
I32 | I32 | 0 | 80000000 | 100000000 | 11x | 118.820 ms | 0.03% | 118.818 ms | 0.03% |
I32 | I32 | 0 | 100000000 | 100000000 | 11x | 126.045 ms | 0.02% | 126.043 ms | 0.02% |
I32 | I32 | 0 | 10000000 | 240000000 | 11x | 211.569 ms | 0.01% | 211.567 ms | 0.01% |
I32 | I32 | 0 | 80000000 | 240000000 | 11x | 242.496 ms | 0.05% | 242.495 ms | 0.05% |
I32 | I32 | 0 | 100000000 | 240000000 | 11x | 250.907 ms | 0.02% | 250.906 ms | 0.02% |
/merge