cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Add distinct key inner join

Open PointKernel opened this issue 1 year ago • 6 comments

Description

Contributes to #14948

This PR adds a public cudf::distinct_hash_join class that provides a fast code path for joins with distinct keys.

Only distinct inner join is tackled in the current PR.

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

PointKernel avatar Feb 07 '24 03:02 PointKernel

🔥🔥🔥

GregoryKimball avatar Feb 07 '24 04:02 GregoryKimball

@jlowe The prototype should be ready for testing. Please let me know if you find any bugs/issues.

I'm testing/tuning several alternative algorithms locally for the probe kernel so no need to worry about the performance for now.

TODO:

  • [x] Add benchmarks
  • [x] Add more unit tests
  • [x] Add scalar-based probe kernel with scan
  • [x] Evaluate scalar-based kernel and CG-based kernel
  • [x] Per-block flush v.s. per-CG flush
  • [x] Tune hash table probing scheme, cg_size, window_size, hasher, buffer size and block size

PointKernel avatar Feb 08 '24 02:02 PointKernel

@gaohao95 @bdice Requested your reviews since the prototype is up for testing.

PointKernel avatar Feb 08 '24 02:02 PointKernel

On the performance side, I tested 67c683b18a46686867f3f0b5e65bcff79e4d6f74 on an A100 with 1 billion probe table rows with various build table rows and 0.01 selectivity. The speedup is good, a solid 2x against cudf::hash_join as a baseline, but perhaps the probe performance can be even better if we improve occupancy (currently at 44%).

Build row Build time baseline Probe time baseline Build time Probe time Probe speedup Overall speedup
16 0 153 0 64 2.39 2.39
64 0 203 0 71 2.86 2.86
256 0 199 0 81 2.46 2.46
1000 0 210 0 81 2.59 2.59
4000 0 217 0 83 2.61 2.61
16000 0 236 0 95 2.48 2.48
64000 0 236 0 97 2.43 2.43
256000 0 240 0 98 2.45 2.45
1000000 0 249 0 103 2.42 2.42
4000000 0 288 0 128 2.25 2.25
16000000 3 302 3 135 2.24 2.21
64000000 12 305 13 137 2.23 2.11
256000000 52 302 52 135 2.24 1.89
1000000000 203 282 206 128 2.20 1.45

gaohao95 avatar Feb 09 '24 00:02 gaohao95

Are you referring to GPU occupancy or hash map occupancy (or load factor)?

The occupancy of the kernel unique_join_probe_kernel.

gaohao95 avatar Feb 09 '24 01:02 gaohao95

Benchmark results between hash_join::inner_join and distinct_hash_join::inner_join on my local RTX8000:

inner_join_32bit

Key Type Payload Type Nullable Build Table Size Probe Table Size Samples CPU Time Noise GPU Time Noise
I32 I32 0 100000 100000 4432x 116.702 us 14.87% 112.959 us 14.27%
I32 I32 0 100000 400000 1936x 261.727 us 1.84% 258.278 us 1.24%
I32 I32 0 10000000 10000000 31x 16.593 ms 0.08% 16.589 ms 0.07%
I32 I32 0 10000000 40000000 11x 47.024 ms 0.03% 47.022 ms 0.03%
I32 I32 0 10000000 100000000 11x 108.720 ms 0.02% 108.718 ms 0.02%
I32 I32 0 80000000 100000000 11x 157.173 ms 0.02% 157.171 ms 0.02%
I32 I32 0 100000000 100000000 11x 171.494 ms 0.06% 171.493 ms 0.06%
I32 I32 0 10000000 240000000 11x 259.605 ms 0.01% 259.605 ms 0.01%
I32 I32 0 80000000 240000000 11x 297.980 ms 0.03% 297.980 ms 0.03%
I32 I32 0 100000000 240000000 11x 312.012 ms 0.02% 312.011 ms 0.02%

distinct_inner_join_32bit

Key Type Payload Type Nullable Build Table Size Probe Table Size Samples CPU Time Noise GPU Time Noise
I32 I32 0 100000 100000 7920x 66.934 us 11.01% 63.133 us 8.77%
I32 I32 0 100000 400000 2944x 173.634 us 2.93% 170.163 us 2.07%
I32 I32 0 10000000 10000000 41x 12.212 ms 0.12% 12.208 ms 0.11%
I32 I32 0 10000000 40000000 13x 38.596 ms 0.03% 38.593 ms 0.03%
I32 I32 0 10000000 100000000 11x 88.920 ms 0.14% 88.918 ms 0.14%
I32 I32 0 80000000 100000000 11x 118.820 ms 0.03% 118.818 ms 0.03%
I32 I32 0 100000000 100000000 11x 126.045 ms 0.02% 126.043 ms 0.02%
I32 I32 0 10000000 240000000 11x 211.569 ms 0.01% 211.567 ms 0.01%
I32 I32 0 80000000 240000000 11x 242.496 ms 0.05% 242.495 ms 0.05%
I32 I32 0 100000000 240000000 11x 250.907 ms 0.02% 250.906 ms 0.02%

PointKernel avatar Feb 15 '24 22:02 PointKernel

/merge

PointKernel avatar Feb 23 '24 19:02 PointKernel