cudf
Refactor joins for conditional semis and antis
Contributes to #10039
Currently, conditional_joins for both semi and anti joins rely on an implementation that was designed to handle results from both tables involved in the join. This leads to wasteful allocations that can be avoided in these two cases.
Description
Add a new kernel to be used for both semi and anti joins. Add new device functions that use only a single shared_memory array for caching.
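To make the motivation concrete, here is a minimal host-side sketch of what a conditional semi or anti join computes. It is not the cudf kernel: `join_kind`, `conditional_semi_anti_join`, and the predicate parameter are hypothetical names used only for illustration. The point it shows is that the result is a single array of left-row indices, so a second output (and a second cache) for right-row indices is unnecessary.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical names throughout; a host-side reference, not the cudf kernel.
enum class join_kind { LEFT_SEMI, LEFT_ANTI };

// Returns the left-row indices kept by a conditional semi or anti join.
// Only one index array is produced: right-row indices are never part of a
// semi/anti result.
template <typename L, typename R, typename Pred>
std::vector<std::size_t> conditional_semi_anti_join(std::vector<L> const& left,
                                                    std::vector<R> const& right,
                                                    Pred predicate,
                                                    join_kind kind)
{
  std::vector<std::size_t> left_indices;
  for (std::size_t i = 0; i < left.size(); ++i) {
    bool matched = false;
    // Stop scanning the right table as soon as one match is found.
    for (std::size_t j = 0; j < right.size() && !matched; ++j) {
      matched = predicate(left[i], right[j]);
    }
    // Semi keeps rows with at least one match; anti keeps rows with none.
    if (matched == (kind == join_kind::LEFT_SEMI)) { left_indices.push_back(i); }
  }
  return left_indices;
}
```

Called with a left table, a right table, and a lambda such as `[](int l, int r) { return l > r; }`, the function returns the matching left indices for `LEFT_SEMI` and the non-matching ones for `LEFT_ANTI`.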
Tests pass on my 3080.
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.
This pull request requires additional validation before any workflows can run on NVIDIA's runners.
Pull request vetters can view their responsibilities here.
Contributors can view more details about this message here.
CC @bdice @vyasr, please let me know if any changes need to be made or if I misunderstood anything. I assume that, in the interest of keeping PRs small, this one shouldn't touch the size APIs.
I did some benchmarking with branch-24.02 and this branch; the performance gains were negligible/statistically insignificant (1-3%). However, when I removed the compute_size kernels and instead used the pessimistic assumption that the output size is always the left table size N (trading memory for runtime speedup), the gains were significant.
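The trade-off described above can be sketched on the host as follows. This is an analogy under stated assumptions, not the cudf implementation: `gather_two_pass`, `gather_pessimistic`, and `keep` are hypothetical names, and `keep(i)` stands in for evaluating the join predicate for left row `i`. Because a semi or anti join emits each left row at most once, N is a safe upper bound on the output size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two-pass strategy: the first pass only counts output rows (the role played
// by a size-computing kernel), the second pass writes into an exact-size buffer.
// Note that keep(i) is evaluated twice per row.
template <typename Keep>
std::vector<std::size_t> gather_two_pass(std::size_t n, Keep keep)
{
  std::size_t count = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (keep(i)) { ++count; }
  }
  std::vector<std::size_t> out(count);
  std::size_t pos = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (keep(i)) { out[pos++] = i; }
  }
  return out;
}

// Pessimistic strategy: allocate n slots up front (the upper bound), write in
// a single pass, then shrink. Uses more memory, evaluates keep(i) once per row.
template <typename Keep>
std::vector<std::size_t> gather_pessimistic(std::size_t n, Keep keep)
{
  std::vector<std::size_t> out(n);
  std::size_t count = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (keep(i)) { out[count++] = i; }
  }
  assert(count <= n);  // the output can never exceed the left table size
  out.resize(count);
  return out;
}
```

Both functions return the same indices; the difference is only in how many times the predicate runs and how much memory is reserved before the final size is known.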
My specs are as follows:
- CPU: 12th Gen Intel(R) Core(TM) i9-12900K, 3200 MHz, 16 cores, 24 logical processors
- GPU: RTX 3080
- RAM: 64 GB DDR5
- OS: WSL2 on a Windows 11 host
/ok to test
@vyasr would you please take a look when you get back?
Please note that this PR addresses part of https://github.com/rapidsai/cudf/issues/10039
/ok to test
@DanialJavady96 Making this ready for review to draw proper attention from reviewers
On this branch:
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/100000/manual_time 314 ms 314 ms 2
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/400000/manual_time 1138 ms 1138 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/1000000/manual_time 2771 ms 2771 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/100000/manual_time 322 ms 322 ms 2
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/400000/manual_time 1161 ms 1162 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/1000000/manual_time 2836 ms 2836 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/100000/manual_time 540 ms 540 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/400000/manual_time 1935 ms 1935 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/1000000/manual_time 4747 ms 4747 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/100000/manual_time 548 ms 548 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/400000/manual_time 2001 ms 2001 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/1000000/manual_time 4881 ms 4881 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/100000/manual_time 323 ms 323 ms 2
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/400000/manual_time 1155 ms 1155 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/1000000/manual_time 2784 ms 2784 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/100000/manual_time 327 ms 327 ms 2
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/400000/manual_time 1163 ms 1163 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/1000000/manual_time 2906 ms 2906 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/100000/manual_time 544 ms 544 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/400000/manual_time 1986 ms 1986 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/1000000/manual_time 4774 ms 4774 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/100000/manual_time 559 ms 559 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/400000/manual_time 2045 ms 2045 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/1000000/manual_time 4925 ms 4925 ms 1
On branch-24.02:
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/100000/manual_time 317 ms 317 ms 2
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/400000/manual_time 1138 ms 1137 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/1000000/manual_time 2788 ms 2788 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/100000/manual_time 323 ms 323 ms 2
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/400000/manual_time 1167 ms 1167 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/1000000/manual_time 2861 ms 2861 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/100000/manual_time 543 ms 543 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/400000/manual_time 1952 ms 1952 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/1000000/manual_time 4830 ms 4830 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/100000/manual_time 576 ms 576 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/400000/manual_time 2018 ms 2018 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/1000000/manual_time 4931 ms 4931 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/100000/manual_time 323 ms 323 ms 2
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/400000/manual_time 1151 ms 1151 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/1000000/manual_time 2841 ms 2841 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/100000/manual_time 330 ms 330 ms 2
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/400000/manual_time 1180 ms 1180 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/1000000/manual_time 2961 ms 2961 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/100000/manual_time 540 ms 540 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/400000/manual_time 1962 ms 1962 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/1000000/manual_time 4813 ms 4813 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/100000/manual_time 566 ms 566 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/400000/manual_time 2085 ms 2085 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/1000000/manual_time 5063 ms 5063 ms 1
Unfortunately, I'm not getting significant speedups. Would it make sense to include the removal of the join size kernels in this PR? @PointKernel
/ok to test
/ok to test
Do we need any expanded tests? I'll try to look into that.
Responding to myself -- I think our testing looks okay for now. I don't know of anything that would need to be changed. https://github.com/rapidsai/cudf/blob/branch-24.06/cpp/tests/join/conditional_join_tests.cu
/ok to test
@bdice
Benchmark Time CPU Iterations
----------------------------------------------------------------------------------------------------------------------------------------------
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/100000/manual_time 311 ms 312 ms 2
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/400000/manual_time 1126 ms 1126 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/1000000/manual_time 2748 ms 2748 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/100000/manual_time 318 ms 318 ms 2
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/400000/manual_time 1147 ms 1147 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/1000000/manual_time 2796 ms 2796 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/100000/manual_time 415 ms 415 ms 2
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/400000/manual_time 1485 ms 1485 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/1000000/manual_time 3605 ms 3605 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/100000/manual_time 417 ms 417 ms 2
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/400000/manual_time 1497 ms 1497 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/1000000/manual_time 3651 ms 3651 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/100000/manual_time 310 ms 310 ms 2
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/400000/manual_time 1117 ms 1117 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/1000000/manual_time 2725 ms 2725 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/100000/manual_time 316 ms 316 ms 2
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/400000/manual_time 1142 ms 1142 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/1000000/manual_time 2782 ms 2782 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/100000/manual_time 412 ms 412 ms 2
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/400000/manual_time 1482 ms 1482 ms 1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/1000000/manual_time 3615 ms 3615 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/100000/manual_time 418 ms 418 ms 2
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/400000/manual_time 1501 ms 1501 ms 1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/1000000/manual_time 3658 ms 3658 ms 1
(pyt_dev) ksm@Kashimo:~/cudf/cpp/build/benchmarks$
Compared to the benchmarks here,
https://github.com/rapidsai/cudf/pull/14646#issuecomment-1877775716
Looks pretty good! Some of the gains are quite significant.
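For a rough sense of scale, the relative gain can be computed from figures already posted in this thread. The helper below is a hypothetical illustration (`percent_reduction` is not part of cudf or Google Benchmark). Taking `conditional_left_anti_join_32bit_nulls/100000/1000000`, the earlier "On this branch" run reported 4747 ms and the latest run reports 3605 ms, a reduction of roughly 24%. The non-null variant of the same case went from 2771 ms to 2748 ms, under 1%. The runs were taken at different times and mostly with a single iteration, so these numbers should be read as approximate.

```cpp
#include <cassert>

// Percentage reduction in runtime relative to a baseline.
// A positive result means the new measurement is faster.
inline double percent_reduction(double baseline_ms, double new_ms)
{
  assert(baseline_ms > 0.0);
  return 100.0 * (baseline_ms - new_ms) / baseline_ms;
}
```

For example, `percent_reduction(4747.0, 3605.0)` gives the nulls-case figure quoted above, which suggests that most of the improvement is concentrated in the benchmarks with nulls.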
/ok to test
/merge