cudf
Occupancy improvement for Hash table build
Description
Implements specialized template dispatch for hash joins and mixed semi joins to fix the issue described in https://github.com/rapidsai/cudf/issues/15502.
At a high level, this PR typedefs some types to void, depending on the column types in the rows, to avoid the high register usage of comparator and hasher operations on more involved types (lists, structs, strings, ...). The choice is made by dynamic dispatch on the CPU side using std::variant and std::visit, which then dispatches to a specialized template.
This pattern can later be extended to other joins and to groupby operations. Any operator that uses the row hasher and row comparator should see an improvement in occupancy for hash table build/probe operations.
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
I think the approach of specializing the type dispatcher is very cumbersome and will lead to a lot of code replication. Currently, I have the conditional dispatch working for device_row_hasher, but I am unsure whether there is a better way to implement this. We could introduce a macro here to generate the code. What do you think?
/ok to test
/ok to test
@tgujar I've updated the docs to unblock CI. Have you noticed any performance regressions for other use cases? It seems that it improves the performance for mixed join but the performance drops significantly in other cases using row hasher.
/ok to test
@tgujar Could you take a look at the failing tests?
/ok to test
/ok to test
This PR needs to be rebased on branch-24.08.
Specializing both the comparator and the hasher drops the register usage to 54 instead of the expected 46 for the mixed semi join case. I am investigating why the register pressure differs from what I see when commenting out the code paths. The current plan is to avoid using a macro (as mentioned here) and instead do dynamic dispatch on the CPU side using std::variant and std::visit.
I have a question here. Is it preferable that I make the changes to all the join operations in this PR or break them up into different ones?
We could just focus on mixed join for this PR. The goal is mainly to evaluate the performance impact and design of the new dispatching method.
Benchmark results. This PR adds specialized dispatch for build and probe in the case of hash joins, and only for build in the case of mixed semi/anti joins. Other joins are not modified.
# inner_join
## [0] NVIDIA A100 80GB PCIe
| Key | Nullable | left_size | right_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
| I32 | 0 | 1000 | 1000 | 112.437 us | 3.93% | 107.061 us | 4.98% | -5.376 us | -4.78% | FAIL |
| I32 | 0 | 100000 | 1000 | 135.776 us | 2.02% | 128.506 us | 1.90% | -7.270 us | -5.35% | FAIL |
| I32 | 0 | 10000000 | 1000 | 3.058 ms | 0.46% | 2.167 ms | 0.45% | -890.462 us | -29.12% | FAIL |
| I32 | 0 | 100000 | 100000 | 156.405 us | 2.20% | 145.498 us | 1.21% | -10.907 us | -6.97% | FAIL |
| I32 | 0 | 10000000 | 100000 | 3.631 ms | 0.12% | 2.531 ms | 0.16% | -1100.242 us | -30.30% | FAIL |
| I32 | 0 | 10000000 | 10000000 | 6.655 ms | 0.06% | 5.073 ms | 0.09% | -1581.481 us | -23.77% | FAIL |
| I32 | 1 | 1000 | 1000 | 122.827 us | 1.43% | 124.232 us | 1.44% | 1.405 us | 1.14% | PASS |
| I32 | 1 | 100000 | 1000 | 139.361 us | 1.17% | 137.219 us | 3.06% | -2.142 us | -1.54% | FAIL |
| I32 | 1 | 10000000 | 1000 | 1.977 ms | 0.21% | 1.354 ms | 0.34% | -622.759 us | -31.51% | FAIL |
| I32 | 1 | 100000 | 100000 | 144.414 us | 1.21% | 143.193 us | 2.52% | -1.221 us | -0.85% | PASS |
| I32 | 1 | 10000000 | 100000 | 2.163 ms | 0.13% | 1.473 ms | 0.43% | -690.769 us | -31.93% | FAIL |
| I32 | 1 | 10000000 | 10000000 | 3.260 ms | 0.12% | 2.253 ms | 0.24% | -1006.706 us | -30.88% | FAIL |
| I64 | 0 | 1000 | 1000 | 114.109 us | 3.34% | 105.741 us | 3.71% | -8.368 us | -7.33% | FAIL |
| I64 | 0 | 100000 | 1000 | 136.939 us | 2.24% | 131.708 us | 1.55% | -5.230 us | -3.82% | FAIL |
| I64 | 0 | 10000000 | 1000 | 3.146 ms | 0.56% | 2.216 ms | 0.45% | -929.616 us | -29.55% | FAIL |
| I64 | 0 | 100000 | 100000 | 156.054 us | 1.20% | 146.700 us | 2.20% | -9.354 us | -5.99% | FAIL |
| I64 | 0 | 10000000 | 100000 | 3.715 ms | 0.14% | 2.581 ms | 0.17% | -1134.293 us | -30.53% | FAIL |
| I64 | 0 | 10000000 | 10000000 | 6.750 ms | 0.07% | 5.131 ms | 0.08% | -1618.389 us | -23.98% | FAIL |
| I64 | 1 | 1000 | 1000 | 123.094 us | 1.38% | 124.900 us | 1.39% | 1.805 us | 1.47% | FAIL |
| I64 | 1 | 100000 | 1000 | 141.180 us | 1.30% | 137.238 us | 3.05% | -3.942 us | -2.79% | FAIL |
| I64 | 1 | 10000000 | 1000 | 2.019 ms | 0.09% | 1.397 ms | 0.28% | -622.010 us | -30.81% | FAIL |
| I64 | 1 | 100000 | 100000 | 143.681 us | 1.33% | 144.351 us | 1.42% | 0.671 us | 0.47% | PASS |
| I64 | 1 | 10000000 | 100000 | 2.219 ms | 0.10% | 1.516 ms | 0.27% | -703.369 us | -31.69% | FAIL |
| I64 | 1 | 10000000 | 10000000 | 3.333 ms | 0.14% | 2.307 ms | 0.24% | -1025.560 us | -30.77% | FAIL |
# left_join
## [0] NVIDIA A100 80GB PCIe
| Key | Nullable | left_size | right_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
| I32 | 0 | 1000 | 1000 | 111.987 us | 1.73% | 110.068 us | 3.30% | -1.919 us | -1.71% | PASS |
| I32 | 0 | 100000 | 1000 | 138.868 us | 3.10% | 128.825 us | 1.13% | -10.043 us | -7.23% | FAIL |
| I32 | 0 | 10000000 | 1000 | 3.241 ms | 0.53% | 2.276 ms | 0.50% | -965.407 us | -29.79% | FAIL |
| I32 | 0 | 100000 | 100000 | 157.198 us | 1.08% | 145.382 us | 1.09% | -11.816 us | -7.52% | FAIL |
| I32 | 0 | 10000000 | 100000 | 3.808 ms | 0.15% | 2.633 ms | 0.17% | -1174.401 us | -30.84% | FAIL |
| I32 | 0 | 10000000 | 10000000 | 6.859 ms | 0.06% | 5.204 ms | 0.08% | -1655.029 us | -24.13% | FAIL |
| I32 | 1 | 1000 | 1000 | 122.560 us | 1.50% | 124.198 us | 1.25% | 1.638 us | 1.34% | FAIL |
| I32 | 1 | 100000 | 1000 | 139.765 us | 1.22% | 139.785 us | 2.12% | 0.020 us | 0.01% | PASS |
| I32 | 1 | 10000000 | 1000 | 2.145 ms | 0.14% | 1.480 ms | 0.16% | -664.832 us | -31.00% | FAIL |
| I32 | 1 | 100000 | 100000 | 144.442 us | 1.29% | 144.435 us | 1.57% | -0.007 us | -0.00% | PASS |
| I32 | 1 | 10000000 | 100000 | 2.320 ms | 0.19% | 1.596 ms | 0.32% | -723.805 us | -31.20% | FAIL |
| I32 | 1 | 10000000 | 10000000 | 3.452 ms | 0.12% | 2.403 ms | 0.18% | -1048.513 us | -30.37% | FAIL |
| I64 | 0 | 1000 | 1000 | 112.913 us | 1.46% | 108.610 us | 3.88% | -4.303 us | -3.81% | FAIL |
| I64 | 0 | 100000 | 1000 | 142.333 us | 2.77% | 131.782 us | 1.10% | -10.551 us | -7.41% | FAIL |
| I64 | 0 | 10000000 | 1000 | 3.339 ms | 0.49% | 2.324 ms | 0.55% | -1014.754 us | -30.39% | FAIL |
| I64 | 0 | 100000 | 100000 | 156.852 us | 0.97% | 148.785 us | 2.70% | -8.066 us | -5.14% | FAIL |
| I64 | 0 | 10000000 | 100000 | 3.903 ms | 0.30% | 2.692 ms | 0.11% | -1211.272 us | -31.04% | FAIL |
| I64 | 0 | 10000000 | 10000000 | 6.956 ms | 0.06% | 5.262 ms | 0.09% | -1694.028 us | -24.35% | FAIL |
| I64 | 1 | 1000 | 1000 | 122.817 us | 1.29% | 124.738 us | 1.30% | 1.921 us | 1.56% | FAIL |
| I64 | 1 | 100000 | 1000 | 141.988 us | 1.36% | 141.088 us | 2.96% | -0.900 us | -0.63% | PASS |
| I64 | 1 | 10000000 | 1000 | 2.192 ms | 0.20% | 1.527 ms | 0.21% | -665.648 us | -30.36% | FAIL |
| I64 | 1 | 100000 | 100000 | 146.557 us | 2.33% | 144.878 us | 1.12% | -1.679 us | -1.15% | FAIL |
| I64 | 1 | 10000000 | 100000 | 2.383 ms | 0.15% | 1.640 ms | 0.16% | -743.069 us | -31.19% | FAIL |
| I64 | 1 | 10000000 | 10000000 | 3.524 ms | 0.13% | 2.461 ms | 0.20% | -1063.082 us | -30.17% | FAIL |
# full_join
## [0] NVIDIA A100 80GB PCIe
| Key | Nullable | left_size | right_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
| I32 | 0 | 1000 | 1000 | 202.495 us | 1.68% | 202.082 us | 2.38% | -0.413 us | -0.20% | PASS |
| I32 | 0 | 100000 | 1000 | 190.106 us | 3.26% | 183.246 us | 1.01% | -6.861 us | -3.61% | FAIL |
| I32 | 0 | 10000000 | 1000 | 3.942 ms | 0.43% | 2.975 ms | 0.38% | -967.704 us | -24.55% | FAIL |
| I32 | 0 | 100000 | 100000 | 254.050 us | 0.86% | 243.268 us | 0.77% | -10.781 us | -4.24% | FAIL |
| I32 | 0 | 10000000 | 100000 | 3.963 ms | 0.15% | 2.784 ms | 0.18% | -1179.510 us | -29.76% | FAIL |
| I32 | 0 | 10000000 | 10000000 | 7.499 ms | 0.07% | 5.839 ms | 0.08% | -1659.243 us | -22.13% | FAIL |
| I32 | 1 | 1000 | 1000 | 211.467 us | 1.07% | 215.023 us | 0.97% | 3.556 us | 1.68% | FAIL |
| I32 | 1 | 100000 | 1000 | 230.887 us | 1.04% | 231.303 us | 1.21% | 0.416 us | 0.18% | PASS |
| I32 | 1 | 10000000 | 1000 | 2.440 ms | 0.14% | 1.741 ms | 0.16% | -698.441 us | -28.63% | FAIL |
| I32 | 1 | 100000 | 100000 | 244.139 us | 1.79% | 241.811 us | 1.26% | -2.328 us | -0.95% | PASS |
| I32 | 1 | 10000000 | 100000 | 2.564 ms | 0.19% | 1.836 ms | 0.33% | -728.032 us | -28.40% | FAIL |
| I32 | 1 | 10000000 | 10000000 | 3.909 ms | 0.11% | 2.859 ms | 0.17% | -1050.267 us | -26.87% | FAIL |
| I64 | 0 | 1000 | 1000 | 203.301 us | 1.10% | 199.310 us | 2.21% | -3.991 us | -1.96% | FAIL |
| I64 | 0 | 100000 | 1000 | 198.917 us | 2.13% | 187.892 us | 0.97% | -11.025 us | -5.54% | FAIL |
| I64 | 0 | 10000000 | 1000 | 3.866 ms | 0.38% | 2.860 ms | 0.44% | -1006.472 us | -26.03% | FAIL |
| I64 | 0 | 100000 | 100000 | 254.073 us | 0.94% | 247.261 us | 1.67% | -6.811 us | -2.68% | FAIL |
| I64 | 0 | 10000000 | 100000 | 4.039 ms | 0.13% | 2.833 ms | 0.18% | -1205.339 us | -29.85% | FAIL |
| I64 | 0 | 10000000 | 10000000 | 7.598 ms | 0.13% | 5.899 ms | 0.08% | -1699.405 us | -22.37% | FAIL |
| I64 | 1 | 1000 | 1000 | 212.579 us | 1.12% | 215.639 us | 1.05% | 3.059 us | 1.44% | FAIL |
| I64 | 1 | 100000 | 1000 | 233.085 us | 1.01% | 233.765 us | 1.77% | 0.680 us | 0.29% | PASS |
| I64 | 1 | 10000000 | 1000 | 2.453 ms | 0.18% | 1.787 ms | 0.21% | -665.259 us | -27.12% | FAIL |
| I64 | 1 | 100000 | 100000 | 241.901 us | 1.29% | 241.377 us | 0.89% | -0.524 us | -0.22% | PASS |
| I64 | 1 | 10000000 | 100000 | 2.622 ms | 0.15% | 1.878 ms | 0.15% | -743.917 us | -28.37% | FAIL |
| I64 | 1 | 10000000 | 10000000 | 3.981 ms | 0.13% | 2.919 ms | 0.19% | -1061.823 us | -26.67% | FAIL |
# mixed_inner_join
## [0] NVIDIA A100 80GB PCIe
| Key | Nullable | left_size | right_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 | 0 | 1000 | 1000 | 182.485 us | 1.83% | 180.562 us | 3.29% | -1.923 us | -1.05% | PASS |
| I32 | 0 | 100000 | 1000 | 209.398 us | 1.35% | 209.591 us | 1.11% | 0.194 us | 0.09% | PASS |
| I32 | 0 | 10000000 | 1000 | 4.271 ms | 0.41% | 4.269 ms | 0.35% | -2.265 us | -0.05% | PASS |
| I32 | 0 | 100000 | 100000 | 240.362 us | 2.06% | 237.976 us | 1.21% | -2.386 us | -0.99% | PASS |
| I32 | 0 | 10000000 | 100000 | 5.234 ms | 0.10% | 5.242 ms | 0.12% | 7.659 us | 0.15% | FAIL |
| I32 | 0 | 10000000 | 10000000 | 9.089 ms | 0.04% | 9.072 ms | 0.06% | -17.573 us | -0.19% | FAIL |
| I32 | 1 | 1000 | 1000 | 183.076 us | 2.18% | 188.788 us | 2.69% | 5.712 us | 3.12% | FAIL |
| I32 | 1 | 100000 | 1000 | 209.663 us | 1.11% | 212.000 us | 0.94% | 2.337 us | 1.11% | FAIL |
| I32 | 1 | 10000000 | 1000 | 2.745 ms | 0.14% | 2.731 ms | 0.14% | -13.553 us | -0.49% | FAIL |
| I32 | 1 | 100000 | 100000 | 214.241 us | 1.01% | 217.728 us | 0.99% | 3.487 us | 1.63% | FAIL |
| I32 | 1 | 10000000 | 100000 | 3.127 ms | 0.11% | 3.123 ms | 0.14% | -3.526 us | -0.11% | FAIL |
| I32 | 1 | 10000000 | 10000000 | 4.287 ms | 0.10% | 4.261 ms | 0.10% | -25.713 us | -0.60% | FAIL |
| I64 | 0 | 1000 | 1000 | 189.755 us | 2.19% | 188.822 us | 2.44% | -0.933 us | -0.49% | PASS |
| I64 | 0 | 100000 | 1000 | 228.300 us | 1.95% | 227.726 us | 1.19% | -0.574 us | -0.25% | PASS |
| I64 | 0 | 10000000 | 1000 | 4.731 ms | 0.38% | 4.711 ms | 0.40% | -20.407 us | -0.43% | FAIL |
| I64 | 0 | 100000 | 100000 | 247.229 us | 0.96% | 247.642 us | 0.92% | 0.412 us | 0.17% | PASS |
| I64 | 0 | 10000000 | 100000 | 5.502 ms | 0.10% | 5.508 ms | 0.12% | 5.581 us | 0.10% | FAIL |
| I64 | 0 | 10000000 | 10000000 | 9.276 ms | 0.06% | 9.253 ms | 0.06% | -22.964 us | -0.25% | FAIL |
| I64 | 1 | 1000 | 1000 | 198.411 us | 1.46% | 191.820 us | 4.01% | -6.591 us | -3.32% | FAIL |
| I64 | 1 | 100000 | 1000 | 213.582 us | 1.40% | 214.831 us | 0.99% | 1.249 us | 0.58% | PASS |
| I64 | 1 | 10000000 | 1000 | 2.819 ms | 0.13% | 2.816 ms | 0.17% | -3.475 us | -0.12% | PASS |
| I64 | 1 | 100000 | 100000 | 217.729 us | 1.54% | 218.700 us | 1.03% | 0.971 us | 0.45% | PASS |
| I64 | 1 | 10000000 | 100000 | 3.250 ms | 0.10% | 3.264 ms | 0.09% | 13.259 us | 0.41% | FAIL |
| I64 | 1 | 10000000 | 10000000 | 4.381 ms | 0.09% | 4.350 ms | 0.10% | -31.040 us | -0.71% | FAIL |
# mixed_left_join
## [0] NVIDIA A100 80GB PCIe
| Key | Nullable | left_size | right_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 | 0 | 1000 | 1000 | 181.258 us | 1.46% | 179.720 us | 1.06% | -1.538 us | -0.85% | PASS |
| I32 | 0 | 100000 | 1000 | 211.240 us | 2.40% | 212.913 us | 2.27% | 1.673 us | 0.79% | PASS |
| I32 | 0 | 10000000 | 1000 | 4.429 ms | 0.37% | 4.430 ms | 0.42% | 0.906 us | 0.02% | PASS |
| I32 | 0 | 100000 | 100000 | 242.579 us | 1.82% | 239.933 us | 1.89% | -2.646 us | -1.09% | PASS |
| I32 | 0 | 10000000 | 100000 | 5.400 ms | 0.10% | 5.408 ms | 0.10% | 8.566 us | 0.16% | FAIL |
| I32 | 0 | 10000000 | 10000000 | 9.276 ms | 0.04% | 9.257 ms | 0.06% | -18.119 us | -0.20% | FAIL |
| I32 | 1 | 1000 | 1000 | 185.553 us | 2.45% | 185.302 us | 1.57% | -0.251 us | -0.14% | PASS |
| I32 | 1 | 100000 | 1000 | 210.679 us | 1.00% | 212.486 us | 0.88% | 1.807 us | 0.86% | PASS |
| I32 | 1 | 10000000 | 1000 | 2.832 ms | 0.11% | 2.819 ms | 0.11% | -13.265 us | -0.47% | FAIL |
| I32 | 1 | 100000 | 100000 | 215.185 us | 0.94% | 217.684 us | 0.96% | 2.499 us | 1.16% | FAIL |
| I32 | 1 | 10000000 | 100000 | 3.209 ms | 0.12% | 3.205 ms | 0.10% | -4.749 us | -0.15% | FAIL |
| I32 | 1 | 10000000 | 10000000 | 4.376 ms | 0.09% | 4.351 ms | 0.10% | -25.938 us | -0.59% | FAIL |
| I64 | 0 | 1000 | 1000 | 187.852 us | 2.39% | 186.474 us | 2.21% | -1.378 us | -0.73% | PASS |
| I64 | 0 | 100000 | 1000 | 231.821 us | 1.94% | 230.232 us | 1.98% | -1.589 us | -0.69% | PASS |
| I64 | 0 | 10000000 | 1000 | 4.823 ms | 0.39% | 4.800 ms | 0.37% | -22.515 us | -0.47% | FAIL |
| I64 | 0 | 100000 | 100000 | 246.690 us | 0.91% | 247.402 us | 0.89% | 0.713 us | 0.29% | PASS |
| I64 | 0 | 10000000 | 100000 | 5.615 ms | 0.08% | 5.621 ms | 0.09% | 5.873 us | 0.10% | FAIL |
| I64 | 0 | 10000000 | 10000000 | 9.422 ms | 0.04% | 9.400 ms | 0.05% | -21.220 us | -0.23% | FAIL |
| I64 | 1 | 1000 | 1000 | 196.076 us | 2.18% | 191.561 us | 3.88% | -4.515 us | -2.30% | FAIL |
| I64 | 1 | 100000 | 1000 | 214.409 us | 1.26% | 216.247 us | 0.99% | 1.838 us | 0.86% | PASS |
| I64 | 1 | 10000000 | 1000 | 2.909 ms | 0.11% | 2.902 ms | 0.14% | -6.777 us | -0.23% | FAIL |
| I64 | 1 | 100000 | 100000 | 218.817 us | 1.45% | 220.194 us | 1.08% | 1.376 us | 0.63% | PASS |
| I64 | 1 | 10000000 | 100000 | 3.343 ms | 0.15% | 3.357 ms | 0.12% | 14.374 us | 0.43% | FAIL |
| I64 | 1 | 10000000 | 10000000 | 4.473 ms | 0.10% | 4.444 ms | 0.08% | -28.838 us | -0.64% | FAIL |
# mixed_full_join
## [0] NVIDIA A100 80GB PCIe
| Key | Nullable | left_size | right_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 | 0 | 1000 | 1000 | 277.374 us | 3.50% | 276.258 us | 0.86% | -1.116 us | -0.40% | PASS |
| I32 | 0 | 100000 | 1000 | 271.192 us | 1.98% | 271.640 us | 1.73% | 0.448 us | 0.17% | PASS |
| I32 | 0 | 10000000 | 1000 | 4.956 ms | 0.39% | 5.098 ms | 0.30% | 141.729 us | 2.86% | FAIL |
| I32 | 0 | 100000 | 100000 | 344.457 us | 1.47% | 343.632 us | 1.46% | -0.825 us | -0.24% | PASS |
| I32 | 0 | 10000000 | 100000 | 5.563 ms | 0.09% | 5.573 ms | 0.10% | 9.202 us | 0.17% | FAIL |
| I32 | 0 | 10000000 | 10000000 | 9.921 ms | 0.05% | 9.900 ms | 0.05% | -20.580 us | -0.21% | FAIL |
| I32 | 1 | 1000 | 1000 | 282.076 us | 2.04% | 281.979 us | 1.52% | -0.097 us | -0.03% | PASS |
| I32 | 1 | 100000 | 1000 | 307.431 us | 0.83% | 309.947 us | 0.88% | 2.516 us | 0.82% | PASS |
| I32 | 1 | 10000000 | 1000 | 3.108 ms | 0.12% | 3.096 ms | 0.10% | -11.565 us | -0.37% | FAIL |
| I32 | 1 | 100000 | 100000 | 318.134 us | 1.08% | 320.594 us | 0.86% | 2.459 us | 0.77% | PASS |
| I32 | 1 | 10000000 | 100000 | 3.456 ms | 0.12% | 3.450 ms | 0.09% | -6.271 us | -0.18% | FAIL |
| I32 | 1 | 10000000 | 10000000 | 4.839 ms | 0.10% | 4.815 ms | 0.09% | -24.349 us | -0.50% | FAIL |
| I64 | 0 | 1000 | 1000 | 285.060 us | 1.55% | 280.950 us | 1.34% | -4.110 us | -1.44% | FAIL |
| I64 | 0 | 100000 | 1000 | 295.759 us | 1.58% | 292.838 us | 1.62% | -2.921 us | -0.99% | PASS |
| I64 | 0 | 10000000 | 1000 | 5.353 ms | 0.31% | 5.335 ms | 0.37% | -17.438 us | -0.33% | FAIL |
| I64 | 0 | 100000 | 100000 | 349.404 us | 0.72% | 350.908 us | 0.73% | 1.505 us | 0.43% | PASS |
| I64 | 0 | 10000000 | 100000 | 5.772 ms | 0.09% | 5.779 ms | 0.10% | 7.690 us | 0.13% | FAIL |
| I64 | 0 | 10000000 | 10000000 | 10.063 ms | 0.05% | 10.047 ms | 0.06% | -16.077 us | -0.16% | FAIL |
| I64 | 1 | 1000 | 1000 | 290.786 us | 1.61% | 288.080 us | 2.59% | -2.705 us | -0.93% | PASS |
| I64 | 1 | 100000 | 1000 | 311.609 us | 1.04% | 313.071 us | 0.92% | 1.462 us | 0.47% | PASS |
| I64 | 1 | 10000000 | 1000 | 3.207 ms | 0.10% | 3.202 ms | 0.16% | -4.379 us | -0.14% | FAIL |
| I64 | 1 | 100000 | 100000 | 321.441 us | 1.28% | 322.809 us | 0.95% | 1.369 us | 0.43% | PASS |
| I64 | 1 | 10000000 | 100000 | 3.587 ms | 0.11% | 3.601 ms | 0.12% | 14.183 us | 0.40% | FAIL |
| I64 | 1 | 10000000 | 10000000 | 4.935 ms | 0.08% | 4.906 ms | 0.09% | -28.230 us | -0.57% | FAIL |
# mixed_left_semi_join
## [0] NVIDIA A100 80GB PCIe
| Key | Nullable | left_size | right_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
| I32 | 0 | 1000 | 1000 | 163.941 us | 1.35% | 164.092 us | 1.08% | 0.151 us | 0.09% | PASS |
| I32 | 0 | 100000 | 1000 | 186.297 us | 1.72% | 187.596 us | 1.02% | 1.299 us | 0.70% | PASS |
| I32 | 0 | 10000000 | 1000 | 1.890 ms | 0.21% | 1.888 ms | 0.12% | -2.544 us | -0.13% | FAIL |
| I32 | 0 | 100000 | 100000 | 216.653 us | 0.97% | 209.246 us | 1.01% | -7.407 us | -3.42% | FAIL |
| I32 | 0 | 10000000 | 100000 | 2.206 ms | 0.12% | 2.187 ms | 0.12% | -19.241 us | -0.87% | FAIL |
| I32 | 0 | 10000000 | 10000000 | 6.849 ms | 0.05% | 5.866 ms | 0.07% | -983.054 us | -14.35% | FAIL |
| I32 | 1 | 1000 | 1000 | 178.305 us | 1.36% | 181.066 us | 1.69% | 2.761 us | 1.55% | FAIL |
| I32 | 1 | 100000 | 1000 | 196.463 us | 2.28% | 199.260 us | 2.01% | 2.796 us | 1.42% | PASS |
| I32 | 1 | 10000000 | 1000 | 1.469 ms | 0.29% | 1.456 ms | 0.32% | -13.259 us | -0.90% | FAIL |
| I32 | 1 | 100000 | 100000 | 223.007 us | 1.11% | 217.832 us | 1.04% | -5.175 us | -2.32% | FAIL |
| I32 | 1 | 10000000 | 100000 | 1.518 ms | 0.18% | 1.500 ms | 0.18% | -17.546 us | -1.16% | FAIL |
| I32 | 1 | 10000000 | 10000000 | 4.611 ms | 0.08% | 3.620 ms | 0.09% | -991.670 us | -21.50% | FAIL |
| I64 | 0 | 1000 | 1000 | 167.839 us | 1.14% | 167.587 us | 1.04% | -0.252 us | -0.15% | PASS |
| I64 | 0 | 100000 | 1000 | 190.487 us | 2.07% | 189.719 us | 0.98% | -0.768 us | -0.40% | PASS |
| I64 | 0 | 10000000 | 1000 | 2.076 ms | 0.21% | 2.055 ms | 0.11% | -21.682 us | -1.04% | FAIL |
| I64 | 0 | 100000 | 100000 | 224.296 us | 1.85% | 212.986 us | 0.97% | -11.311 us | -5.04% | FAIL |
| I64 | 0 | 10000000 | 100000 | 2.353 ms | 0.16% | 2.329 ms | 0.17% | -23.743 us | -1.01% | FAIL |
| I64 | 0 | 10000000 | 10000000 | 7.075 ms | 0.06% | 6.109 ms | 0.06% | -965.622 us | -13.65% | FAIL |
| I64 | 1 | 1000 | 1000 | 186.398 us | 1.55% | 181.063 us | 2.09% | -5.335 us | -2.86% | FAIL |
| I64 | 1 | 100000 | 1000 | 202.998 us | 1.05% | 198.829 us | 1.95% | -4.170 us | -2.05% | FAIL |
| I64 | 1 | 10000000 | 1000 | 1.415 ms | 0.18% | 1.407 ms | 0.29% | -8.203 us | -0.58% | FAIL |
| I64 | 1 | 100000 | 100000 | 223.472 us | 1.04% | 217.483 us | 0.95% | -5.989 us | -2.68% | FAIL |
| I64 | 1 | 10000000 | 100000 | 1.554 ms | 0.12% | 1.549 ms | 0.13% | -4.402 us | -0.28% | FAIL |
| I64 | 1 | 10000000 | 10000000 | 4.705 ms | 0.08% | 3.695 ms | 0.09% | -1010.020 us | -21.47% | FAIL |
# mixed_left_anti_join
## [0] NVIDIA A100 80GB PCIe
| Key | Nullable | left_size | right_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
| I32 | 0 | 1000 | 1000 | 163.817 us | 1.10% | 164.318 us | 1.00% | 0.501 us | 0.31% | PASS |
| I32 | 0 | 100000 | 1000 | 186.833 us | 1.14% | 187.803 us | 1.32% | 0.970 us | 0.52% | PASS |
| I32 | 0 | 10000000 | 1000 | 1.899 ms | 0.14% | 1.895 ms | 0.12% | -4.678 us | -0.25% | FAIL |
| I32 | 0 | 100000 | 100000 | 216.741 us | 0.94% | 209.305 us | 0.99% | -7.436 us | -3.43% | FAIL |
| I32 | 0 | 10000000 | 100000 | 2.214 ms | 0.13% | 2.194 ms | 0.12% | -19.464 us | -0.88% | FAIL |
| I32 | 0 | 10000000 | 10000000 | 6.857 ms | 0.07% | 5.872 ms | 0.06% | -984.544 us | -14.36% | FAIL |
| I32 | 1 | 1000 | 1000 | 178.700 us | 1.35% | 181.034 us | 3.25% | 2.334 us | 1.31% | PASS |
| I32 | 1 | 100000 | 1000 | 197.369 us | 2.02% | 198.717 us | 1.93% | 1.348 us | 0.68% | PASS |
| I32 | 1 | 10000000 | 1000 | 1.480 ms | 0.33% | 1.465 ms | 0.33% | -15.310 us | -1.03% | FAIL |
| I32 | 1 | 100000 | 100000 | 223.360 us | 1.14% | 217.979 us | 1.48% | -5.382 us | -2.41% | FAIL |
| I32 | 1 | 10000000 | 100000 | 1.526 ms | 0.21% | 1.509 ms | 0.17% | -16.891 us | -1.11% | FAIL |
| I32 | 1 | 10000000 | 10000000 | 4.621 ms | 0.09% | 3.628 ms | 0.09% | -993.276 us | -21.49% | FAIL |
| I64 | 0 | 1000 | 1000 | 167.984 us | 1.27% | 167.400 us | 1.20% | -0.585 us | -0.35% | PASS |
| I64 | 0 | 100000 | 1000 | 191.022 us | 2.15% | 190.883 us | 1.14% | -0.138 us | -0.07% | PASS |
| I64 | 0 | 10000000 | 1000 | 2.083 ms | 0.22% | 2.063 ms | 0.11% | -20.791 us | -1.00% | FAIL |
| I64 | 0 | 100000 | 100000 | 224.755 us | 1.82% | 212.448 us | 1.05% | -12.307 us | -5.48% | FAIL |
| I64 | 0 | 10000000 | 100000 | 2.360 ms | 0.18% | 2.335 ms | 0.12% | -24.477 us | -1.04% | FAIL |
| I64 | 0 | 10000000 | 10000000 | 7.081 ms | 0.05% | 6.118 ms | 0.11% | -962.945 us | -13.60% | FAIL |
| I64 | 1 | 1000 | 1000 | 186.437 us | 1.73% | 181.559 us | 1.92% | -4.878 us | -2.62% | FAIL |
| I64 | 1 | 100000 | 1000 | 203.248 us | 1.12% | 199.537 us | 2.02% | -3.711 us | -1.83% | FAIL |
| I64 | 1 | 10000000 | 1000 | 1.423 ms | 0.19% | 1.417 ms | 0.31% | -6.537 us | -0.46% | FAIL |
| I64 | 1 | 100000 | 100000 | 223.797 us | 1.10% | 217.751 us | 0.93% | -6.046 us | -2.70% | FAIL |
| I64 | 1 | 10000000 | 100000 | 1.562 ms | 0.17% | 1.559 ms | 0.15% | -3.259 us | -0.21% | FAIL |
| I64 | 1 | 10000000 | 10000000 | 4.714 ms | 0.08% | 3.704 ms | 0.11% | -1009.649 us | -21.42% | FAIL |
# distinct_inner_join
## [0] NVIDIA A100 80GB PCIe
| Key | Nullable | left_size | right_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|-----------|---------|----------|
| I32 | 0 | 1000 | 1000 | 76.097 us | 2.56% | 75.916 us | 1.83% | -0.181 us | -0.24% | PASS |
| I32 | 0 | 100000 | 1000 | 83.753 us | 2.79% | 84.174 us | 1.28% | 0.421 us | 0.50% | PASS |
| I32 | 0 | 10000000 | 1000 | 1.066 ms | 0.29% | 1.065 ms | 0.13% | -1.073 us | -0.10% | PASS |
| I32 | 0 | 100000 | 100000 | 100.645 us | 1.82% | 99.474 us | 2.91% | -1.171 us | -1.16% | PASS |
| I32 | 0 | 10000000 | 100000 | 1.035 ms | 0.18% | 1.033 ms | 0.27% | -1.659 us | -0.16% | PASS |
| I32 | 0 | 10000000 | 10000000 | 3.761 ms | 0.09% | 3.763 ms | 0.09% | 2.090 us | 0.06% | PASS |
| I32 | 1 | 1000 | 1000 | 85.160 us | 2.51% | 86.673 us | 3.48% | 1.513 us | 1.78% | PASS |
| I32 | 1 | 100000 | 1000 | 92.722 us | 1.71% | 93.227 us | 1.50% | 0.505 us | 0.54% | PASS |
| I32 | 1 | 10000000 | 1000 | 530.329 us | 0.28% | 537.004 us | 0.30% | 6.675 us | 1.26% | FAIL |
| I32 | 1 | 100000 | 100000 | 92.981 us | 1.59% | 93.298 us | 1.49% | 0.317 us | 0.34% | PASS |
| I32 | 1 | 10000000 | 100000 | 587.844 us | 0.30% | 589.881 us | 0.25% | 2.037 us | 0.35% | FAIL |
| I32 | 1 | 10000000 | 10000000 | 1.239 ms | 0.27% | 1.238 ms | 0.29% | -1.266 us | -0.10% | PASS |
| I64 | 0 | 1000 | 1000 | 75.696 us | 1.47% | 75.792 us | 1.38% | 0.096 us | 0.13% | PASS |
| I64 | 0 | 100000 | 1000 | 84.752 us | 1.37% | 85.872 us | 1.24% | 1.120 us | 1.32% | FAIL |
| I64 | 0 | 10000000 | 1000 | 1.103 ms | 0.30% | 1.104 ms | 0.18% | 0.672 us | 0.06% | PASS |
| I64 | 0 | 100000 | 100000 | 98.002 us | 3.82% | 98.746 us | 3.56% | 0.744 us | 0.76% | PASS |
| I64 | 0 | 10000000 | 100000 | 1.059 ms | 0.33% | 1.061 ms | 0.36% | 1.604 us | 0.15% | PASS |
| I64 | 0 | 10000000 | 10000000 | 3.789 ms | 0.10% | 3.790 ms | 0.08% | 0.697 us | 0.02% | PASS |
| I64 | 1 | 1000 | 1000 | 84.873 us | 2.24% | 85.373 us | 1.81% | 0.500 us | 0.59% | PASS |
| I64 | 1 | 100000 | 1000 | 93.659 us | 1.96% | 94.275 us | 1.55% | 0.616 us | 0.66% | PASS |
| I64 | 1 | 10000000 | 1000 | 547.495 us | 0.57% | 550.995 us | 0.26% | 3.500 us | 0.64% | FAIL |
| I64 | 1 | 100000 | 100000 | 93.073 us | 1.67% | 93.745 us | 1.55% | 0.671 us | 0.72% | PASS |
| I64 | 1 | 10000000 | 100000 | 598.590 us | 0.65% | 600.007 us | 0.41% | 1.417 us | 0.24% | PASS |
| I64 | 1 | 10000000 | 10000000 | 1.258 ms | 0.28% | 1.259 ms | 0.26% | 1.130 us | 0.09% | PASS |
# distinct_left_join
## [0] NVIDIA A100 80GB PCIe
| Key | Nullable | left_size | right_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|-----------|---------|----------|
| I32 | 0 | 1000 | 1000 | 57.247 us | 1.87% | 57.142 us | 1.89% | -0.106 us | -0.18% | PASS |
| I32 | 0 | 100000 | 1000 | 60.621 us | 1.74% | 60.351 us | 1.62% | -0.270 us | -0.44% | PASS |
| I32 | 0 | 10000000 | 1000 | 772.056 us | 0.32% | 770.681 us | 0.19% | -1.376 us | -0.18% | PASS |
| I32 | 0 | 100000 | 100000 | 72.717 us | 1.99% | 72.192 us | 1.33% | -0.526 us | -0.72% | PASS |
| I32 | 0 | 10000000 | 100000 | 735.295 us | 0.15% | 734.951 us | 0.15% | -0.344 us | -0.05% | PASS |
| I32 | 0 | 10000000 | 10000000 | 3.313 ms | 0.09% | 3.315 ms | 0.08% | 1.946 us | 0.06% | PASS |
| I32 | 1 | 1000 | 1000 | 66.978 us | 1.88% | 67.808 us | 1.77% | 0.829 us | 1.24% | PASS |
| I32 | 1 | 100000 | 1000 | 68.109 us | 1.85% | 68.580 us | 1.78% | 0.471 us | 0.69% | PASS |
| I32 | 1 | 10000000 | 1000 | 322.185 us | 0.34% | 323.048 us | 0.35% | 0.863 us | 0.27% | PASS |
| I32 | 1 | 100000 | 100000 | 70.722 us | 1.80% | 71.232 us | 1.76% | 0.510 us | 0.72% | PASS |
| I32 | 1 | 10000000 | 100000 | 381.917 us | 0.35% | 382.202 us | 0.36% | 0.286 us | 0.07% | PASS |
| I32 | 1 | 10000000 | 10000000 | 1.030 ms | 0.36% | 1.029 ms | 0.23% | -1.579 us | -0.15% | PASS |
| I64 | 0 | 1000 | 1000 | 55.762 us | 1.78% | 55.070 us | 1.79% | -0.693 us | -1.24% | PASS |
| I64 | 0 | 100000 | 1000 | 59.585 us | 1.60% | 59.233 us | 1.52% | -0.352 us | -0.59% | PASS |
| I64 | 0 | 10000000 | 1000 | 794.808 us | 0.16% | 795.858 us | 0.16% | 1.050 us | 0.13% | PASS |
| I64 | 0 | 100000 | 100000 | 73.336 us | 2.07% | 72.852 us | 1.85% | -0.485 us | -0.66% | PASS |
| I64 | 0 | 10000000 | 100000 | 750.184 us | 0.17% | 749.326 us | 0.20% | -0.858 us | -0.11% | PASS |
| I64 | 0 | 10000000 | 10000000 | 3.333 ms | 0.08% | 3.333 ms | 0.07% | 0.393 us | 0.01% | PASS |
| I64 | 1 | 1000 | 1000 | 66.907 us | 1.83% | 66.763 us | 1.78% | -0.144 us | -0.22% | PASS |
| I64 | 1 | 100000 | 1000 | 67.905 us | 1.77% | 68.871 us | 1.80% | 0.966 us | 1.42% | PASS |
| I64 | 1 | 10000000 | 1000 | 336.761 us | 0.35% | 336.860 us | 0.41% | 0.099 us | 0.03% | PASS |
| I64 | 1 | 100000 | 100000 | 71.858 us | 1.87% | 72.272 us | 1.79% | 0.414 us | 0.58% | PASS |
| I64 | 1 | 10000000 | 100000 | 395.118 us | 0.39% | 396.264 us | 0.32% | 1.147 us | 0.29% | PASS |
| I64 | 1 | 10000000 | 10000000 | 1.045 ms | 0.21% | 1.046 ms | 0.21% | 0.819 us | 0.08% | PASS |
# Summary
- Total Matches: 240
- Pass (diff <= min_noise): 99
- Unknown (infinite noise): 0
- Failure (diff > min_noise): 141
/ok to test
I think that, as the PR currently stands, it should have the breaking label. The type for murmur_device_row_hasher has to be modified in spark-rapids-jni.
/ok to test
/ok to test
/ok to test
/ok to test
/ok to test
/ok to test
/ok to test
@tgujar can you please resolve the merge conflicts against ToT? The build time still appears to be an issue, but we need a successful CI run to confirm.
Unsure how to handle this. https://github.com/rapidsai/cudf/pull/16603 says that we would like the launch and the compilation to happen in the same TU for CUDA whole compilation mode. In this PR's case, that means all the kernel instantiations would happen in the same TU. However, this PR splits the instantiations to reduce compilation time for the mixed semi join kernels. I don't think having multiple launch functions would be good design.
You should be able to follow the updated pattern seen in cpp/src/join/mixed_join_kernel_nulls.cu, cpp/src/join/mixed_join_kernel.cu, cpp/src/join/mixed_join_kernel.cuh, and cpp/src/join/mixed_join_kernel.hpp.
That restructuring gives us separate TUs for the mixed join kernel based on the nullability of the input. This was done by giving the intermediate host launch code a specialization in each TU.