cudf
List lexicographic comparator
Contributes to #10186
This PR enables lexicographic comparisons between list columns. The comparison is robust to arbitrary levels of nesting, but does not yet support lists of (lists of...) structs. The comparison is based on the Dremel encoding already in use in the Parquet file format. To assist reviewers, here is a reasonably complete list of the changes in this PR:
- A helper function to get per-column Dremel data (for list columns) when constructing a preprocessed table, which now owns the Dremel data.
- Updating the set of lexicographically compatible columns to include list columns, as long as they do not contain any nested structs.
- An implementation of `lexicographic::device_row_comparator::operator()` for lists. This function is the heart of the change to enable comparisons between list columns.
- A new benchmark for sorting that uses list data.
- An update to a preexisting rolling collect set test that previously failed (because it requires list comparison) but now works.
- New tests for list comparison.
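To make the intended semantics concrete, here is a minimal host-side sketch (plain C++, not libcudf code) of the ordering the PR implements on device: rows of a list column compare lexicographically, element by element, and a list that is a strict prefix of another sorts first.

```cpp
#include <algorithm>
#include <vector>

// Toy illustration of lexicographic list ordering; the PR implements the
// equivalent comparison on the GPU via Dremel levels rather than nested
// vectors.
bool list_less(std::vector<std::vector<int>> const& lhs,
               std::vector<std::vector<int>> const& rhs)
{
  // std::lexicographical_compare recurses naturally because
  // std::vector itself compares lexicographically.
  return std::lexicographical_compare(
    lhs.begin(), lhs.end(), rhs.begin(), rhs.end());
}
```

For example, `{{1,2},{3}}` sorts before `{{1,2},{4}}`, and `{{1,2}}` sorts before `{{1,2},{0}}` because it is a prefix.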
@bdice @ttnghia I think this is ready for another round of review now. I believe that I've addressed the review requests.
Codecov Report
:exclamation: No coverage uploaded for pull request base (`branch-22.10@03f3543`). Click here to learn what that means. Patch has no changes to coverable lines.
Additional details and impacted files
@@ Coverage Diff @@
## branch-22.10 #11129 +/- ##
===============================================
Coverage ? 86.41%
===============================================
Files ? 145
Lines ? 23000
Branches ? 0
===============================================
Hits ? 19876
Misses ? 3124
Partials ? 0
:umbrella: View full report at Codecov.
I wanted to post a couple nice links about Dremel encoding, since this is a core part of this PR.
Summary of the encoding with nice diagrams: https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet Original paper: https://research.google/pubs/pub36632/
The short description is that repetition levels encode hierarchy (nested depth), and definition levels encode nullity.
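The idea can be shown with a deliberately simplified encoder (this is not libcudf's `get_dremel_data`; the real scheme also distinguishes nulls at each nesting depth). For a single non-nullable `LIST<INT32>` column: repetition level 0 starts a new row and 1 continues the current list (the hierarchy), while definition level 1 marks a present value and 0 an empty list (standing in for the nullity/emptiness information definition levels carry).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy single-level Dremel encoder for LIST<INT32> with no nulls.
struct level {
  int rep;  // 0 = first value of a row, 1 = continuation of the list
  int def;  // 1 = value present, 0 = empty list
};

std::vector<level> encode(std::vector<std::vector<int32_t>> const& rows)
{
  std::vector<level> out;
  for (auto const& row : rows) {
    if (row.empty()) {
      out.push_back({0, 0});  // empty row still emits one level entry
      continue;
    }
    for (std::size_t i = 0; i < row.size(); ++i) {
      out.push_back({i == 0 ? 0 : 1, 1});
    }
  }
  return out;
}
```

Encoding `{{1,2}, {}, {3}}` yields the level pairs `(0,1) (1,1) (0,0) (0,1)`: row boundaries are recoverable from rep level 0, and the empty list from def level 0.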
I haven't reviewed all the code yet. Given the time constraint, and that no C++ reviewer has approved yet, I'm a bit worried about whether we can merge this by tomorrow, since a careful review may take several more iterations. Although the code is already good, there may be some nits and optimizations that need to be addressed.
I'm totally fine with merging this first and polishing later in a follow-up PR, provided that all unit tests in this PR pass. If others agree, let's concentrate the remaining time on improving test coverage. I understand that this is a bit risky, so it may or may not be accepted. In the meantime, I'll continue reviewing the code.
> I wanted to post a couple nice links about Dremel encoding, since this is a core part of this PR.
> Summary of the encoding with nice diagrams: https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet Original paper: https://research.google/pubs/pub36632/
> The short description is that repetition levels encode hierarchy (nested depth), and definition levels encode nullity.
@bdice FYI I tossed a few similar links into the docstring for get_dremel_data.
To summarize the results of our offline discussion:
- We are going to push this PR to 22.10 to avoid any unforeseen performance implications for other code paths that are already using `experimental::lexicographic::[self|two_table]_comparator` (e.g. struct binaryops).
- All refactoring associated with code in dremel.hpp or dremel.cu will be done in a separate PR after this PR is merged. I will leave any review comments on this PR requesting that kind of refactoring open so that those tasks can be addressed and the conversations resolved later (for tracking purposes).
- The primary remaining focus of this PR is adding more tests to validate the behavior of this code on edge cases and benchmarking thoroughly to understand the performance implications of these changes.
Here are the benchmark results for sorting structs:
Benchmarks
| NumRows | Depth | Nulls | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-----------|---------|---------|------------|-------------|------------|-------------|------------|---------|----------|
| 2^10 | 0 | 0 | 63.517 us | 73.30% | 48.075 us | 2.06% | -15.442 us | -24.31% | FAIL |
| 2^18 | 0 | 0 | 171.454 us | 1.24% | 177.255 us | 1.14% | 5.800 us | 3.38% | FAIL |
| 2^26 | 0 | 0 | 27.694 ms | 0.06% | 27.711 ms | 0.12% | 16.778 us | 0.06% | FAIL |
| 2^10 | 1 | 0 | 284.587 us | 12.50% | 361.627 us | 0.35% | 77.040 us | 27.07% | FAIL |
| 2^18 | 1 | 0 | 1.196 ms | 0.29% | 2.164 ms | 1.76% | 967.938 us | 80.95% | FAIL |
| 2^26 | 1 | 0 | 332.593 ms | 0.70% | 660.864 ms | 0.33% | 328.271 ms | 98.70% | FAIL |
| 2^10 | 8 | 0 | 668.827 us | 0.18% | 668.484 us | 0.35% | -0.343 us | -0.05% | PASS |
| 2^18 | 8 | 0 | 3.317 ms | 0.26% | 4.268 ms | 1.43% | 951.289 us | 28.68% | FAIL |
| 2^26 | 8 | 0 | 830.883 ms | 0.26% | 1.212 s | 0.34% | 381.604 ms | 45.93% | FAIL |
| 2^10 | 0 | 1 | 82.359 us | 0.82% | 81.364 us | 0.94% | -0.995 us | -1.21% | FAIL |
| 2^18 | 0 | 1 | 485.130 us | 0.86% | 504.918 us | 6.73% | 19.787 us | 4.08% | FAIL |
| 2^26 | 0 | 1 | 188.069 ms | 0.57% | 195.386 ms | 1.01% | 7.318 ms | 3.89% | FAIL |
| 2^10 | 1 | 1 | 299.367 us | 0.36% | 377.542 us | 0.39% | 78.175 us | 26.11% | FAIL |
| 2^18 | 1 | 1 | 1.215 ms | 0.48% | 2.188 ms | 2.20% | 972.830 us | 80.06% | FAIL |
| 2^26 | 1 | 1 | 339.462 ms | 0.34% | 667.672 ms | 0.50% | 328.210 ms | 96.69% | FAIL |
| 2^10 | 8 | 1 | 680.116 us | 4.05% | 682.444 us | 0.37% | 2.328 us | 0.34% | PASS |
| 2^18 | 8 | 1 | 3.304 ms | 0.46% | 4.293 ms | 1.14% | 988.216 us | 29.91% | FAIL |
| 2^26 | 8 | 1 | 839.208 ms | 0.27% | 1.209 s | 0.15% | 369.619 ms | 44.04% | FAIL |
and for sorting primitive types:
Benchmarks
Comparing benches/SORT_BENCH_branch-22.10.json to benches/SORT_BENCH_list_lex_branch.json
Benchmark Time CPU Time Old Time New CPU Old CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------
Sort<false>/unstable_no_nulls/1024/1/manual_time -0.0065 -0.0051 0 0 0 0
Sort<false>/unstable_no_nulls/4096/1/manual_time +0.0020 +0.0003 0 0 0 0
Sort<false>/unstable_no_nulls/32768/1/manual_time +0.0031 +0.0020 0 0 0 0
Sort<false>/unstable_no_nulls/262144/1/manual_time -0.0026 -0.0016 0 0 0 0
Sort<false>/unstable_no_nulls/2097152/1/manual_time +0.0008 +0.0006 1 1 1 1
Sort<false>/unstable_no_nulls/16777216/1/manual_time +0.0014 +0.0014 7 7 7 7
Sort<false>/unstable_no_nulls/67108864/1/manual_time +0.0004 +0.0004 28 28 28 28
Sort<false>/unstable_no_nulls/1024/8/manual_time +0.3182 +0.3066 1 1 1 1
Sort<false>/unstable_no_nulls/4096/8/manual_time +0.6280 +0.6121 1 2 1 2
Sort<false>/unstable_no_nulls/32768/8/manual_time +0.8153 +0.7980 1 2 1 2
Sort<false>/unstable_no_nulls/262144/8/manual_time +0.8925 +0.8856 3 6 3 6
Sort<false>/unstable_no_nulls/2097152/8/manual_time +0.4395 +0.4393 53 76 53 76
Sort<false>/unstable_no_nulls/16777216/8/manual_time +0.3813 +0.3813 673 929 673 929
Sort<false>/unstable_no_nulls/67108864/8/manual_time +0.3413 +0.3413 3506 4703 3506 4702
Sort<true>/stable_no_nulls/1024/1/manual_time +0.0045 +0.0006 0 0 0 0
Sort<true>/stable_no_nulls/4096/1/manual_time -0.0028 -0.0036 0 0 0 0
Sort<true>/stable_no_nulls/32768/1/manual_time -0.0003 -0.0008 0 0 0 0
Sort<true>/stable_no_nulls/262144/1/manual_time -0.0027 -0.0014 0 0 0 0
Sort<true>/stable_no_nulls/2097152/1/manual_time +0.0010 +0.0010 1 1 1 1
Sort<true>/stable_no_nulls/16777216/1/manual_time +0.0002 +0.0002 7 7 7 7
Sort<true>/stable_no_nulls/67108864/1/manual_time +0.0006 +0.0006 28 28 28 28
Sort<true>/stable_no_nulls/1024/8/manual_time +0.3178 +0.3065 1 1 1 1
Sort<true>/stable_no_nulls/4096/8/manual_time +0.6353 +0.6198 1 2 1 2
Sort<true>/stable_no_nulls/32768/8/manual_time +0.8233 +0.8060 1 2 1 2
Sort<true>/stable_no_nulls/262144/8/manual_time +0.8936 +0.8867 3 6 3 6
Sort<true>/stable_no_nulls/2097152/8/manual_time +0.4424 +0.4422 53 77 53 77
Sort<true>/stable_no_nulls/16777216/8/manual_time +0.3858 +0.3858 673 932 673 932
Sort<true>/stable_no_nulls/67108864/8/manual_time +0.3435 +0.3435 3507 4711 3507 4711
Sort<false>/unstable/1024/1/manual_time -0.0004 -0.0007 0 0 0 0
Sort<false>/unstable/4096/1/manual_time +0.0124 +0.0109 0 0 0 0
Sort<false>/unstable/32768/1/manual_time +0.0381 +0.0337 0 0 0 0
Sort<false>/unstable/262144/1/manual_time -0.0168 -0.0160 0 0 1 1
Sort<false>/unstable/2097152/1/manual_time -0.0183 -0.0183 5 5 5 5
Sort<false>/unstable/16777216/1/manual_time -0.0119 -0.0119 44 43 44 43
Sort<false>/unstable/67108864/1/manual_time -0.0147 -0.0147 182 180 183 180
Sort<false>/unstable/1024/8/manual_time +0.3256 +0.3155 1 1 1 1
Sort<false>/unstable/4096/8/manual_time +0.5155 +0.5053 1 2 1 2
Sort<false>/unstable/32768/8/manual_time +0.6511 +0.6408 2 3 2 3
Sort<false>/unstable/262144/8/manual_time +0.7431 +0.7385 4 7 4 7
Sort<false>/unstable/2097152/8/manual_time +0.4393 +0.4391 61 87 61 87
Sort<false>/unstable/16777216/8/manual_time +0.3519 +0.3519 875 1183 875 1183
Sort<false>/unstable/67108864/8/manual_time +0.2847 +0.2847 5017 6446 5017 6445
Sort<true>/stable/1024/1/manual_time +0.0004 -0.0023 0 0 0 0
Sort<true>/stable/4096/1/manual_time +0.0094 +0.0086 0 0 0 0
Sort<true>/stable/32768/1/manual_time +0.0369 +0.0323 0 0 0 0
Sort<true>/stable/262144/1/manual_time -0.0108 -0.0106 1 0 1 1
Sort<true>/stable/2097152/1/manual_time -0.0152 -0.0152 6 5 6 5
Sort<true>/stable/16777216/1/manual_time -0.0138 -0.0138 44 44 45 44
Sort<true>/stable/67108864/1/manual_time -0.0112 -0.0112 184 182 184 182
Sort<true>/stable/1024/8/manual_time +0.3161 +0.3064 1 1 1 1
Sort<true>/stable/4096/8/manual_time +0.5174 +0.5076 1 2 1 2
Sort<true>/stable/32768/8/manual_time +0.6486 +0.6384 2 3 2 3
Sort<true>/stable/262144/8/manual_time +0.7375 +0.7330 4 7 4 7
Sort<true>/stable/2097152/8/manual_time +0.4439 +0.4437 61 88 61 88
Sort<true>/stable/16777216/8/manual_time +0.3558 +0.3558 876 1187 875 1187
Sort<true>/stable/67108864/8/manual_time +0.2855 +0.2855 5021 6455 5021 6454
Sort/strings/1024/manual_time -0.0066 -0.0066 1 1 1 1
Sort/strings/4096/manual_time +0.0040 +0.0039 2 2 2 2
Sort/strings/32768/manual_time -0.0001 -0.0001 2 2 2 2
Sort/strings/262144/manual_time -0.0069 -0.0069 8 8 8 8
Sort/strings/2097152/manual_time -0.0068 -0.0068 95 94 95 94
Sort/strings/16777216/manual_time -0.0089 -0.0089 862 854 862 854
~I think we can safely say that primitive types are unaffected~, but the struct results certainly look suspicious. This merits further investigation, since I could easily have done something wrong, but for now I'm glad we're not pushing this through.
EDIT: The claim about primitive types is completely bogus. I don't know what I did the first time around (probably forgot to recompile), but based on multiple checks, primitive type performance definitely is affected.
It does indeed look like simply commenting out the `list_view` specialization of the code makes the struct code path faster. The overhead of constructing the Dremel data is negligible; I can reproduce this with just commenting out that specialization. I'll investigate further to see if there are inlining issues.
> It does indeed look like simply commenting out the `list_view` specialization of the code makes the struct code path faster. The overhead of constructing the Dremel data is negligible; I can reproduce this with just commenting out that specialization. I'll investigate further to see if there are inlining issues.
I'm guessing that code path is inflating your register usage and you're falling off an occupancy ledge.
> It does indeed look like simply commenting out the `list_view` specialization of the code makes the struct code path faster. The overhead of constructing the Dremel data is negligible; I can reproduce this with just commenting out that specialization. I'll investigate further to see if there are inlining issues.
>
> I'm guessing that code path is inflating your register usage and you're falling off an occupancy ledge.
Yeah, that's part of what I'm seeing: it balloons register usage by about 30-40 registers and also substantially increases stack usage. Since this all runs through thrust algorithms, and given the nested structure of the comparators, it's tricky to test approaches to alleviate the issue (and of course we can't prevent it entirely). We can't control the launch bounds of the kernel directly on a thrust invocation. Furthermore, since we see both increased register and stack usage, any change around force-inlining is going to trade one effect against the other unless the compiler inlines code in a way that simultaneously reduces the stack and allows more efficient reuse of registers. I'm hunting for whether more data could be passed as const to help the compiler reduce the amount of local state, and for ways to rewrite the internals of the operator that give the compiler more hints about how much local state really needs to be maintained.
I spent some time today investigating different compilations and became very suspicious of how my benchmarks could possibly have shown primitive types as unaffected when every version of the kernel looked much more resource-hungry than before. After rerunning the benchmarks multiple times today, it's clear that something was wrong with how I ran them the first time. The new code path definitely adversely affects performance for primitive types. I'm not sure what I did the first time around, but this result makes more sense given that the complexity of the kernel is higher by default for all types. I'm going to do some cleanup to address the outstanding comments while I brainstorm potential solutions, but we definitely do not want to merge this in its current form.
To minimize the amount of friction between this PR, ongoing Parquet development, and the desire to refactor Dremel encoding code, I have separated out the Dremel changes into #11461. We can review and merge that separately, then merge those changes down into this PR so that we have a minimal relevant changeset here to evaluate.
> The new code path definitely adversely affects performance for primitive types.
How much are we talking?
The comparison results are in this earlier comment. I always put benchmark results into dropdowns that look like
Benchmarks
Benchmarks go here
because I find GH PR conversations filled with benchmark results too cluttered to be readable, but maybe that just hides them entirely. Did you (or any other reviewers) even notice that it was there?
So I've tried giving the compiler more explicit hints that the list and struct comparator code paths never call into each other (e.g. by separating the primitive comparison logic into its own type so that no branch of the `element_comparator` ever creates other `element_comparator` instances), but no luck so far. It seems like `dispatch_void_if_nested` is doing its job. I'm now working through review comments and simplifying the code in hopes of reducing register usage that way, but haven't had much luck either. The total complexity of the device function just seems too high now.
Our best option at the moment seems to be entirely separating the code paths for nested and non-nested data. This approach works, but has a number of downsides:
- We will have two separate types of comparators, and it will be up to calling code to decide whether to instantiate the one that supports nested types.
- The lexicographic comparators will now diverge from our equality comparators, since there won't be a single class anymore.
- Calling code will have to determine which comparator to use beforehand, so anything using these comparators becomes a little more complex.
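To sketch what that split looks like in practice (all names here are invented for illustration, not the final libcudf API): the device comparator becomes a template parameterized on whether the table contains nested columns, and every caller must branch once, up front, to pick the instantiation.

```cpp
// Hypothetical shape of the split-comparator design. The nested
// instantiation is where the list/Dremel traversal would live; the
// non-nested one never compiles that code path in, keeping its register
// footprint small.
template <bool has_nested>
struct row_comparator {
  bool operator()(int lhs, int rhs) const
  {
    if constexpr (has_nested) {
      // list/struct traversal would go here in the real comparator
      return lhs < rhs;
    } else {
      return lhs < rhs;
    }
  }
};

// The extra decision every caller now has to make before launching work:
bool rows_less(bool table_has_nested, int lhs, int rhs)
{
  return table_has_nested ? row_comparator<true>{}(lhs, rhs)
                          : row_comparator<false>{}(lhs, rhs);
}
```

The cost named in the bullets above is visible here: the runtime branch lives in calling code, and the two instantiations are free to diverge.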
I would have been fine with this approach if it were clearly temporary, but at the moment I don't have an alternative solution to propose, so this wouldn't be a temporary patch awaiting some slightly more complex and time-consuming fix. If we go with the split-comparator approach, I think we have to assume it's permanent, and if someone comes up with an alternative in the future we can consider ourselves lucky. That makes me significantly more uneasy.
I'm afraid that I wasn't quite able to get through this today. There are two main outstanding tasks here:
- Split the `device_row_comparator` into two implementations, one that supports nesting and one that does not. This also needs to propagate to the `less` method of the `self_comparator` and the `two_table_comparator`.
- Find a better way to pass along the `dremel_device_view` vector that handles the non-list columns. Currently non-list columns default-construct the struct, which places it in an invalid state, and we just rely on the comparator never dereferencing those elements of the vector. We should probably use something like a `thrust::optional` here, but I had issues getting `device_span` and `thrust::optional` to play nicely together.
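The second point can be sketched with `std::optional` standing in for `thrust::optional` (the struct below is a stub, not the real `dremel_device_view`): non-list columns hold an empty optional instead of a default-constructed, invalid view, so "does this column have Dremel data?" becomes an explicit query rather than an implicit contract the comparator must honor.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Stand-in for the real dremel_device_view; only list columns get one.
struct dremel_view_stub {
  int const* rep_levels;
  int const* def_levels;
};

// One slot per column; non-list columns stay empty instead of invalid.
using per_column_dremel = std::vector<std::optional<dremel_view_stub>>;

bool column_has_dremel(per_column_dremel const& views, std::size_t col)
{
  return views[col].has_value();
}
```

A misplaced dereference of an empty optional then fails loudly (or is caught by `has_value()` checks) rather than silently reading garbage levels.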
I plan to pick this back up this week.
OK, I've managed to resolve the two major concerns above.
Here are the new benchmarks for structs:
Benchmarks
## [0] Tesla T4
| NumRows | Depth | Nulls | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-----------|---------|---------|------------|-------------|------------|-------------|------------|---------|----------|
| 2^10 | 0 | 0 | 63.517 us | 73.30% | 50.508 us | 3.83% | -13.010 us | -20.48% | FAIL |
| 2^18 | 0 | 0 | 171.454 us | 1.24% | 172.818 us | 1.42% | 1.364 us | 0.80% | PASS |
| 2^26 | 0 | 0 | 27.694 ms | 0.06% | 27.732 ms | 0.10% | 38.183 us | 0.14% | FAIL |
| 2^10 | 1 | 0 | 284.587 us | 12.50% | 367.027 us | 0.72% | 82.441 us | 28.97% | FAIL |
| 2^18 | 1 | 0 | 1.196 ms | 0.29% | 1.919 ms | 1.23% | 723.708 us | 60.52% | FAIL |
| 2^26 | 1 | 0 | 332.593 ms | 0.70% | 569.225 ms | 0.49% | 236.632 ms | 71.15% | FAIL |
| 2^10 | 8 | 0 | 668.827 us | 0.18% | 675.271 us | 0.65% | 6.444 us | 0.96% | FAIL |
| 2^18 | 8 | 0 | 3.317 ms | 0.26% | 3.954 ms | 1.25% | 637.735 us | 19.23% | FAIL |
| 2^26 | 8 | 0 | 830.883 ms | 0.26% | 1.096 s | 0.28% | 265.034 ms | 31.90% | FAIL |
| 2^10 | 0 | 1 | 82.359 us | 0.82% | 82.959 us | 1.91% | 0.601 us | 0.73% | PASS |
| 2^18 | 0 | 1 | 485.130 us | 0.86% | 502.331 us | 1.43% | 17.201 us | 3.55% | FAIL |
| 2^26 | 0 | 1 | 188.069 ms | 0.57% | 194.755 ms | 0.93% | 6.687 ms | 3.56% | FAIL |
| 2^10 | 1 | 1 | 299.367 us | 0.36% | 380.822 us | 0.55% | 81.455 us | 27.21% | FAIL |
| 2^18 | 1 | 1 | 1.215 ms | 0.48% | 1.953 ms | 1.44% | 737.560 us | 60.70% | FAIL |
| 2^26 | 1 | 1 | 339.462 ms | 0.34% | 578.125 ms | 0.46% | 238.664 ms | 70.31% | FAIL |
| 2^10 | 8 | 1 | 680.116 us | 4.05% | 688.269 us | 0.53% | 8.153 us | 1.20% | FAIL |
| 2^18 | 8 | 1 | 3.304 ms | 0.46% | 3.997 ms | 1.25% | 692.167 us | 20.95% | FAIL |
| 2^26 | 8 | 1 | 839.208 ms | 0.27% | 1.100 s | 0.17% | 261.256 ms | 31.13% | FAIL |
and for sorting primitive types:
Benchmarks
Benchmark Time CPU Time Old Time New CPU Old CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------
Sort<false>/unstable_no_nulls/1024/1/manual_time +0.0047 +0.0054 0 0 0 0
Sort<false>/unstable_no_nulls/4096/1/manual_time +0.0180 +0.0131 0 0 0 0
Sort<false>/unstable_no_nulls/32768/1/manual_time -0.0016 +0.0017 0 0 0 0
Sort<false>/unstable_no_nulls/262144/1/manual_time -0.0311 -0.0255 0 0 0 0
Sort<false>/unstable_no_nulls/2097152/1/manual_time -0.0011 -0.0007 1 1 1 1
Sort<false>/unstable_no_nulls/16777216/1/manual_time +0.0014 +0.0015 7 7 7 7
Sort<false>/unstable_no_nulls/67108864/1/manual_time +0.0003 +0.0003 28 28 28 28
Sort<false>/unstable_no_nulls/1024/8/manual_time -0.0333 -0.0321 1 1 1 1
Sort<false>/unstable_no_nulls/4096/8/manual_time -0.0625 -0.0608 1 1 1 1
Sort<false>/unstable_no_nulls/32768/8/manual_time -0.0439 -0.0427 1 1 1 1
Sort<false>/unstable_no_nulls/262144/8/manual_time -0.0212 -0.0210 3 3 3 3
Sort<false>/unstable_no_nulls/2097152/8/manual_time -0.0090 -0.0090 53 53 53 53
Sort<false>/unstable_no_nulls/16777216/8/manual_time -0.0016 -0.0016 673 672 673 672
Sort<false>/unstable_no_nulls/67108864/8/manual_time -0.0005 -0.0006 3506 3504 3506 3504
Sort<true>/stable_no_nulls/1024/1/manual_time +0.0139 +0.0073 0 0 0 0
Sort<true>/stable_no_nulls/4096/1/manual_time +0.0145 +0.0103 0 0 0 0
Sort<true>/stable_no_nulls/32768/1/manual_time +0.0075 +0.0078 0 0 0 0
Sort<true>/stable_no_nulls/262144/1/manual_time -0.0300 -0.0243 0 0 0 0
Sort<true>/stable_no_nulls/2097152/1/manual_time -0.0027 -0.0024 1 1 1 1
Sort<true>/stable_no_nulls/16777216/1/manual_time +0.0028 +0.0029 7 7 7 7
Sort<true>/stable_no_nulls/67108864/1/manual_time +0.0003 +0.0003 28 28 28 28
Sort<true>/stable_no_nulls/1024/8/manual_time -0.0329 -0.0316 1 1 1 1
Sort<true>/stable_no_nulls/4096/8/manual_time -0.0573 -0.0555 1 1 1 1
Sort<true>/stable_no_nulls/32768/8/manual_time -0.0399 -0.0388 1 1 1 1
Sort<true>/stable_no_nulls/262144/8/manual_time -0.0226 -0.0224 3 3 3 3
Sort<true>/stable_no_nulls/2097152/8/manual_time -0.0069 -0.0069 53 53 53 53
Sort<true>/stable_no_nulls/16777216/8/manual_time -0.0002 -0.0001 673 672 673 672
Sort<true>/stable_no_nulls/67108864/8/manual_time -0.0007 -0.0007 3507 3504 3507 3504
Sort<false>/unstable/1024/1/manual_time +0.0016 +0.0032 0 0 0 0
Sort<false>/unstable/4096/1/manual_time +0.0191 +0.0183 0 0 0 0
Sort<false>/unstable/32768/1/manual_time +0.0431 +0.0414 0 0 0 0
Sort<false>/unstable/262144/1/manual_time -0.0219 -0.0201 0 0 1 1
Sort<false>/unstable/2097152/1/manual_time -0.0212 -0.0211 5 5 5 5
Sort<false>/unstable/16777216/1/manual_time -0.0200 -0.0200 44 43 44 43
Sort<false>/unstable/67108864/1/manual_time -0.0173 -0.0173 182 179 183 179
Sort<false>/unstable/1024/8/manual_time -0.0048 -0.0047 1 1 1 1
Sort<false>/unstable/4096/8/manual_time -0.0368 -0.0362 1 1 1 1
Sort<false>/unstable/32768/8/manual_time -0.0155 -0.0151 2 2 2 2
Sort<false>/unstable/262144/8/manual_time -0.0211 -0.0210 4 4 4 4
Sort<false>/unstable/2097152/8/manual_time -0.0061 -0.0061 61 60 61 60
Sort<false>/unstable/16777216/8/manual_time +0.0028 +0.0028 875 878 875 877
Sort<false>/unstable/67108864/8/manual_time -0.0018 -0.0018 5017 5008 5017 5008
Sort<true>/stable/1024/1/manual_time +0.0039 +0.0024 0 0 0 0
Sort<true>/stable/4096/1/manual_time +0.0191 +0.0173 0 0 0 0
Sort<true>/stable/32768/1/manual_time +0.0409 +0.0384 0 0 0 0
Sort<true>/stable/262144/1/manual_time -0.0197 -0.0183 1 0 1 1
Sort<true>/stable/2097152/1/manual_time -0.0214 -0.0214 6 5 6 5
Sort<true>/stable/16777216/1/manual_time -0.0186 -0.0186 44 44 45 44
Sort<true>/stable/67108864/1/manual_time -0.0177 -0.0177 184 181 184 181
Sort<true>/stable/1024/8/manual_time -0.0092 -0.0091 1 1 1 1
Sort<true>/stable/4096/8/manual_time -0.0374 -0.0367 1 1 1 1
Sort<true>/stable/32768/8/manual_time -0.0172 -0.0168 2 2 2 2
Sort<true>/stable/262144/8/manual_time -0.0201 -0.0200 4 4 4 4
Sort<true>/stable/2097152/8/manual_time -0.0048 -0.0048 61 61 61 61
Sort<true>/stable/16777216/8/manual_time +0.0028 +0.0028 876 878 875 878
Sort<true>/stable/67108864/8/manual_time -0.0016 -0.0016 5021 5013 5021 5013
Sort/strings/1024/manual_time -0.0176 -0.0174 1 1 1 1
Sort/strings/4096/manual_time +0.0053 +0.0054 2 2 2 2
Sort/strings/32768/manual_time +0.0012 +0.0012 2 2 2 2
Sort/strings/262144/manual_time -0.0066 -0.0066 8 8 8 8
Sort/strings/2097152/manual_time -0.0106 -0.0106 95 94 95 94
Sort/strings/16777216/manual_time -0.0183 -0.0183 862 846 862 846
Some of my changes helped the struct performance a little bit, and the primitive types are no longer seeing any slowdown from this PR. The struct performance slowing down is fine IMO. The alternative is adding more template specializations to branch between only-list, only-struct, and list-and-struct comparators, which is more complexity than is merited. I think usage of any nested type can incur the performance hit for all nested types and that's OK.
There are still some pieces of this code that aren't the cleanest and could use improvement. In particular, `lexicographic::device_row_comparator` is now quite different from `equality::device_row_comparator` due to the templating, and use of the lex comparator now requires all calling code to check whether nested columns are present or not. There are a couple of ways that I can think of cleaning up this code, but given how long this PR has been sitting I would prefer that we merge this as is (assuming reviewers are happy with other aspects of the code in its current state) and address those cleanup tasks in follow-up PRs.
> The struct performance slowing down is fine IMO.
I'm starting to review this PR. The statement above makes me a bit worried. There are many cases comparing non-nested structs (struct of basic types) in Spark using the flattening approach + basic lex comparator. Switching to using the new lex comparator for non-nested structs could cause a lot of performance regression. The Spark team may scream in panic if they see their benchmarks slow down due to it.
As per my argument above, if this PR changes the performance of the existing APIs (like binary search and sorting) then I would suggest not to merge this until it has been tested with the performance benchmark in spark-rapids and got confirmation of a safe-merge.
> As per my argument above, if this PR changes the performance of the existing APIs (like binary search and sorting) then I would suggest not to merge this until it has been tested with the performance benchmark in spark-rapids and got confirmation of a safe-merge.
This makes me sad, but it is a good point. I agree that we probably want to get this tested by spark-rapids. Can you follow up with the appropriate person (perhaps @abellina?) to get those benchmarks run?
The old comparator will eventually be phased out entirely in favor of the new comparator. If there are cases where the new comparator's performance measures up poorly against the old one, then we'll have to consider ways to optimize it. I take your point about the new comparator perhaps slowing down the case of structs that don't contain other structs, but I'm honestly not sure how far we want to go here. Truly providing peak performance for every possible combination would probably require us to have many different template overloads of this type, each with different operator overloads enabled or disabled from compilation (lists without structs, structs without lists, no lists or structs, different types of mixed structs/lists) and perhaps even based on whether or not nulls are present. Going down this route will, I suspect, very quickly start to incur a much higher maintenance burden than what we currently have since these performance characteristics are also likely to change with each compiler version. In my view it is expected that enabling more features for comparing nested types could have performance costs, and to some extent that is acceptable.
CC @GregoryKimball
@gpucibot merge