cudf icon indicating copy to clipboard operation
cudf copied to clipboard

List lexicographic comparator

Open devavret opened this issue 3 years ago • 16 comments

Contributes to #10186

This PR enables lexicographic comparisons between list columns. The comparison is robust to arbitrary levels of nesting, but does not yet support lists of (lists of...) structs. The comparison is based on the Dremel encoding already in use in the Parquet file format. To assist reviewers, here is a reasonably complete list of the changes in this PR:

  1. A helper function to get per-column Dremel data (for list columns) when constructing a preprocessed table, which now owns the Dremel data.
  2. Updating the set of lexicographically compatible columns to now include list columns as long as they do not have any nested structs within.
  3. An implementation of lexicographic::device_row_comparator::operator() for lists. This function is the heart of the change to enable comparisons between list columns.
  4. A new benchmark for sorting that uses list data.
  5. An update to a preexisting rolling collect set test that previously failed (because it requires list comparison) but now works.
  6. New tests for list comparison.

devavret avatar Jun 21 '22 19:06 devavret

@bdice @ttnghia I think this is ready for another round of review now. I believe that I've addressed the review requests.

vyasr avatar Jul 28 '22 00:07 vyasr

Codecov Report

:exclamation: No coverage uploaded for pull request base (branch-22.10@03f3543). Click here to learn what that means. Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.10   #11129   +/-   ##
===============================================
  Coverage                ?   86.41%           
===============================================
  Files                   ?      145           
  Lines                   ?    23000           
  Branches                ?        0           
===============================================
  Hits                    ?    19876           
  Misses                  ?     3124           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Jul 28 '22 01:07 codecov[bot]

I wanted to post a couple nice links about Dremel encoding, since this is a core part of this PR.

Summary of the encoding with nice diagrams: https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet Original paper: https://research.google/pubs/pub36632/

The short description is that repetition levels encode hierarchy (nested depth), and definition levels encode nullity.

bdice avatar Jul 28 '22 17:07 bdice

I didn't review all code yet. Given the time constraint, and no C++ reviewers approved yet, I'm a bit worried if we can merge this by tomorrow as a careful review may take much more time in several more review iterations. Although the code is already good, there may be some nits and optimizations that need to be addressed.

I'm totally fine with merging this first then polish later in a follow-up PR, given that all unit tests in this PR pass. If other people agree, let's concentrate all the remaining time on improving test coverage. I understand that this is a bit risky thus it may or may not be accepted. So in the meantime, I'll continue reviewing all the code.

ttnghia avatar Jul 28 '22 17:07 ttnghia

I wanted to post a couple nice links about Dremel encoding, since this is a core part of this PR.

Summary of the encoding with nice diagrams: https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet Original paper: https://research.google/pubs/pub36632/

The short description is that repetition levels encode hierarchy (nested depth), and definition levels encode nullity.

@bdice FYI I tossed a few similar links into the docstring for get_dremel_data.

vyasr avatar Jul 28 '22 20:07 vyasr

To summarize the results of our offline discussion:

  • We are going to push this PR to 22.10 to avoid any unforeseen performance implications for other code paths that are already using experimental::lexicographic::[self|two_table]_comparator (e.g. struct binaryops).
  • All refactoring associated with code in dremel.hpp or dremel.cu will be done in a separate PR after this PR is merged. I will leave any review comments on this PR requesting that kind of refactoring open so that those tasks can be addressed and the conversations resolved later (for tracking purposes).
  • The primary remaining focus of this PR is adding more tests to validate the behavior of this code on edge cases and benchmarking thoroughly to understand the performance implications of these changes.

vyasr avatar Jul 29 '22 00:07 vyasr

Here are the benchmarks results for sorting structs:

Benchmarks

|  NumRows  |  Depth  |  Nulls  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-----------|---------|---------|------------|-------------|------------|-------------|------------|---------|----------|
|   2^10    |    0    |    0    |  63.517 us |      73.30% |  48.075 us |       2.06% | -15.442 us | -24.31% |   FAIL   |
|   2^18    |    0    |    0    | 171.454 us |       1.24% | 177.255 us |       1.14% |   5.800 us |   3.38% |   FAIL   |
|   2^26    |    0    |    0    |  27.694 ms |       0.06% |  27.711 ms |       0.12% |  16.778 us |   0.06% |   FAIL   |
|   2^10    |    1    |    0    | 284.587 us |      12.50% | 361.627 us |       0.35% |  77.040 us |  27.07% |   FAIL   |
|   2^18    |    1    |    0    |   1.196 ms |       0.29% |   2.164 ms |       1.76% | 967.938 us |  80.95% |   FAIL   |
|   2^26    |    1    |    0    | 332.593 ms |       0.70% | 660.864 ms |       0.33% | 328.271 ms |  98.70% |   FAIL   |
|   2^10    |    8    |    0    | 668.827 us |       0.18% | 668.484 us |       0.35% |  -0.343 us |  -0.05% |   PASS   |
|   2^18    |    8    |    0    |   3.317 ms |       0.26% |   4.268 ms |       1.43% | 951.289 us |  28.68% |   FAIL   |
|   2^26    |    8    |    0    | 830.883 ms |       0.26% |    1.212 s |       0.34% | 381.604 ms |  45.93% |   FAIL   |
|   2^10    |    0    |    1    |  82.359 us |       0.82% |  81.364 us |       0.94% |  -0.995 us |  -1.21% |   FAIL   |
|   2^18    |    0    |    1    | 485.130 us |       0.86% | 504.918 us |       6.73% |  19.787 us |   4.08% |   FAIL   |
|   2^26    |    0    |    1    | 188.069 ms |       0.57% | 195.386 ms |       1.01% |   7.318 ms |   3.89% |   FAIL   |
|   2^10    |    1    |    1    | 299.367 us |       0.36% | 377.542 us |       0.39% |  78.175 us |  26.11% |   FAIL   |
|   2^18    |    1    |    1    |   1.215 ms |       0.48% |   2.188 ms |       2.20% | 972.830 us |  80.06% |   FAIL   |
|   2^26    |    1    |    1    | 339.462 ms |       0.34% | 667.672 ms |       0.50% | 328.210 ms |  96.69% |   FAIL   |
|   2^10    |    8    |    1    | 680.116 us |       4.05% | 682.444 us |       0.37% |   2.328 us |   0.34% |   PASS   |
|   2^18    |    8    |    1    |   3.304 ms |       0.46% |   4.293 ms |       1.14% | 988.216 us |  29.91% |   FAIL   |
|   2^26    |    8    |    1    | 839.208 ms |       0.27% |    1.209 s |       0.15% | 369.619 ms |  44.04% |   FAIL   |

and for sorting primitive types:

Benchmarks

Comparing benches/SORT_BENCH_branch-22.10.json to benches/SORT_BENCH_list_lex_branch.json                                                                                                                  
Benchmark                                                              Time             CPU      Time Old      Time New       CPU Old       CPU New                                                        
---------------------------------------------------------------------------------------------------------------------------------------------------                                                        
Sort<false>/unstable_no_nulls/1024/1/manual_time                    -0.0065         -0.0051             0             0             0             0                                                        
Sort<false>/unstable_no_nulls/4096/1/manual_time                    +0.0020         +0.0003             0             0             0             0                                                        
Sort<false>/unstable_no_nulls/32768/1/manual_time                   +0.0031         +0.0020             0             0             0             0                                                        
Sort<false>/unstable_no_nulls/262144/1/manual_time                  -0.0026         -0.0016             0             0             0             0  
Sort<false>/unstable_no_nulls/2097152/1/manual_time                 +0.0008         +0.0006             1             1             1             1
Sort<false>/unstable_no_nulls/16777216/1/manual_time                +0.0014         +0.0014             7             7             7             7
Sort<false>/unstable_no_nulls/67108864/1/manual_time                +0.0004         +0.0004            28            28            28            28
Sort<false>/unstable_no_nulls/1024/8/manual_time                    +0.3182         +0.3066             1             1             1             1
Sort<false>/unstable_no_nulls/4096/8/manual_time                    +0.6280         +0.6121             1             2             1             2
Sort<false>/unstable_no_nulls/32768/8/manual_time                   +0.8153         +0.7980             1             2             1             2
Sort<false>/unstable_no_nulls/262144/8/manual_time                  +0.8925         +0.8856             3             6             3             6
Sort<false>/unstable_no_nulls/2097152/8/manual_time                 +0.4395         +0.4393            53            76            53            76
Sort<false>/unstable_no_nulls/16777216/8/manual_time                +0.3813         +0.3813           673           929           673           929
Sort<false>/unstable_no_nulls/67108864/8/manual_time                +0.3413         +0.3413          3506          4703          3506          4702
Sort<true>/stable_no_nulls/1024/1/manual_time                       +0.0045         +0.0006             0             0             0             0
Sort<true>/stable_no_nulls/4096/1/manual_time                       -0.0028         -0.0036             0             0             0             0
Sort<true>/stable_no_nulls/32768/1/manual_time                      -0.0003         -0.0008             0             0             0             0
Sort<true>/stable_no_nulls/262144/1/manual_time                     -0.0027         -0.0014             0             0             0             0
Sort<true>/stable_no_nulls/2097152/1/manual_time                    +0.0010         +0.0010             1             1             1             1
Sort<true>/stable_no_nulls/16777216/1/manual_time                   +0.0002         +0.0002             7             7             7             7
Sort<true>/stable_no_nulls/67108864/1/manual_time                   +0.0006         +0.0006            28            28            28            28
Sort<true>/stable_no_nulls/1024/8/manual_time                       +0.3178         +0.3065             1             1             1             1
Sort<true>/stable_no_nulls/4096/8/manual_time                       +0.6353         +0.6198             1             2             1             2
Sort<true>/stable_no_nulls/32768/8/manual_time                      +0.8233         +0.8060             1             2             1             2
Sort<true>/stable_no_nulls/262144/8/manual_time                     +0.8936         +0.8867             3             6             3             6
Sort<true>/stable_no_nulls/2097152/8/manual_time                    +0.4424         +0.4422            53            77            53            77
Sort<true>/stable_no_nulls/16777216/8/manual_time                   +0.3858         +0.3858           673           932           673           932
Sort<true>/stable_no_nulls/67108864/8/manual_time                   +0.3435         +0.3435          3507          4711          3507          4711
Sort<false>/unstable/1024/1/manual_time                             -0.0004         -0.0007             0             0             0             0
Sort<false>/unstable/4096/1/manual_time                             +0.0124         +0.0109             0             0             0             0
Sort<false>/unstable/32768/1/manual_time                            +0.0381         +0.0337             0             0             0             0
Sort<false>/unstable/262144/1/manual_time                           -0.0168         -0.0160             0             0             1             1
Sort<false>/unstable/2097152/1/manual_time                          -0.0183         -0.0183             5             5             5             5
Sort<false>/unstable/16777216/1/manual_time                         -0.0119         -0.0119            44            43            44            43
Sort<false>/unstable/67108864/1/manual_time                         -0.0147         -0.0147           182           180           183           180
Sort<false>/unstable/1024/8/manual_time                             +0.3256         +0.3155             1             1             1             1
Sort<false>/unstable/4096/8/manual_time                             +0.5155         +0.5053             1             2             1             2
Sort<false>/unstable/32768/8/manual_time                            +0.6511         +0.6408             2             3             2             3
Sort<false>/unstable/262144/8/manual_time                           +0.7431         +0.7385             4             7             4             7
Sort<false>/unstable/2097152/8/manual_time                          +0.4393         +0.4391            61            87            61            87
Sort<false>/unstable/16777216/8/manual_time                         +0.3519         +0.3519           875          1183           875          1183
Sort<false>/unstable/67108864/8/manual_time                         +0.2847         +0.2847          5017          6446          5017          6445
Sort<true>/stable/1024/1/manual_time                                +0.0004         -0.0023             0             0             0             0
Sort<true>/stable/4096/1/manual_time                                +0.0094         +0.0086             0             0             0             0
Sort<true>/stable/32768/1/manual_time                               +0.0369         +0.0323             0             0             0             0
Sort<true>/stable/262144/1/manual_time                              -0.0108         -0.0106             1             0             1             1
Sort<true>/stable/2097152/1/manual_time                             -0.0152         -0.0152             6             5             6             5
Sort<true>/stable/16777216/1/manual_time                            -0.0138         -0.0138            44            44            45            44
Sort<true>/stable/67108864/1/manual_time                            -0.0112         -0.0112           184           182           184           182
Sort<true>/stable/1024/8/manual_time                                +0.3161         +0.3064             1             1             1             1
Sort<true>/stable/4096/8/manual_time                                +0.5174         +0.5076             1             2             1             2
Sort<true>/stable/32768/8/manual_time                               +0.6486         +0.6384             2             3             2             3
Sort<true>/stable/262144/8/manual_time                              +0.7375         +0.7330             4             7             4             7
Sort<true>/stable/2097152/8/manual_time                             +0.4439         +0.4437            61            88            61            88
Sort<true>/stable/16777216/8/manual_time                            +0.3558         +0.3558           876          1187           875          1187
Sort<true>/stable/67108864/8/manual_time                            +0.2855         +0.2855          5021          6455          5021          6454
Sort/strings/1024/manual_time                                       -0.0066         -0.0066             1             1             1             1
Sort/strings/4096/manual_time                                       +0.0040         +0.0039             2             2             2             2
Sort/strings/32768/manual_time                                      -0.0001         -0.0001             2             2             2             2
Sort/strings/262144/manual_time                                     -0.0069         -0.0069             8             8             8             8
Sort/strings/2097152/manual_time                                    -0.0068         -0.0068            95            94            95            94
Sort/strings/16777216/manual_time                                   -0.0089         -0.0089           862           854           862           854

~I think we can safely say that primitive types are unaffected~, but the struct results certainly look suspicious. This probably merits some further investigation, since I could easily have done something wrong, but for now I'm glad we're not pushing this through at this moment.

EDIT: The claim about primitive types is completely bogus. I don't know what I did the first time around, probably forgot to recompile, but based on multiple checks now primitive type performance definitely is affected.

vyasr avatar Jul 29 '22 01:07 vyasr

It does indeed look like simply commenting out the list_view specialization of the code makes the struct code path faster. The overhead of constructing the Dremel data is negligible, I can reproduce this with just commenting out that specialization. I'll investigate further to see if there are inlining issues.

vyasr avatar Jul 29 '22 15:07 vyasr

It does indeed look like simply commenting out the list_view specialization of the code makes the struct code path faster. The overhead of constructing the Dremel data is negligible, I can reproduce this with just commenting out that specialization. I'll investigate further to see if there are inlining issues.

I'm guessing that code path is inflating your register usage and you're falling off an occupancy ledge.

jrhemstad avatar Jul 29 '22 21:07 jrhemstad

It does indeed look like simply commenting out the list_view specialization of the code makes the struct code path faster. The overhead of constructing the Dremel data is negligible, I can reproduce this with just commenting out that specialization. I'll investigate further to see if there are inlining issues.

I'm guessing that code path is inflating your register usage and you're falling off an occupancy ledge.

Yeah that's part of what I'm seeing, it balloons register usage by about 30-40. It also substantially increases the stack usage. Since this is all using thrust algorithms, and given the nested structures of the comparators, it's a bit tricky to test approaches to alleviate the issue (although of course we can't prevent it entirely). We can't control the launch bounds for the kernel directly on a thrust invocation. Furthermore, since we see increased register and stack usage it seems like any change around forceinlining is going to be trading off between these two effects unless the compiler inlines code in a way that simultaneously reduces the stack and allows more efficient reuse of registers. I'm hunting for whether there's maybe more const data that could be passed correctly to help the compiler reduce the amount of local state, and whether there's ways to rewrite the internals of the operator in a way that gives the compiler more hints as to how much local state really needs to be maintained.

vyasr avatar Jul 29 '22 21:07 vyasr

I spent some time today investigating different versions of the compilation and got very suspicious at how my benchmarks could possibly have shown that primitive types were unaffected when every version of the kernel looked much more resource-hungry than before. After running through the benchmarks again multiple times today, there was definitely something wrong with how I ran the benchmarks the first time. The new code path definitely adversely affects performance for primitive types. I'm not sure what I did the first time around, but this result definitely makes more sense given that the complexity of the kernel is higher by default for all types. I'm going to do some cleanup to address the outstanding comments while I brainstorm potential solutions, but we definitely do not want to merge this in its current form.

vyasr avatar Aug 04 '22 00:08 vyasr

To minimize the amount of friction between this PR, ongoing Parquet development, and the desire to refactor Dremel encoding code, I have separated out the Dremel changes into #11461. We can review and merge that separately, then merge those changes down into this PR so that we have a minimal relevant changeset here to evaluate.

vyasr avatar Aug 04 '22 00:08 vyasr

The new code path definitely adversely affects performance for primitive types.

How much are we talking?

jrhemstad avatar Aug 04 '22 02:08 jrhemstad

The comparison results are in this earlier comment. I always put benchmark results into dropdowns that look like

Benchmarks

Benchmarks go here

because I find GH PR conversations filled with benchmark results to be too cluttered to be readable, but maybe I'm just hiding them entirely. Did you (or any other reviewers) even notice that was there?

vyasr avatar Aug 04 '22 15:08 vyasr

So I've tried giving the compiler more explicit hints that the list and struct comparator code paths will never call to each other (e.g. by separating out the primitive comparison logic into a separate type so that no branch of the element_comparator ever creates other element_comparator instances), but no luck so far. It seems like dispatch_void_if_nested is doing its job. I'm now working through addressing review comments and simplifying the code in hopes of reducing register usage that way, but haven't had much luck. The total complexity of the device function just seems too high now.

Our best option at the moment seems like separating out entirely the code paths for nested and non-nested data. This approach works, but has a number of downsides: -We will have two separate types of comparators and it'll be up to calling code to decide whether to instantiate the constructor that supports nested types or not.

  • The lexicographic comparators will now diverge from our equality comparators since there won't just be a single class anymore.
  • Calling code will now have to determine which comparator to call beforehand, so anything using these comparators will now become a little more complex.

I would have been fine with this approach if it was clearly temporary, but at the moment I don't have an alternate solution to propose, so it's not like this would be a temporary patch awaiting some slightly more complex and time-consuming fix. If we go with the split comparator approach, I think we have to assume that it's not a temporary solution, and then if someone comes up with an alternative in the future we consider ourselves lucky. That makes me significantly more uneasy.

vyasr avatar Aug 04 '22 19:08 vyasr

I'm afraid that I wasn't quite able to get through this today. There are two main outstanding tasks here:

  1. Split the device_row_comparator into two implementations, one that supports nesting and one that does not. This also needs to propagate to the less method of the self_comparator and the two_table_comparator.
  2. Find a better way to pass along the dremel_device_view vector that handles the non-list columns. Currently non-list columns default construct the struct, which places it in an invalid state, and we just rely on the comparator never dereferencing those elements of the vector. We should probably use something like a thrust::optional here, but I had issues getting device_span and thrust::optional to play nicely together.

vyasr avatar Aug 05 '22 22:08 vyasr

I plan to pick this back up this week.

vyasr avatar Aug 29 '22 22:08 vyasr

OK, I've managed to resolve the two major concerns above.

Here are the new benchmarks for structs:

Benchmarks

## [0] Tesla T4

|  NumRows  |  Depth  |  Nulls  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-----------|---------|---------|------------|-------------|------------|-------------|------------|---------|----------|
|   2^10    |    0    |    0    |  63.517 us |      73.30% |  50.508 us |       3.83% | -13.010 us | -20.48% |   FAIL   |
|   2^18    |    0    |    0    | 171.454 us |       1.24% | 172.818 us |       1.42% |   1.364 us |   0.80% |   PASS   |
|   2^26    |    0    |    0    |  27.694 ms |       0.06% |  27.732 ms |       0.10% |  38.183 us |   0.14% |   FAIL   |
|   2^10    |    1    |    0    | 284.587 us |      12.50% | 367.027 us |       0.72% |  82.441 us |  28.97% |   FAIL   |
|   2^18    |    1    |    0    |   1.196 ms |       0.29% |   1.919 ms |       1.23% | 723.708 us |  60.52% |   FAIL   |
|   2^26    |    1    |    0    | 332.593 ms |       0.70% | 569.225 ms |       0.49% | 236.632 ms |  71.15% |   FAIL   |
|   2^10    |    8    |    0    | 668.827 us |       0.18% | 675.271 us |       0.65% |   6.444 us |   0.96% |   FAIL   |
|   2^18    |    8    |    0    |   3.317 ms |       0.26% |   3.954 ms |       1.25% | 637.735 us |  19.23% |   FAIL   |
|   2^26    |    8    |    0    | 830.883 ms |       0.26% |    1.096 s |       0.28% | 265.034 ms |  31.90% |   FAIL   |
|   2^10    |    0    |    1    |  82.359 us |       0.82% |  82.959 us |       1.91% |   0.601 us |   0.73% |   PASS   |
|   2^18    |    0    |    1    | 485.130 us |       0.86% | 502.331 us |       1.43% |  17.201 us |   3.55% |   FAIL   |
|   2^26    |    0    |    1    | 188.069 ms |       0.57% | 194.755 ms |       0.93% |   6.687 ms |   3.56% |   FAIL   |
|   2^10    |    1    |    1    | 299.367 us |       0.36% | 380.822 us |       0.55% |  81.455 us |  27.21% |   FAIL   |
|   2^18    |    1    |    1    |   1.215 ms |       0.48% |   1.953 ms |       1.44% | 737.560 us |  60.70% |   FAIL   |
|   2^26    |    1    |    1    | 339.462 ms |       0.34% | 578.125 ms |       0.46% | 238.664 ms |  70.31% |   FAIL   |
|   2^10    |    8    |    1    | 680.116 us |       4.05% | 688.269 us |       0.53% |   8.153 us |   1.20% |   FAIL   |
|   2^18    |    8    |    1    |   3.304 ms |       0.46% |   3.997 ms |       1.25% | 692.167 us |  20.95% |   FAIL   |
|   2^26    |    8    |    1    | 839.208 ms |       0.27% |    1.100 s |       0.17% | 261.256 ms |  31.13% |   FAIL   |

and for sorting primitive types:

Benchmarks

Benchmark                                                              Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------
Sort<false>/unstable_no_nulls/1024/1/manual_time                    +0.0047         +0.0054             0             0             0             0
Sort<false>/unstable_no_nulls/4096/1/manual_time                    +0.0180         +0.0131             0             0             0             0
Sort<false>/unstable_no_nulls/32768/1/manual_time                   -0.0016         +0.0017             0             0             0             0
Sort<false>/unstable_no_nulls/262144/1/manual_time                  -0.0311         -0.0255             0             0             0             0
Sort<false>/unstable_no_nulls/2097152/1/manual_time                 -0.0011         -0.0007             1             1             1             1
Sort<false>/unstable_no_nulls/16777216/1/manual_time                +0.0014         +0.0015             7             7             7             7
Sort<false>/unstable_no_nulls/67108864/1/manual_time                +0.0003         +0.0003            28            28            28            28
Sort<false>/unstable_no_nulls/1024/8/manual_time                    -0.0333         -0.0321             1             1             1             1
Sort<false>/unstable_no_nulls/4096/8/manual_time                    -0.0625         -0.0608             1             1             1             1
Sort<false>/unstable_no_nulls/32768/8/manual_time                   -0.0439         -0.0427             1             1             1             1
Sort<false>/unstable_no_nulls/262144/8/manual_time                  -0.0212         -0.0210             3             3             3             3
Sort<false>/unstable_no_nulls/2097152/8/manual_time                 -0.0090         -0.0090            53            53            53            53
Sort<false>/unstable_no_nulls/16777216/8/manual_time                -0.0016         -0.0016           673           672           673           672
Sort<false>/unstable_no_nulls/67108864/8/manual_time                -0.0005         -0.0006          3506          3504          3506          3504
Sort<true>/stable_no_nulls/1024/1/manual_time                       +0.0139         +0.0073             0             0             0             0
Sort<true>/stable_no_nulls/4096/1/manual_time                       +0.0145         +0.0103             0             0             0             0
Sort<true>/stable_no_nulls/32768/1/manual_time                      +0.0075         +0.0078             0             0             0             0
Sort<true>/stable_no_nulls/262144/1/manual_time                     -0.0300         -0.0243             0             0             0             0
Sort<true>/stable_no_nulls/2097152/1/manual_time                    -0.0027         -0.0024             1             1             1             1
Sort<true>/stable_no_nulls/16777216/1/manual_time                   +0.0028         +0.0029             7             7             7             7
Sort<true>/stable_no_nulls/67108864/1/manual_time                   +0.0003         +0.0003            28            28            28            28
Sort<true>/stable_no_nulls/1024/8/manual_time                       -0.0329         -0.0316             1             1             1             1
Sort<true>/stable_no_nulls/4096/8/manual_time                       -0.0573         -0.0555             1             1             1             1
Sort<true>/stable_no_nulls/32768/8/manual_time                      -0.0399         -0.0388             1             1             1             1
Sort<true>/stable_no_nulls/262144/8/manual_time                     -0.0226         -0.0224             3             3             3             3
Sort<true>/stable_no_nulls/2097152/8/manual_time                    -0.0069         -0.0069            53            53            53            53
Sort<true>/stable_no_nulls/16777216/8/manual_time                   -0.0002         -0.0001           673           672           673           672
Sort<true>/stable_no_nulls/67108864/8/manual_time                   -0.0007         -0.0007          3507          3504          3507          3504
Sort<false>/unstable/1024/1/manual_time                             +0.0016         +0.0032             0             0             0             0
Sort<false>/unstable/4096/1/manual_time                             +0.0191         +0.0183             0             0             0             0
Sort<false>/unstable/32768/1/manual_time                            +0.0431         +0.0414             0             0             0             0
Sort<false>/unstable/262144/1/manual_time                           -0.0219         -0.0201             0             0             1             1
Sort<false>/unstable/2097152/1/manual_time                          -0.0212         -0.0211             5             5             5             5
Sort<false>/unstable/16777216/1/manual_time                         -0.0200         -0.0200            44            43            44            43
Sort<false>/unstable/67108864/1/manual_time                         -0.0173         -0.0173           182           179           183           179
Sort<false>/unstable/1024/8/manual_time                             -0.0048         -0.0047             1             1             1             1
Sort<false>/unstable/4096/8/manual_time                             -0.0368         -0.0362             1             1             1             1
Sort<false>/unstable/32768/8/manual_time                            -0.0155         -0.0151             2             2             2             2
Sort<false>/unstable/262144/8/manual_time                           -0.0211         -0.0210             4             4             4             4
Sort<false>/unstable/2097152/8/manual_time                          -0.0061         -0.0061            61            60            61            60
Sort<false>/unstable/16777216/8/manual_time                         +0.0028         +0.0028           875           878           875           877
Sort<false>/unstable/67108864/8/manual_time                         -0.0018         -0.0018          5017          5008          5017          5008
Sort<true>/stable/1024/1/manual_time                                +0.0039         +0.0024             0             0             0             0
Sort<true>/stable/4096/1/manual_time                                +0.0191         +0.0173             0             0             0             0
Sort<true>/stable/32768/1/manual_time                               +0.0409         +0.0384             0             0             0             0
Sort<true>/stable/262144/1/manual_time                              -0.0197         -0.0183             1             0             1             1
Sort<true>/stable/2097152/1/manual_time                             -0.0214         -0.0214             6             5             6             5
Sort<true>/stable/16777216/1/manual_time                            -0.0186         -0.0186            44            44            45            44
Sort<true>/stable/67108864/1/manual_time                            -0.0177         -0.0177           184           181           184           181
Sort<true>/stable/1024/8/manual_time                                -0.0092         -0.0091             1             1             1             1
Sort<true>/stable/4096/8/manual_time                                -0.0374         -0.0367             1             1             1             1
Sort<true>/stable/32768/8/manual_time                               -0.0172         -0.0168             2             2             2             2
Sort<true>/stable/262144/8/manual_time                              -0.0201         -0.0200             4             4             4             4
Sort<true>/stable/2097152/8/manual_time                             -0.0048         -0.0048            61            61            61            61
Sort<true>/stable/16777216/8/manual_time                            +0.0028         +0.0028           876           878           875           878
Sort<true>/stable/67108864/8/manual_time                            -0.0016         -0.0016          5021          5013          5021          5013
Sort/strings/1024/manual_time                                       -0.0176         -0.0174             1             1             1             1
Sort/strings/4096/manual_time                                       +0.0053         +0.0054             2             2             2             2
Sort/strings/32768/manual_time                                      +0.0012         +0.0012             2             2             2             2
Sort/strings/262144/manual_time                                     -0.0066         -0.0066             8             8             8             8
Sort/strings/2097152/manual_time                                    -0.0106         -0.0106            95            94            95            94
Sort/strings/16777216/manual_time                                   -0.0183         -0.0183           862           846           862           846

Some of my changes helped the struct performance a little bit, and the primitive types are no longer seeing any slowdown from this PR. The struct performance slowing down is fine IMO. The alternative is adding more template specializations to branch between only-list, only-struct, and list-and-struct comparators, which is more complexity than is merited. I think usage of any nested type can incur the performance hit for all nested types and that's OK.

There are still some pieces of this code that aren't the cleanest and could use improvement. In particular, lexicographic::device_row_comparator is now quite different from equality::device_row_comparator due to the templating, and use of the lex comparator now requires all calling code to check whether nested columns are present or not. There are a couple of ways that I can think of cleaning up this code, but given how long this PR has been sitting I would prefer that we merge this as is (assuming reviewers are happy with other aspects of the code in its current state) and address those cleanup tasks in follow-up PRs.

vyasr avatar Sep 02 '22 22:09 vyasr

The struct performance slowing down is fine IMO.

I'm starting to review this PR. The statement above makes me a bit worried. There are many cases comparing non-nested structs (struct of basic types) in Spark using the flattening approach + basic lex comparator. Switching to using the new lex comparator for non-nested structs could cause a lot of performance regression. The Spark team may scream in panic if they see their benchmarks slow down due to it.

ttnghia avatar Sep 08 '22 18:09 ttnghia

As per my argument above, if this PR changes the performance of the existing APIs (like binary search and sorting) then I would suggest not to merge this until it has been tested with the performance benchmark in spark-rapids and got confirmation of a safe-merge.

ttnghia avatar Sep 08 '22 18:09 ttnghia

As per my argument above, if this PR changes the performance of the existing APIs (like binary search and sorting) then I would suggest not to merge this until it has been tested with the performance benchmark in spark-rapids and got confirmation of a safe-merge.

This makes me sad, but it is a good point. I agree that we probably want to get this tested by spark-rapids. Can you follow up with the appropriate person (perhaps @abellina?) to get those benchmarks run?

The old comparator will eventually be phased out entirely in favor of the new comparator. If there are cases where the new comparator's performance measures up poorly against the old one, then we'll have to consider ways to optimize it. I take your point about the new comparator perhaps slowing down the case of structs that don't contain other structs, but I'm honestly not sure how far we want to go here. Truly providing peak performance for every possible combination would probably require us to have many different template overloads of this type, each with different operator overloads enabled or disabled from compilation (lists without structs, structs without lists, no lists or structs, different types of mixed structs/lists) and perhaps even based on whether or not nulls are present. Going down this route will, I suspect, very quickly start to incur a much higher maintenance burden than what we currently have since these performance characteristics are also likely to change with each compiler version. In my view it is expected that enabling more features for comparing nested types could have performance costs, and to some extent that is acceptable.

CC @GregoryKimball

vyasr avatar Sep 08 '22 21:09 vyasr

@gpucibot merge

vyasr avatar Sep 12 '22 16:09 vyasr