Fully support nested types in `cudf::contains`
This extends the `cudf::contains` API to support nested types (lists and structs) with arbitrary nesting depth. As a result, `cudf::contains` will work with input data of any supported type.
In addition, this fixes null handling in `cudf::contains` for a structs column with a struct scalar key, in the case where the structs column contains null rows at the top level while the scalar key is valid at the top level but all-null at the child levels.
Closes: https://github.com/rapidsai/cudf/issues/8965

Depends on:
- https://github.com/rapidsai/cudf/pull/10730
- https://github.com/rapidsai/cudf/pull/10883
- https://github.com/rapidsai/cudf/pull/10802
- https://github.com/rapidsai/cudf/pull/10997
- https://github.com/NVIDIA/cuCollections/issues/172
- https://github.com/NVIDIA/cuCollections/issues/173
- https://github.com/rapidsai/cudf/issues/11037
- https://github.com/rapidsai/cudf/pull/11356
Codecov Report
:exclamation: No coverage uploaded for pull request base (branch-22.10@abd4302). Click here to learn what that means. The diff coverage is n/a.
:exclamation: Current head 42d75b1 differs from pull request most recent head 28f115f. Consider uploading reports for the commit 28f115f to get more accurate results.
```
@@            Coverage Diff             @@
##           branch-22.10    #10656   +/- ##
===============================================
  Coverage              ?    86.36%
===============================================
  Files                 ?       145
  Lines                 ?     22949
  Branches              ?         0
===============================================
  Hits                  ?     19820
  Misses                ?      3129
  Partials              ?         0
===============================================
```
Currently holding off due to dependencies. However, a temporary implementation could be pushed just for testing; it will be replaced by a new implementation before the PR is ready for review.
Sorry for the misclick. Just wanted to check some of the commits...
Due to time constraints, we decided to retarget this to 22.08.
Here is the benchmark for https://github.com/rapidsai/cudf/pull/10656#discussion_r886218927, comparing two approaches:
- Insert all haystack indices into the hash map: no row comparisons
- Insert only the indices of the unique rows of the haystack: row equality comparisons using `self_comparator`

The benchmarks were run on a randomly generated structs column with 8 children (4 int children + 4 string children).
From the results, we can observe that with few duplicates, the first approach is faster since it does no row comparisons. With many duplicates (>75%), the second approach is faster, while the first approach becomes slower due to hash collisions and linear probing.
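The trade-off between the two approaches can be sketched on the CPU with standard containers (a hypothetical illustration only; the actual implementation builds a cuco hash map on the GPU):

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Approach 1: insert every haystack row. Duplicates are all stored, so the
// build phase never compares rows, but the table grows with multiplicity and
// probing chains lengthen when there are many duplicates.
std::vector<bool> contains_insert_all(std::vector<int> const& haystack,
                                      std::vector<int> const& needles)
{
  std::unordered_multiset<int> const table(haystack.begin(), haystack.end());
  std::vector<bool> result;
  result.reserve(needles.size());
  for (auto const n : needles) { result.push_back(table.find(n) != table.end()); }
  return result;
}

// Approach 2: insert only unique rows. Insertion pays for equality
// comparisons (cudf uses self_comparator for this), but the table stays small
// and probing chains stay short when the haystack has many duplicates.
std::vector<bool> contains_insert_unique(std::vector<int> const& haystack,
                                         std::vector<int> const& needles)
{
  std::unordered_set<int> const table(haystack.begin(), haystack.end());  // dedup on insert
  std::vector<bool> result;
  result.reserve(needles.size());
  for (auto const n : needles) { result.push_back(table.count(n) > 0); }
  return result;
}
```

Both functions return the same answer; they differ only in build cost versus probe cost, which is exactly the trade-off the tables below measure.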
Haystack column with almost no duplicates
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time +0.2861 +0.2749 0 1 0 1
Search/ColumnContains/4096/manual_time +0.2424 +0.2330 0 1 0 1
Search/ColumnContains/32768/manual_time +0.2863 +0.2782 0 1 1 1
Search/ColumnContains/262144/manual_time +0.3013 +0.3163 1 2 1 2
Search/ColumnContains/2097152/manual_time +0.3324 +0.3322 8 10 8 10
Search/ColumnContains/16777216/manual_time +0.3485 +0.3445 61 82 61 82
Search/ColumnContains/67108864/manual_time +0.3367 +0.3439 247 330 254 341
Haystack column repeated twice (>50% duplicates)
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time +0.1489 +0.1395 0 0 0 0
Search/ColumnContains/4096/manual_time +0.2634 +0.2560 0 1 0 1
Search/ColumnContains/32768/manual_time +0.0975 +0.0890 1 1 1 1
Search/ColumnContains/262144/manual_time +0.2431 +0.2401 1 1 1 1
Search/ColumnContains/2097152/manual_time +0.2031 +0.2029 7 9 7 9
Search/ColumnContains/16777216/manual_time +0.1763 +0.1762 56 66 56 66
Search/ColumnContains/67108864/manual_time +0.1098 +0.1095 243 269 243 270
Haystack column repeated 4 times (>75% duplicates)
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time +0.0650 +0.0569 0 0 0 0
Search/ColumnContains/4096/manual_time +0.1890 +0.1795 0 1 0 1
Search/ColumnContains/32768/manual_time +0.1281 +0.1229 1 1 1 1
Search/ColumnContains/262144/manual_time +0.0550 +0.0530 1 1 1 1
Search/ColumnContains/2097152/manual_time -0.0520 -0.0582 8 8 8 8
Search/ColumnContains/16777216/manual_time -0.1328 -0.1329 69 60 69 60
Search/ColumnContains/67108864/manual_time -0.1936 -0.1944 298 240 300 241
<details>
<summary>Code for benchmark</summary>

```cpp
#include <cudf/column/column_factories.hpp>
#include <cudf/concatenate.hpp>
#include <cudf/search.hpp>
#include <cudf/types.hpp>

static constexpr cudf::size_type num_struct_members = 8;
static constexpr cudf::size_type max_int            = 100;
static constexpr cudf::size_type max_str_length     = 32;

static auto create_random_structs_column(cudf::size_type n_rows)
{
  data_profile table_profile;
  table_profile.set_distribution_params(cudf::type_id::INT32, distribution_id::UNIFORM, 0, max_int);
  table_profile.set_distribution_params(
    cudf::type_id::STRING, distribution_id::NORMAL, 0, max_str_length);

  // The first two struct members are int32 and string.
  // The first column is also used as keys in groupby.
  // The subsequent struct members are int32 and string again.
  auto table = create_random_table(
    cycle_dtypes({cudf::type_id::INT32, cudf::type_id::STRING}, num_struct_members),
    row_count{n_rows},
    table_profile);
  return cudf::make_structs_column(n_rows, table->release(), 0, {});
}

void BM_contains(benchmark::State& state)
{
  auto const size{static_cast<cudf::size_type>(state.range(0))};

  // constexpr cudf::size_type repeat_times = 4;  // <25% unique rows
  // constexpr cudf::size_type repeat_times = 2;  // <50% unique rows
  constexpr int repeat_times = 1;  // <100% unique rows

  auto const needles = create_random_structs_column(size);

  auto haystack        = create_random_structs_column(size / repeat_times);
  auto const haystack0 = std::make_unique<cudf::column>(*haystack);
  for (int i = 0; i < repeat_times - 1; ++i) {
    haystack =
      cudf::concatenate(std::vector<cudf::column_view>{haystack0->view(), haystack->view()});
  }

  for ([[maybe_unused]] auto _ : state) {
    [[maybe_unused]] auto const timer = cuda_event_timer(state, true);
    auto const result                 = cudf::contains(*haystack, *needles);
  }
}

class Search : public cudf::benchmark {
};

BENCHMARK_DEFINE_F(Search, ColumnContains)(::benchmark::State& state) { BM_contains(state); }

BENCHMARK_REGISTER_F(Search, ColumnContains)
  ->RangeMultiplier(8)
  ->Ranges({{1 << 10, 1 << 26}})
  ->UseManualTime()
  ->Unit(benchmark::kMillisecond);
```

</details>
`contains` is just another word for `semi_join`. They should share the same implementation. Can't believe I didn't realize this sooner.
I have filed a FEA issue: https://github.com/rapidsai/cudf/issues/11037
I actually can do that right now in parallel, as the current (semi-, anti-) joins don't need nested types. The joins simply call `cudf::detail::contains` internally.
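The equivalence is easy to see in a CPU sketch (hypothetical, with scalar keys standing in for nested rows): a left semi-join is just `contains` evaluated per left row, followed by a filter that collects the matching indices:

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// A left semi-join keeps exactly the left rows whose key occurs in the right
// table -- i.e. the rows where contains(right_keys, left_row) is true.
// Returns a gather map of matching left-row indices.
std::vector<std::size_t> left_semi_join(std::vector<int> const& left,
                                        std::vector<int> const& right)
{
  std::unordered_set<int> const keys(right.begin(), right.end());
  std::vector<std::size_t> gather_map;
  for (std::size_t i = 0; i < left.size(); ++i) {
    if (keys.count(left[i]) > 0) { gather_map.push_back(i); }
  }
  return gather_map;
}
```

An anti-join is the same loop with the condition negated, which is why both can sit on top of a shared contains implementation.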
@jrhemstad @PointKernel FYI: I ran benchmarks to compare the performance of using `static_map` vs `static_multimap` + `pair_contains`.
Data is generated as structs columns with 2 int children; each value is at most n_rows / 100.
Benchmark with data generated as described above:
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time -0.1596 -0.1455 0 0 0 0
Search/ColumnContains/4096/manual_time -0.1766 -0.1616 0 0 0 0
Search/ColumnContains/32768/manual_time +0.0084 +0.0099 0 0 0 0
Search/ColumnContains/262144/manual_time +0.2606 +0.2500 0 1 0 1
Search/ColumnContains/2097152/manual_time +0.4415 +0.4375 3 5 3 5
Search/ColumnContains/16777216/manual_time +0.2069 +0.2118 35 43 35 43
Search/ColumnContains/67108864/manual_time -0.0245 -0.0197 235 229 234 229
Benchmark with the search space duplicate X2
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time -0.3539 -0.3286 0 0 0 0
Search/ColumnContains/4096/manual_time -0.2558 -0.2365 0 0 0 0
Search/ColumnContains/32768/manual_time -0.2951 -0.2782 0 0 0 0
Search/ColumnContains/262144/manual_time -0.3944 -0.3930 1 0 1 0
Search/ColumnContains/2097152/manual_time -0.4951 -0.5014 6 3 6 3
Search/ColumnContains/16777216/manual_time -0.7331 -0.7345 96 26 96 26
Search/ColumnContains/67108864/manual_time -0.8849 -0.8855 1049 121 1049 120
Benchmark with the search space duplicate X4
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time -0.4520 -0.4227 0 0 0 0
Search/ColumnContains/4096/manual_time -0.5056 -0.4823 0 0 0 0
Search/ColumnContains/32768/manual_time -0.5579 -0.5399 0 0 0 0
Search/ColumnContains/262144/manual_time -0.6034 -0.6015 1 0 1 0
Search/ColumnContains/2097152/manual_time -0.6276 -0.6261 9 3 9 3
Search/ColumnContains/16777216/manual_time -0.7760 -0.7759 120 27 120 27
Search/ColumnContains/67108864/manual_time -0.8998 -0.8998 1226 123 1227 123
Data is generated as a structs column of 8 children (4 int columns in the range [0, n_rows) and 4 string columns with a maximum string length of 10).
Benchmark with data generated as described above:
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time -0.1280 -0.1232 0 0 0 0
Search/ColumnContains/4096/manual_time -0.0265 -0.0221 0 0 0 0
Search/ColumnContains/32768/manual_time +0.1367 +0.1371 0 1 1 1
Search/ColumnContains/262144/manual_time +0.6938 +0.6712 1 2 1 2
Search/ColumnContains/2097152/manual_time +1.0182 +1.0424 7 13 7 13
Search/ColumnContains/16777216/manual_time +1.0218 +1.0189 52 105 52 105
Search/ColumnContains/67108864/manual_time +0.9987 +0.9910 211 421 215 428
Benchmark with the search space duplicate X2
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time -0.0370 -0.0370 0 0 0 0
Search/ColumnContains/4096/manual_time -0.0200 -0.0201 0 0 0 0
Search/ColumnContains/32768/manual_time -0.0199 -0.0224 0 0 1 0
Search/ColumnContains/262144/manual_time +0.1200 +0.1057 1 1 1 1
Search/ColumnContains/2097152/manual_time +0.1476 +0.1416 6 7 6 7
Search/ColumnContains/16777216/manual_time +0.0639 +0.0631 51 54 51 54
Search/ColumnContains/67108864/manual_time -0.0174 -0.0172 223 220 224 220
Benchmark with the search space duplicate X4
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time -0.0972 -0.0940 0 0 0 0
Search/ColumnContains/4096/manual_time -0.0769 -0.0746 0 0 0 0
Search/ColumnContains/32768/manual_time -0.0733 -0.0731 1 0 1 0
Search/ColumnContains/262144/manual_time -0.0386 -0.0479 1 1 1 1
Search/ColumnContains/2097152/manual_time -0.0711 -0.0770 8 7 8 7
Search/ColumnContains/16777216/manual_time -0.1510 -0.1463 68 58 68 58
Search/ColumnContains/67108864/manual_time -0.1886 -0.1847 303 246 302 246
Benchmark with the search space duplicate X10
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time -0.1900 -0.1855 0 0 0 0
Search/ColumnContains/4096/manual_time -0.1939 -0.1901 0 0 0 0
Search/ColumnContains/32768/manual_time -0.1654 -0.1640 1 1 1 1
Search/ColumnContains/262144/manual_time -0.2850 -0.2949 2 1 2 1
Search/ColumnContains/2097152/manual_time -0.4071 -0.4084 13 8 13 8
Search/ColumnContains/16777216/manual_time -0.4767 -0.4769 122 64 122 64
Search/ColumnContains/67108864/manual_time -0.5598 -0.5601 557 245 558 245
So if the search space has few or no duplicates, using `static_multimap` + `pair_contains` is significantly slower? On the other hand, if we have a lot of duplicates, it is much faster.
if the search space has no/little duplicates, using static_multimap + pair_contains is significantly slower
This is expected since multimap is optimized to handle duplicates. That is, if few duplicates are present, the cooperative groups (CGs) we use to improve multimap efficiency can actually become a performance burden. I noticed you are using the default probing scheme with CG size = 8. You may want to tune the CG size (e.g. 2 or 4) a bit to find a better fit for contains. A similar use case in hash join for your reference: https://github.com/rapidsai/cudf/blob/0ddb3d9319426da49d8cb4b9cbb95819dc9b5263/cpp/include/cudf/detail/join.hpp#L62
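A hypothetical CPU model of why CG size matters: a cooperative group of `cg_size` lanes inspects `cg_size` consecutive slots per probing step, so larger groups resolve long collision chains (many duplicates) in fewer steps, while for short chains the extra lanes are wasted work:

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Count probing steps until the group finds either the key or an empty slot.
// Each step, a group of cg_size lanes checks cg_size consecutive slots.
// Assumes the table contains at least one empty slot so probing terminates.
int probe_steps(std::vector<std::optional<int>> const& slots,
                int key,
                std::size_t start,
                int cg_size)
{
  int steps = 0;
  for (std::size_t i = start;; i += cg_size) {
    ++steps;
    for (int lane = 0; lane < cg_size; ++lane) {
      auto const& slot = slots[(i + lane) % slots.size()];
      // An empty slot means "not present"; a matching slot means "found".
      if (!slot.has_value() || *slot == key) { return steps; }
    }
  }
}
```

With a 4-slot collision chain, a group of 4 lanes clears it in 2 steps where a single lane needs 5; with no chain, both finish in 1 step and the larger group has only paid coordination overhead.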
Here is the second benchmark above, run again with CG = 1 and CG = 2.
Data is generated as a structs column of 8 children (4 int columns in the range [0, n_rows) and 4 string columns with a maximum string length of 10).
CG = 1, no duplicate data
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time -0.1104 -0.1073 0 0 0 0
Search/ColumnContains/4096/manual_time -0.0113 -0.0140 0 0 0 0
Search/ColumnContains/32768/manual_time -0.0900 -0.0950 0 0 1 0
Search/ColumnContains/262144/manual_time -0.2322 -0.2329 1 1 1 1
Search/ColumnContains/2097152/manual_time -0.2892 -0.2974 7 5 7 5
Search/ColumnContains/16777216/manual_time -0.2826 -0.2822 51 36 51 36
Search/ColumnContains/67108864/manual_time -0.2918 -0.2938 205 145 209 148
CG = 2, no duplicate data
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time +0.0004 +0.0015 0 0 0 0
Search/ColumnContains/4096/manual_time -0.0035 -0.0016 0 0 0 0
Search/ColumnContains/32768/manual_time -0.0477 -0.0478 0 0 0 0
Search/ColumnContains/262144/manual_time -0.1013 -0.1018 1 1 1 1
Search/ColumnContains/2097152/manual_time -0.1116 -0.1129 6 6 6 6
Search/ColumnContains/16777216/manual_time -0.1391 -0.1390 51 44 51 44
Search/ColumnContains/67108864/manual_time -0.1507 -0.1527 205 175 210 178
CG = 1, data X10
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time -0.2539 -0.2480 0 0 0 0
Search/ColumnContains/4096/manual_time -0.1595 -0.1540 0 0 0 0
Search/ColumnContains/32768/manual_time -0.3552 -0.3495 1 0 1 0
Search/ColumnContains/262144/manual_time -0.6734 -0.6763 2 1 2 1
Search/ColumnContains/2097152/manual_time -0.6602 -0.6605 13 4 13 4
Search/ColumnContains/16777216/manual_time -0.6899 -0.6965 117 36 117 36
Search/ColumnContains/67108864/manual_time -0.7296 -0.7302 541 146 542 146
CG = 2, data X10
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time -0.1274 -0.1253 0 0 0 0
Search/ColumnContains/4096/manual_time -0.1826 -0.1785 0 0 0 0
Search/ColumnContains/32768/manual_time -0.2935 -0.2920 1 0 1 0
Search/ColumnContains/262144/manual_time -0.6109 -0.6108 2 1 2 1
Search/ColumnContains/2097152/manual_time -0.6942 -0.6946 13 4 13 4
Search/ColumnContains/16777216/manual_time -0.7344 -0.7349 120 32 120 32
Search/ColumnContains/67108864/manual_time -0.7647 -0.7655 542 127 543 127
So I'm going to use CG = 1 as the default.
Just to make sure: does time old correspond to the `static_map` implementation? I assume data X10 refers to multiplicity == 10, right?
Just to make sure: does time old correspond to the `static_map` implementation?

Yes.
I assume data X10 refers to multiplicity == 10, right?
X10 means the data was generated randomly and then duplicated 10 times.
@PointKernel this is another case where making progress on the unification of different cuco map types would be beneficial. If @ttnghia is getting the best performance with a CG size of 1, he'd probably get even better performance with a code path that doesn't include the CG boilerplate at all, but IIRC that only exists for static_map and not static_multimap right now.
this is another case where making progress on the unification of different cuco map types would be beneficial.
Time to kick off the grand refactoring then. atomic_ref is more or less ready so I think we are good to go.
he'd probably get even better performance with a code path that doesn't include the CG boilerplate at all
Agreed.
Time to kick off the grand refactoring then.
`atomic_ref` is more or less ready so I think we are good to go.
Because you mentioned atomic_ref, here is an important bug that we must fix IMO before we can use it: https://github.com/NVIDIA/libcudacxx/issues/279#issuecomment-1149229771
Time to kick off the grand refactoring then. `atomic_ref` is more or less ready so I think we are good to go.

Because you mentioned `atomic_ref`, here is an important bug that we must fix IMO before we can use it: NVIDIA/libcudacxx#279 (comment)
That's what my "more or less" refers to. :smile:
This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.
Alright, this PR is awakened. It now has a totally new implementation, which is only around 10 lines of code.
@gpucibot merge