cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Fully support nested types in `cudf::contains`

Open ttnghia opened this issue 3 years ago • 17 comments

This extends the cudf::contains API to support nested types (lists + structs) with arbitrarily nested levels. As such, cudf::contains will work with literally any type of input data.

In addition, this fixes null handling of cudf::contains with structs column + struct scalar input when the structs column contains null rows at the top level while the scalar key is valid but all nulls at children levels.

Closes: https://github.com/rapidsai/cudf/issues/8965 Depends on:

  • https://github.com/rapidsai/cudf/pull/10730
  • https://github.com/rapidsai/cudf/pull/10883
  • https://github.com/rapidsai/cudf/pull/10802
  • https://github.com/rapidsai/cudf/pull/10997
  • https://github.com/NVIDIA/cuCollections/issues/172
  • https://github.com/NVIDIA/cuCollections/issues/173
  • https://github.com/rapidsai/cudf/issues/11037
  • https://github.com/rapidsai/cudf/pull/11356

ttnghia avatar Apr 14 '22 04:04 ttnghia

Codecov Report

:exclamation: No coverage uploaded for pull request base (branch-22.10@abd4302). Click here to learn what that means. The diff coverage is n/a.

:exclamation: Current head 42d75b1 differs from pull request most recent head 28f115f. Consider uploading reports for the commit 28f115f to get more accurate results

@@               Coverage Diff               @@
##             branch-22.10   #10656   +/-   ##
===============================================
  Coverage                ?   86.36%           
===============================================
  Files                   ?      145           
  Lines                   ?    22949           
  Branches                ?        0           
===============================================
  Hits                    ?    19820           
  Misses                  ?     3129           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov[bot] avatar Apr 14 '22 05:04 codecov[bot]

Currently hold off due to dependency. However, a temporary implementation could be be pushed just for testing, and will be replaced by a new implementation before the PR is ready for review.

ttnghia avatar Apr 15 '22 16:04 ttnghia

Sorry for the misclicking. Just wanted to check some of the commits...

PointKernel avatar May 14 '22 00:05 PointKernel

Due to time constraint, we decided to retarget this into 22.08.

ttnghia avatar May 21 '22 02:05 ttnghia

Here is the benchmark regarding to https://github.com/rapidsai/cudf/pull/10656#discussion_r886218927, comparing two approach:

  1. Insert all haystack indices into the hash map: no row comparison
  2. Insert only indices of the unique rows of haystack: row equality comparisons using self_comparator

The benchmarks were on structs column of 8 children (4 int children + 4 string children) generated randomly.

From the results, we can observe that without (much) duplicates the first approach is faster as it doesn't do row comparisons. With much duplicates (>75%) the second approach is faster while the first approach becomes slower due to hash collision and linear probe.

Haystack column with almost no duplicates

Benchmark                                                    Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time                    +0.2861         +0.2749             0             1             0             1
Search/ColumnContains/4096/manual_time                    +0.2424         +0.2330             0             1             0             1
Search/ColumnContains/32768/manual_time                   +0.2863         +0.2782             0             1             1             1
Search/ColumnContains/262144/manual_time                  +0.3013         +0.3163             1             2             1             2
Search/ColumnContains/2097152/manual_time                 +0.3324         +0.3322             8            10             8            10
Search/ColumnContains/16777216/manual_time                +0.3485         +0.3445            61            82            61            82
Search/ColumnContains/67108864/manual_time                +0.3367         +0.3439           247           330           254           341

Haystack column repeated twice (>50% duplicates)

Benchmark                                                    Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time                    +0.1489         +0.1395             0             0             0             0
Search/ColumnContains/4096/manual_time                    +0.2634         +0.2560             0             1             0             1
Search/ColumnContains/32768/manual_time                   +0.0975         +0.0890             1             1             1             1
Search/ColumnContains/262144/manual_time                  +0.2431         +0.2401             1             1             1             1
Search/ColumnContains/2097152/manual_time                 +0.2031         +0.2029             7             9             7             9
Search/ColumnContains/16777216/manual_time                +0.1763         +0.1762            56            66            56            66
Search/ColumnContains/67108864/manual_time                +0.1098         +0.1095           243           269           243           270

Haystack column repeated 4 times (>75% duplicates)

Benchmark                                                    Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time                    +0.0650         +0.0569             0             0             0             0
Search/ColumnContains/4096/manual_time                    +0.1890         +0.1795             0             1             0             1
Search/ColumnContains/32768/manual_time                   +0.1281         +0.1229             1             1             1             1
Search/ColumnContains/262144/manual_time                  +0.0550         +0.0530             1             1             1             1
Search/ColumnContains/2097152/manual_time                 -0.0520         -0.0582             8             8             8             8
Search/ColumnContains/16777216/manual_time                -0.1328         -0.1329            69            60            69            60
Search/ColumnContains/67108864/manual_time                -0.1936         -0.1944           298           240           300           241

Code for benchmark

``` #include #include #include

#include <cudf/column/column_factories.hpp> #include <cudf/concatenate.hpp> #include <cudf/search.hpp> #include <cudf/types.hpp>

static constexpr cudf::size_type num_struct_members = 8; static constexpr cudf::size_type max_int = 100; static constexpr cudf::size_type max_str_length = 32;

static auto create_random_structs_column(cudf::size_type n_rows) { data_profile table_profile; table_profile.set_distribution_params(cudf::type_id::INT32, distribution_id::UNIFORM, 0, max_int); table_profile.set_distribution_params( cudf::type_id::STRING, distribution_id::NORMAL, 0, max_str_length);

// The first two struct members are int32 and string. // The first column is also used as keys in groupby. // The subsequent struct members are int32 and string again. auto table = create_random_table( cycle_dtypes({cudf::type_id::INT32, cudf::type_id::STRING}, num_struct_members), row_count{n_rows}, table_profile); return cudf::make_structs_column(n_rows, table->release(), 0, {}); }

void BM_contains(benchmark::State& state) { auto const size{static_castcudf::size_type(state.range(0))};

// constexpr cudf::size_type repeat_times = 4; // <25% unique rows // constexpr cudf::size_type repeat_times = 2; // <50% unique rows constexpr int repeat_times = 1; // <100% unique rows

auto const needles = create_random_structs_column(size);

auto haystack = create_random_structs_column(size / repeat_times); auto const haystack0 = std::make_uniquecudf::column(*haystack);

for (int i = 0; i < repeat_times - 1; ++i) { haystack = cudf::concatenate(std::vectorcudf::column_view{haystack0->view(), haystack->view()}); }

for ([[maybe_unused]] auto _ : state) { [[maybe_unused]] auto const timer = cuda_event_timer(state, true); auto const result = cudf::contains(*haystack, *needles); } }

class Search : public cudf::benchmark { };

BENCHMARK_DEFINE_F(Search, ColumnContains)(::benchmark::State& state) { BM_contains(state); }

BENCHMARK_REGISTER_F(Search, ColumnContains) ->RangeMultiplier(8) ->Ranges({{1 << 10, 1 << 26}}) ->UseManualTime() ->Unit(benchmark::kMillisecond);

</details>

ttnghia avatar Jun 02 '22 16:06 ttnghia

contains is just another word for semi_join. They should share the same implementation. Can't believe I didn't realize this sooner.

I have filed a FEA issue: https://github.com/rapidsai/cudf/issues/11037

I actually can do that right now in parallel, as the current (semi-, anti-) joins don't need nested types. The joins just simply call cudf::detail::contains internally.

ttnghia avatar Jun 03 '22 05:06 ttnghia

@jrhemstad @PointKernel FYI: I ran benchmarks to compare the performance of using static_map vs static_multimap + pair_contains.


Data is generated as structs columns having 2 int columns, each row has value at max n_rows / 100.

Benchmark with data generated as described above:

Benchmark                                                    Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time                    -0.1596         -0.1455             0             0             0             0
Search/ColumnContains/4096/manual_time                    -0.1766         -0.1616             0             0             0             0
Search/ColumnContains/32768/manual_time                   +0.0084         +0.0099             0             0             0             0
Search/ColumnContains/262144/manual_time                  +0.2606         +0.2500             0             1             0             1
Search/ColumnContains/2097152/manual_time                 +0.4415         +0.4375             3             5             3             5
Search/ColumnContains/16777216/manual_time                +0.2069         +0.2118            35            43            35            43
Search/ColumnContains/67108864/manual_time                -0.0245         -0.0197           235           229           234           229

Benchmark with the search space duplicate X2

Benchmark                                                    Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time                    -0.3539         -0.3286             0             0             0             0
Search/ColumnContains/4096/manual_time                    -0.2558         -0.2365             0             0             0             0
Search/ColumnContains/32768/manual_time                   -0.2951         -0.2782             0             0             0             0
Search/ColumnContains/262144/manual_time                  -0.3944         -0.3930             1             0             1             0
Search/ColumnContains/2097152/manual_time                 -0.4951         -0.5014             6             3             6             3
Search/ColumnContains/16777216/manual_time                -0.7331         -0.7345            96            26            96            26
Search/ColumnContains/67108864/manual_time                -0.8849         -0.8855          1049           121          1049           120

Benchmark with the search space duplicate X4

Benchmark                                                    Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time                    -0.4520         -0.4227             0             0             0             0
Search/ColumnContains/4096/manual_time                    -0.5056         -0.4823             0             0             0             0
Search/ColumnContains/32768/manual_time                   -0.5579         -0.5399             0             0             0             0
Search/ColumnContains/262144/manual_time                  -0.6034         -0.6015             1             0             1             0
Search/ColumnContains/2097152/manual_time                 -0.6276         -0.6261             9             3             9             3
Search/ColumnContains/16777216/manual_time                -0.7760         -0.7759           120            27           120            27
Search/ColumnContains/67108864/manual_time                -0.8998         -0.8998          1226           123          1227           123

Data is generated as structs column of 8 children (4 int columns in range of [0, n_rows) and 4 string columns with string max size is 10).

Benchmark with data generated as described above:

Benchmark                                                    Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time                    -0.1280         -0.1232             0             0             0             0
Search/ColumnContains/4096/manual_time                    -0.0265         -0.0221             0             0             0             0
Search/ColumnContains/32768/manual_time                   +0.1367         +0.1371             0             1             1             1
Search/ColumnContains/262144/manual_time                  +0.6938         +0.6712             1             2             1             2
Search/ColumnContains/2097152/manual_time                 +1.0182         +1.0424             7            13             7            13
Search/ColumnContains/16777216/manual_time                +1.0218         +1.0189            52           105            52           105
Search/ColumnContains/67108864/manual_time                +0.9987         +0.9910           211           421           215           428

Benchmark with the search space duplicate X2

Benchmark                                                    Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time                    -0.0370         -0.0370             0             0             0             0
Search/ColumnContains/4096/manual_time                    -0.0200         -0.0201             0             0             0             0
Search/ColumnContains/32768/manual_time                   -0.0199         -0.0224             0             0             1             0
Search/ColumnContains/262144/manual_time                  +0.1200         +0.1057             1             1             1             1
Search/ColumnContains/2097152/manual_time                 +0.1476         +0.1416             6             7             6             7
Search/ColumnContains/16777216/manual_time                +0.0639         +0.0631            51            54            51            54
Search/ColumnContains/67108864/manual_time                -0.0174         -0.0172           223           220           224           220

Benchmark with the search space duplicate X4

Benchmark                                                    Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time                    -0.0972         -0.0940             0             0             0             0
Search/ColumnContains/4096/manual_time                    -0.0769         -0.0746             0             0             0             0
Search/ColumnContains/32768/manual_time                   -0.0733         -0.0731             1             0             1             0
Search/ColumnContains/262144/manual_time                  -0.0386         -0.0479             1             1             1             1
Search/ColumnContains/2097152/manual_time                 -0.0711         -0.0770             8             7             8             7
Search/ColumnContains/16777216/manual_time                -0.1510         -0.1463            68            58            68            58
Search/ColumnContains/67108864/manual_time                -0.1886         -0.1847           303           246           302           246

Benchmark with the search space duplicate X10

Benchmark                                                    Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time                    -0.1900         -0.1855             0             0             0             0
Search/ColumnContains/4096/manual_time                    -0.1939         -0.1901             0             0             0             0
Search/ColumnContains/32768/manual_time                   -0.1654         -0.1640             1             1             1             1
Search/ColumnContains/262144/manual_time                  -0.2850         -0.2949             2             1             2             1
Search/ColumnContains/2097152/manual_time                 -0.4071         -0.4084            13             8            13             8
Search/ColumnContains/16777216/manual_time                -0.4767         -0.4769           122            64           122            64
Search/ColumnContains/67108864/manual_time                -0.5598         -0.5601           557           245           558           245

So if the search space has no/little duplicates, using static_multimap + pair_contains is significantly slower? On the other hand, if we have a lot of duplicates then it is much faster.

ttnghia avatar Jun 13 '22 04:06 ttnghia

if the search space has no/little duplicates, using static_multimap + pair_contains is significantly slower

This is expected since multimap is optimized to handle duplicates. i.e. if few duplicates are present, the fact that we are using CG to improve multimap efficiency could actually become a performance burden. I noticed you are using the default probing scheme with CG size = 8. You may want to tune the CG size (e.g. 2 or 4) a bit to find a better fit for contains. A similar use case in hash join for your reference: https://github.com/rapidsai/cudf/blob/0ddb3d9319426da49d8cb4b9cbb95819dc9b5263/cpp/include/cudf/detail/join.hpp#L62

PointKernel avatar Jun 13 '22 12:06 PointKernel

Here is the second benchmark above, run again, with CG=1 and CG=2.

Data is generated as structs column of 8 children (4 int columns in range of [0, n_rows) and 4 string columns with string max size is 10).

CG = 1, no duplicate data

Benchmark                                                    Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time                    -0.1104         -0.1073             0             0             0             0
Search/ColumnContains/4096/manual_time                    -0.0113         -0.0140             0             0             0             0
Search/ColumnContains/32768/manual_time                   -0.0900         -0.0950             0             0             1             0
Search/ColumnContains/262144/manual_time                  -0.2322         -0.2329             1             1             1             1
Search/ColumnContains/2097152/manual_time                 -0.2892         -0.2974             7             5             7             5
Search/ColumnContains/16777216/manual_time                -0.2826         -0.2822            51            36            51            36
Search/ColumnContains/67108864/manual_time                -0.2918         -0.2938           205           145           209           148

CG = 2, no duplicate data

Benchmark                                                    Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time                    +0.0004         +0.0015             0             0             0             0
Search/ColumnContains/4096/manual_time                    -0.0035         -0.0016             0             0             0             0
Search/ColumnContains/32768/manual_time                   -0.0477         -0.0478             0             0             0             0
Search/ColumnContains/262144/manual_time                  -0.1013         -0.1018             1             1             1             1
Search/ColumnContains/2097152/manual_time                 -0.1116         -0.1129             6             6             6             6
Search/ColumnContains/16777216/manual_time                -0.1391         -0.1390            51            44            51            44
Search/ColumnContains/67108864/manual_time                -0.1507         -0.1527           205           175           210           178

CG = 1, data X10

Benchmark                                                    Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time                    -0.2539         -0.2480             0             0             0             0
Search/ColumnContains/4096/manual_time                    -0.1595         -0.1540             0             0             0             0
Search/ColumnContains/32768/manual_time                   -0.3552         -0.3495             1             0             1             0
Search/ColumnContains/262144/manual_time                  -0.6734         -0.6763             2             1             2             1
Search/ColumnContains/2097152/manual_time                 -0.6602         -0.6605            13             4            13             4
Search/ColumnContains/16777216/manual_time                -0.6899         -0.6965           117            36           117            36
Search/ColumnContains/67108864/manual_time                -0.7296         -0.7302           541           146           542           146

CG = 2, data X10

Benchmark                                                    Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains/1024/manual_time                    -0.1274         -0.1253             0             0             0             0
Search/ColumnContains/4096/manual_time                    -0.1826         -0.1785             0             0             0             0
Search/ColumnContains/32768/manual_time                   -0.2935         -0.2920             1             0             1             0
Search/ColumnContains/262144/manual_time                  -0.6109         -0.6108             2             1             2             1
Search/ColumnContains/2097152/manual_time                 -0.6942         -0.6946            13             4            13             4
Search/ColumnContains/16777216/manual_time                -0.7344         -0.7349           120            32           120            32
Search/ColumnContains/67108864/manual_time                -0.7647         -0.7655           542           127           543           127

So I'm going to use CG = 1 as default.

ttnghia avatar Jun 13 '22 14:06 ttnghia

Just to make sure: does time old correspond to the static_map implementation? I assume data X10 refers to multiplicity == 10, right?

PointKernel avatar Jun 13 '22 14:06 PointKernel

Just to make sure: does time old correspond to the static_map implementation?

Yes.

I assume data X10 refers to multiplicity == 10, right?

X10 means data was generated randomly then duplicated 10 times.

ttnghia avatar Jun 13 '22 15:06 ttnghia

@PointKernel this is another case where making progress on the unification of different cuco map types would be beneficial. If @ttnghia is getting the best performance with a CG size of 1, he'd probably get even better performance with a code path that doesn't include the CG boilerplate at all, but IIRC that only exists for static_map and not static_multimap right now.

vyasr avatar Jun 14 '22 19:06 vyasr

this is another case where making progress on the unification of different cuco map types would be beneficial.

Time to kick off the grand refactoring then. atomic_ref is more or less ready so I think we are good to go.

he'd probably get even better performance with a code path that doesn't include the CG boilerplate at all

Agreed.

PointKernel avatar Jun 14 '22 19:06 PointKernel

Time to kick off the grand refactoring then. atomic_ref is more or less ready so I think we are good to go.

Because you mentioned atomic_ref, here is an important bug that we must fix IMO before we can use it: https://github.com/NVIDIA/libcudacxx/issues/279#issuecomment-1149229771

ttnghia avatar Jun 14 '22 19:06 ttnghia

Time to kick off the grand refactoring then. atomic_ref is more or less ready so I think we are good to go.

Because you mentioned atomic_ref, here is an important bug that we must fix IMO before we can use it: NVIDIA/libcudacxx#279 (comment)

That's what my "more or less" refers to. :smile:

PointKernel avatar Jun 14 '22 19:06 PointKernel

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Jul 14 '22 20:07 github-actions[bot]

Alright, this PR is awakened. Now it has a totally new implementation, which is just around 10 LOC.

ttnghia avatar Jul 29 '22 20:07 ttnghia

@gpucibot merge

ttnghia avatar Aug 17 '22 21:08 ttnghia