`strings::contains()` for multiple scalar search targets
Description
This commit adds a new strings::contains() overload that searches for multiple scalar targets in a single call.
The key change is a new kernel that extends the existing "string-per-warp" approach to check multiple search keys in the same launch.
If all N search keys can be specified in one call, this can reduce the number of kernel launches for strings::contains() by a factor of N, cutting kernel-launch overhead for processes that make large numbers of strings::contains() calls.
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Tagging @davidwendt.
/ok to test
I've placed this in draft for the moment. I should do more testing from spark-rapids to make sure this fits.
I recommend looking at the changes to find.cu here: https://github.com/rapidsai/cudf/pull/15495/files including the commented-out code. We could probably remove the fill call.
I should have updated here sooner.
It's possible that I was mistaken about the improvement from this change; I can't seem to reproduce the perf gains from last week. From my benchmarks, searching for 5 strings in a 1M-row search space:
- contains_warp_parallel: 2.2ms per call => 11ms for 5 strings.
- multi_contains_warp_parallel: 11.1ms for 5 strings.

It doesn't look like there's much to be gained here.
I'll keep testing a little, to see if I missed something from last week. If there aren't any gains to be found, I'll close this PR.
Hat tip to @tgujar for this one:
For the multi-string kernel, this part is interesting:
```cpp
auto const lane_idx   = idx % cudf::detail::warp_size;
auto const str_idx    = (idx / cudf::detail::warp_size) % num_rows;
auto const target_idx = (idx / cudf::detail::warp_size) / num_rows;
```
I was iterating such that the first num_rows warps check the first target string, the next num_rows check the second, and so on.
The suggestion was to switch the order so that each input string is checked against all t target strings before moving to the next input string.
On a 1M-row input column with 5 search strings, the multi-contains kernel takes 9.4ms, while calling the single-target contains kernel 5 times sequentially takes 10.64ms: roughly an 11% speedup.
In my last update, I reported some positive news regarding total kernel times: batching calls to strings::contains() for multiple targets reduces the total time spent in the kernel. For 100M rows and 10 targets:
- Single-target kernel: 1.3s
- Multi-target kernel: 0.9s

However, the end-to-end query times did not improve: the cumulative time spent in the strings::contains() function itself was much higher with the multi-target kernel:
- Single-target kernel: 498ms
- Multi-target kernel: 1843ms
I've removed the call to thrust::fill. Per advice from @nvdbaranec and @abellina, I've switched from io::hostdevice_vector to rmm::device_uvector and turned the pinned-pool allocator on. There have been improvements:
| Algo | Device data structure | Pinned pool setting | Avg kernel time (ms) | Total kernel time (ms) | Total function time (ms) |
|---|---|---|---|---|---|
| Baseline (1 target/kernel x 10 targets) | - | 0g | 0.545 | 1091 | 498 |
| Baseline (1 target/kernel x 10 targets) | - | 8g | 0.565 | 1131 | 562.59 |
| Multi kernel (10 targets/kernel) | io::hostdevice_vector | 0g | 4.61 | 922.05 | 1885 |
| Multi kernel (10 targets/kernel) | io::hostdevice_vector | 8g | 4.69 | 938.167 | 252.191 |
| Multi kernel (10 targets/kernel) | rmm::device_uvector | 0g | 4.604 | 920.875 | 389.096 |
| Multi kernel (10 targets/kernel) | rmm::device_uvector | 8g | 4.7 | 939.957 | 231.817 |
The gains in total kernel time remain about the same, but there is a reasonable improvement in the time spent in cudf::strings::contains().
It looks like we're closer to the solution. It might warrant more testing with the user's actual data.
From my experimentation, it appears we've taken this nearly as far as we can with this algorithm. There may be incremental improvements from placing the multiple targets in a single contiguous allocation and spacing out the indices for consecutive threads.
I will try to make those changes available in this PR shortly.
I've been chasing another lead with the approach detailed in this paper, which proposes implementing KMP or Aho-Corasick on CUDA. It's out of scope for this PR, but it's a promising option.
I've moved this over to 24.08 rather than rush it into 24.06. There is a parallel effort to explore KMP or Aho-Corasick, to see if that fits better.
I've been unable to spend time on this.
I'm closing this for now. I'll try to hit this again later, with the modified KMP approach.