
`strings::contains()` for multiple scalar search targets

Open mythrocks opened this issue 1 year ago • 11 comments

Description

This commit adds a new `strings::contains()` overload that searches for multiple scalar targets in a single call.

The key change is a new kernel that extends the "string-per-warp" approach to search for multiple keys within the same kernel.

This approach allows cudf to reduce the number of kernels launched for `strings::contains()` by up to a factor of N, if all N search keys can be specified in the same call. This helps reduce kernel-launch overhead for processes that make large numbers of calls to `strings::contains()`.

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

mythrocks avatar Apr 15 '24 20:04 mythrocks


Tagging @davidwendt.

mythrocks avatar Apr 15 '24 20:04 mythrocks

/ok to test

davidwendt avatar Apr 15 '24 21:04 davidwendt

I've placed this in draft for the moment. I should do some more testing from spark-rapids to make sure this fits.

mythrocks avatar Apr 15 '24 22:04 mythrocks

I recommend looking at the changes to find.cu here: https://github.com/rapidsai/cudf/pull/15495/files including the commented-out code. We could probably remove the fill call.

davidwendt avatar Apr 16 '24 11:04 davidwendt

I should have updated here sooner.

It's possible that I was mistaken regarding the improvement from this change. I can't seem to reproduce the perf gains from last week. From my benchmarks, searching for 5 strings in a 1M row search space:

  1. contains_warp_parallel: 2.2ms per call => 11ms for 5 strings.
  2. multi_contains_warp_parallel: 11.1ms for 5 strings.

It doesn't look like there's much to be gained here.

I'll keep testing a little, to see if I've missed something from last week. If there aren't any gains to be found, I'll close this PR.

mythrocks avatar Apr 18 '24 18:04 mythrocks

Hat tip to @tgujar for this one:

For the multi-string kernel, this part is interesting:

  auto const lane_idx   = idx % cudf::detail::warp_size;
  auto const str_idx    = (idx / cudf::detail::warp_size) % num_rows;
  auto const target_idx = (idx / cudf::detail::warp_size) / num_rows;

I was iterating such that the first num_rows rows are checked for the first target string, then the next num_rows for the second, and so on. The suggestion was to switch the order so that the first input string is checked against all t target strings before moving on to the second input string.

On a 1M input column with 5 search strings, calling the multi-contains kernel takes 9.4ms. Calling the single contains kernel 5 times sequentially takes 10.64ms. It looks like an 11% speedup.

mythrocks avatar Apr 23 '24 20:04 mythrocks

In my last update, I indicated some positive news regarding total kernel times: batching calls to `strings::contains()` for multiple targets reduces the total time spent in the kernel. For 100M rows, 10 targets:

  1. Single target kernel: 1.3s
  2. Multi-target kernel: 0.9s

However, the end-to-end query times did not improve. The cumulative time spent in `strings::contains()` was much higher with the multi-target kernel:

  1. Single target kernel: 498ms
  2. Multi-target kernel: 1843ms

I've removed the call to thrust::fill. Per advice from @nvdbaranec and @abellina, I've switched to using rmm::device_uvector instead of io::hostdevice_vector, and turned the pinned-pool allocator on. There have been improvements:

| Algo | Device datastructure | Pinned pool setting | Avg kernel time (ms) | Total kernel time (ms) | Total function time (ms) |
|---|---|---|---|---|---|
| Baseline (1 target/kernel x 10 targets) | - | 0g | 0.545 | 1091 | 498 |
| Baseline (1 target/kernel x 10 targets) | - | 8g | 0.565 | 1131 | 562.59 |
| Multi kernel (10 targets/kernel) | io::hostdev_vector | 0g | 4.61 | 922.05 | 1885 |
| Multi kernel (10 targets/kernel) | io::hostdev_vector | 8g | 4.69 | 938.167 | 252.191 |
| Multi kernel (10 targets/kernel) | rmm::device_vector | 0g | 4.604 | 920.875 | 389.096 |
| Multi kernel (10 targets/kernel) | rmm::device_vector | 8g | 4.7 | 939.957 | 231.817 |

The gains in total kernel time remain about the same (chart: string_contains kernel times).

But there is reasonable improvement in the time spent in `cudf::strings::contains` (chart: string_contains function times).

It looks like we're closer to a solution. It might warrant more testing with the user's actual data.

mythrocks avatar May 07 '24 23:05 mythrocks

From my experimentation, it appears that we might have taken this nearly as far as we can with this algo. There will be incremental improvements if we allocate the multiple targets in a contiguous allocation, and space out the indices for consecutive threads.

I will try to make those changes available in this PR shortly.

I've been chasing another lead with the approach detailed in this paper, which proposes implementing KMP or Aho-Corasick on CUDA. It's out of scope for this PR, but it's a promising option.

mythrocks avatar May 17 '24 18:05 mythrocks

I've moved this over to 24.08, rather than rush it into 24.06. There is a parallel effort underway to explore KMP or Aho-Corasick, to see whether that fits better.

mythrocks avatar May 28 '24 17:05 mythrocks

I've been unable to spend time on this.

I'm closing this for now. I'll try to hit this again later, with the modified KMP approach.

mythrocks avatar Jul 31 '24 00:07 mythrocks