`strings::contains()` for multiple scalar search targets
Description
This commit adds a new strings::contains() overload that searches for multiple scalar targets in a single call.
The key change is a new kernel that extends the existing "string-per-warp" approach to check multiple search keys in the same launch.
If all N search keys can be specified in one call, this can reduce the number of kernel launches for strings::contains() by a factor of N, cutting kernel-launch overhead for processes that make large numbers of strings::contains() calls.
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Tagging @davidwendt.
/ok to test
I've placed this in draft for the moment. I should do more testing from spark-rapids to make sure this fits.
I recommend looking at the changes to find.cu here: https://github.com/rapidsai/cudf/pull/15495/files including the commented-out code. We could probably remove the fill call.
I should have updated here sooner.
It's possible that I was mistaken about the improvement from this change; I can't seem to reproduce the perf gains from last week. From my benchmarks, searching for 5 strings in a 1M-row search space:
- contains_warp_parallel: 2.2ms per call => 11ms for 5 strings.
- multi_contains_warp_parallel: 11.1ms for 5 strings.

It doesn't look like there's much to be gained here.
I'll keep testing a little, to see if I missed something from last week. If there aren't any gains to be found, I'll close this PR.
Hat tip to @tgujar for this one:
For the multi-string kernel, this part is interesting:
```cpp
auto const lane_idx   = idx % cudf::detail::warp_size;
auto const str_idx    = (idx / cudf::detail::warp_size) % num_rows;
auto const target_idx = (idx / cudf::detail::warp_size) / num_rows;
```
I was iterating such that the first num_rows warps check the first target string, the next num_rows check the second, and so on.
The suggestion was to switch the order so that each input string is checked against all t target strings before moving to the next input string.
On a 1M-row input column with 5 search strings, the multi-contains kernel takes 9.4ms, while calling the single-target contains kernel 5 times sequentially takes 10.64ms: roughly an 11% speedup.
In my last update, I reported some positive news regarding total kernel times: batching calls to strings::contains() for multiple targets reduces the total time spent in the kernel. For 100M rows and 10 targets:
- Single-target kernel: 1.3s
- Multi-target kernel: 0.9s

However, the end-to-end query times did not improve: the cumulative time spent in the strings::contains() function itself was much higher with the multi-target kernel:
- Single-target kernel: 498ms
- Multi-target kernel: 1843ms
I've removed the call to thrust::fill. Per advice from @nvdbaranec and @abellina, I've switched from io::hostdevice_vector to rmm::device_uvector and turned the pinned-pool allocator on. There have been improvements:
| Algo | Device data structure | Pinned pool setting | Avg kernel time (ms) | Total kernel time (ms) | Total function time (ms) |
|---|---|---|---|---|---|
| Baseline (1 target/kernel x 10 targets) | - | 0g | 0.545 | 1091 | 498 |
| Baseline (1 target/kernel x 10 targets) | - | 8g | 0.565 | 1131 | 562.59 |
| Multi kernel (10 targets/kernel) | io::hostdevice_vector | 0g | 4.61 | 922.05 | 1885 |
| Multi kernel (10 targets/kernel) | io::hostdevice_vector | 8g | 4.69 | 938.167 | 252.191 |
| Multi kernel (10 targets/kernel) | rmm::device_uvector | 0g | 4.604 | 920.875 | 389.096 |
| Multi kernel (10 targets/kernel) | rmm::device_uvector | 8g | 4.7 | 939.957 | 231.817 |
The gains in total kernel time remain about the same, but there is a reasonable improvement in the time spent in cudf::strings::contains().
It looks like we're closer to the solution. It might warrant more testing with the user's actual data.
From my experimentation, it appears we've taken this nearly as far as we can with this algorithm. There may be incremental improvements from placing the multiple targets in a single contiguous allocation and spacing out the indices for consecutive threads.
I will try to make those changes available in this PR shortly.
I've been chasing another lead with the approach detailed in this paper, which proposes implementing KMP or Aho-Corasick on CUDA. It's out of scope for this PR, but it's a promising option.
I've moved this over to 24.08 rather than rush it into 24.06. There is a parallel effort to explore KMP or Aho-Corasick, to see if that fits better.
I've been unable to spend time on this.
I'm closing this for now. I'll try to hit this again later, with the modified KMP approach.