strings::contains() for multiple scalar search targets
Description
This PR is based on https://github.com/rapidsai/cudf/pull/15536/ and adds three optimizations:
- For short strings, handle all targets for a string in one thread to improve memory access. For each byte index in the string, search each target sequentially.
for (auto str_byte_idx = 0; str_byte_idx < d_str.size_bytes(); ++str_byte_idx) { // iterate the start index in the string
for (auto target_idx = 0; target_idx < num_targets; ++target_idx) { // iterate the targets
- For long strings, use the warp-parallel approach, but handle multiple targets per warp instead of one target per warp. This also aims to improve memory access.
for (size_t target_idx = 0; target_idx < num_targets; target_idx++) {
for (auto i = lane_idx; ...... ; i += cudf::detail::warp_size) {
- Index the first chars of the targets. This makes searching short strings (<=64) very fast.
/**
* Execute multi-contains for short strings.
* First, index the first char of every target:
* collect the first char of each target, then sort and deduplicate them,
* then map each first char to the targets that start with it.
* e.g.:
* targets: xa xb ac ad af
* first char set is: (a, x)
* index result is:
* {
* a: [2, 3, 4], // target indexes of: ac ad af
* x: [0, 1] // target indexes of: xa xb
* }
* When searching:
* binary search the `first char set` for each char in the string:
* if the char is not in ['a', 'x'], skip it quickly
* if the char is 'x', only the targets ["xa", "xb"] need to be tried
* if the char is 'a', only the targets ["ac", "ad", "af"] need to be tried
*
*/
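As a rough illustration of the indexing step described in the comment above, here is a minimal host-side C++ sketch. It is not the PR's device code: `first_char_index` and `build_first_char_index` are hypothetical names, and the nested `std::vector` layout is an assumption chosen for readability rather than for GPU memory access.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical host-side container (illustrative only, not from the PR).
struct first_char_index {
  std::vector<char> first_chars;                  // sorted, unique first chars
  std::vector<std::vector<std::size_t>> targets;  // targets[i]: indexes of the targets
                                                  // that start with first_chars[i]
};

first_char_index build_first_char_index(std::vector<std::string> const& targets)
{
  first_char_index index;

  // Collect the first char of every non-empty target, then sort and deduplicate.
  for (auto const& t : targets) {
    if (!t.empty()) { index.first_chars.push_back(t.front()); }
  }
  std::sort(index.first_chars.begin(), index.first_chars.end());
  index.first_chars.erase(std::unique(index.first_chars.begin(), index.first_chars.end()),
                          index.first_chars.end());

  // Group the target indexes under their first char.
  index.targets.resize(index.first_chars.size());
  for (std::size_t i = 0; i < targets.size(); ++i) {
    if (targets[i].empty()) { continue; }
    auto const it =
      std::lower_bound(index.first_chars.begin(), index.first_chars.end(), targets[i].front());
    // Every first char was inserted above, so the lookup must succeed.
    assert(it != index.first_chars.end() && *it == targets[i].front());
    index.targets[static_cast<std::size_t>(it - index.first_chars.begin())].push_back(i);
  }
  return index;
}
```

For the targets `xa xb ac ad af`, this sketch yields the first char set `(a, x)` with `a` mapped to `[2, 3, 4]` and `x` mapped to `[0, 1]`, matching the example in the comment.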
Previously, checking the first char at each string position against n targets took n comparisons.
After this change, a binary search over the sorted first-char set takes at most about log(n) comparisons.
Original:
for c in string:
for target in targets:
// compare the first char
...
// compare the 2nd ~ end char.
Now:
for c in string:
// compare the first char by binary search
int[] first_char_matched_targets = binary_search(first_char_set_in_targets)
for (target in first_char_matched_targets) {
// compare the 2nd ~ end char.
}
...
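The "Now" loop above can be sketched on the host in C++17 as follows. This is only an illustration under stated assumptions: `multi_contains` and `by_first_char` are hypothetical names, the targets are assumed to be non-empty, and the index is represented as target indexes sorted by first char (so `std::equal_range` performs the binary search) rather than the layout the PR's kernel uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <string_view>
#include <vector>

// Heterogeneous comparator: orders a target index against a char by the
// target's first char. Hypothetical helper for this sketch.
struct by_first_char {
  std::vector<std::string> const& targets;
  bool operator()(std::size_t idx, char c) const { return targets[idx].front() < c; }
  bool operator()(char c, std::size_t idx) const { return c < targets[idx].front(); }
};

// Returns one flag per target: true if `str` contains that target.
std::vector<bool> multi_contains(std::string_view str, std::vector<std::string> const& targets)
{
  // Assumption of this sketch: no target is empty.
  assert(std::none_of(targets.begin(), targets.end(), [](auto const& t) { return t.empty(); }));

  // Sort target indexes by first char so targets sharing a first char are adjacent.
  std::vector<std::size_t> order(targets.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
    return targets[a].front() < targets[b].front();
  });

  std::vector<bool> found(targets.size(), false);
  for (std::size_t pos = 0; pos < str.size(); ++pos) {
    // Binary search: the range of targets whose first char equals str[pos].
    auto const [first, last] =
      std::equal_range(order.begin(), order.end(), str[pos], by_first_char{targets});
    // Only these candidates need a full comparison; other chars are skipped.
    for (auto it = first; it != last; ++it) {
      auto const& t = targets[*it];
      if (!found[*it] && str.compare(pos, t.size(), t) == 0) { found[*it] = true; }
    }
  }
  return found;
}
```

With the targets `xa xb ac ad af`, the string `zzadxx` only triggers full comparisons at the `a` and `x` positions, and only `ad` is reported as found.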
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
I got the following benchmark results:
| api | row_width | num_rows | hit_rate | chars_size | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil |
|--------|-----------|----------|----------|------------|---------|-----------|--------|-----------|-------|--------|--------------|--------|
| origin | 32 | 1953000 | 20 | 51065308 | 832x | 11.784 ms | 10.02% | 11.777 ms | 9.95% | 4.336G | 4.336 GB/s | 0.65% |
| new | 32 | 1953000 | 20 | 51065308 | 1328x | 10.530 ms | 9.27% | 10.503 ms | 8.77% | 4.862G | 4.862 GB/s | 0.72% |
| origin | 64 | 1953000 | 20 | 102130616 | 384x | 39.089 ms | 2.48% | 39.084 ms | 2.48% | 2.613G | 2.613 GB/s | 0.39% |
| new | 64 | 1953000 | 20 | 102130616 | 579x | 25.887 ms | 8.45% | 25.851 ms | 8.25% | 3.951G | 3.951 GB/s | 0.59% |
| origin | 128 | 1953000 | 20 | 204261232 | 383x | 39.168 ms | 8.19% | 39.143 ms | 8.17% | 5.218G | 5.218 GB/s | 0.78% |
| new | 128 | 1953000 | 20 | 204261232 | 483x | 31.051 ms | 7.44% | 31.020 ms | 7.27% | 6.585G | 6.585 GB/s | 0.98% |
| origin | 32 | 1953000 | 80 | 59640595 | 624x | 14.563 ms | 9.22% | 14.556 ms | 9.14% | 4.097G | 4.097 GB/s | 0.61% |
| new | 32 | 1953000 | 80 | 59640595 | 640x | 11.646 ms | 7.77% | 11.614 ms | 7.14% | 5.135G | 5.135 GB/s | 0.76% |
| origin | 64 | 1953000 | 80 | 119281190 | 378x | 39.727 ms | 4.76% | 39.710 ms | 4.72% | 3.004G | 3.004 GB/s | 0.45% |
| new | 64 | 1953000 | 80 | 119281190 | 730x | 20.524 ms | 5.80% | 20.505 ms | 5.63% | 5.817G | 5.817 GB/s | 0.87% |
| origin | 128 | 1953000 | 80 | 238562380 | 375x | 40.057 ms | 5.65% | 40.050 ms | 5.65% | 5.957G | 5.957 GB/s | 0.89% |
| new | 128 | 1953000 | 80 | 238562380 | 449x | 33.396 ms | 5.87% | 33.367 ms | 5.81% | 7.150G | 7.150 GB/s | 1.06% |
origin: calls the single-target contains once per target.
new: a single call that handles all targets.
We get about a 1.1x to 1.9x speedup in GPU time.
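For reference, the speedup figures can be derived from the GPU Time column by dividing the `origin` time by the `new` time. The helper names below are hypothetical and the snippet is only a sketch of that arithmetic, using two rows copied from the table above.

```cpp
#include <cassert>
#include <cstdio>

// Speedup of `new` over `origin`, given two timings in the same unit.
double speedup(double origin_time, double new_time)
{
  assert(origin_time > 0.0 && new_time > 0.0);
  return origin_time / new_time;
}

// Prints the speedup for the rows with the smallest and largest gain.
void print_speedup_range()
{
  // GPU Time values (ms) copied from the table:
  // row_width 32 / hit_rate 20, and row_width 64 / hit_rate 80.
  std::printf("smallest gain: %.2fx\n", speedup(11.777, 10.503));
  std::printf("largest gain:  %.2fx\n", speedup(39.710, 20.505));
}
```

These two rows give roughly 1.12x and 1.94x; the other four row pairs in the table fall between them.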
/ok to test
I'd better cc @davidwendt, whom I consulted as part of #15536. I'd feel a little better about 👍-ing my own work if he reviewed as well. :]
Thank you for taking this forward, @res-life.
Will the comments from #15536 be addressed here? For example: https://github.com/rapidsai/cudf/pull/15536#discussion_r1566415863 which is also referenced here https://github.com/rapidsai/cudf/pull/15536#discussion_r1593959958
20 targets:
| api | row_width | num_rows | hit_rate | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|----------|-------------|------------|------------|------------|-------------|------------|-------------|----------------|---------|----------|
| contains | 32 | 260000 | 20 | 1.985 ms | 11.18% | 1.468 ms | 4.98% | -516.649 us | -26.03% | FAIL |
| contains | 64 | 260000 | 20 | 5.034 ms | 9.42% | 3.428 ms | 7.84% | -1605.743 us | -31.90% | FAIL |
| contains | 128 | 260000 | 20 | 5.279 ms | 5.30% | 3.341 ms | 6.38% | -1938.129 us | -36.71% | FAIL |
| contains | 256 | 260000 | 20 | 7.924 ms | 39.80% | 5.066 ms | 8.14% | -2858.295 us | -36.07% | FAIL |
| contains | 512 | 260000 | 20 | 12.412 ms | 4.77% | 10.131 ms | 28.95% | -2281.325 us | -18.38% | FAIL |
| contains | 1024 | 260000 | 20 | 21.000 ms | 2.64% | 15.017 ms | 5.66% | -5983.068 us | -28.49% | FAIL |
| contains | 32 | 1953000 | 20 | 11.008 ms | 3.06% | 11.388 ms | 4.74% | 380.332 us | 3.46% | FAIL |
| contains | 64 | 1953000 | 20 | 32.544 ms | 2.02% | 25.464 ms | 2.68% | -7080.123 us | -21.76% | FAIL |
| contains | 128 | 1953000 | 20 | 37.533 ms | 1.96% | 25.860 ms | 2.63% | -11673.166 us | -31.10% | FAIL |
| contains | 256 | 1953000 | 20 | 55.251 ms | 1.26% | 38.541 ms | 1.12% | -16710.174 us | -30.24% | FAIL |
| contains | 512 | 1953000 | 20 | 89.551 ms | 1.25% | 64.981 ms | 0.36% | -24569.331 us | -27.44% | FAIL |
| contains | 1024 | 1953000 | 20 | 156.897 ms | 0.48% | 113.479 ms | 0.28% | -43417.619 us | -27.67% | FAIL |
| contains | 32 | 16777216 | 20 | 92.424 ms | 0.82% | 95.961 ms | 0.35% | 3.537 ms | 3.83% | FAIL |
| contains | 64 | 16777216 | 20 | 281.194 ms | 0.49% | 220.198 ms | 2.10% | -60995.860 us | -21.69% | FAIL |
| contains | 32 | 260000 | 80 | 2.036 ms | 11.27% | 1.155 ms | 3.22% | -880.620 us | -43.26% | FAIL |
| contains | 64 | 260000 | 80 | 4.863 ms | 6.16% | 2.219 ms | 5.78% | -2643.471 us | -54.36% | FAIL |
| contains | 128 | 260000 | 80 | 5.573 ms | 9.79% | 3.674 ms | 4.58% | -1899.006 us | -34.08% | FAIL |
| contains | 256 | 260000 | 80 | 7.964 ms | 29.06% | 5.371 ms | 4.58% | -2593.164 us | -32.56% | FAIL |
| contains | 512 | 260000 | 80 | 12.110 ms | 3.45% | 8.800 ms | 4.13% | -3310.160 us | -27.33% | FAIL |
| contains | 1024 | 260000 | 80 | 21.181 ms | 7.49% | 15.685 ms | 4.33% | -5496.013 us | -25.95% | FAIL |
| contains | 32 | 1953000 | 80 | 10.699 ms | 7.82% | 8.074 ms | 3.91% | -2625.455 us | -24.54% | FAIL |
| contains | 64 | 1953000 | 80 | 30.158 ms | 3.32% | 15.044 ms | 4.32% | -15114.107 us | -50.12% | FAIL |
| contains | 128 | 1953000 | 80 | 38.119 ms | 1.71% | 27.495 ms | 1.25% | -10623.225 us | -27.87% | FAIL |
| contains | 256 | 1953000 | 80 | 55.647 ms | 2.55% | 40.471 ms | 5.45% | -15176.148 us | -27.27% | FAIL |
| contains | 512 | 1953000 | 80 | 88.548 ms | 1.68% | 65.727 ms | 1.51% | -22820.777 us | -25.77% | FAIL |
| contains | 1024 | 1953000 | 80 | 154.357 ms | 1.06% | 116.059 ms | 0.49% | -38297.609 us | -24.81% | FAIL |
| contains | 32 | 16777216 | 80 | 85.989 ms | 2.62% | 69.670 ms | 3.14% | -16319.696 us | -18.98% | FAIL |
| contains | 64 | 16777216 | 80 | 257.486 ms | 2.05% | 126.210 ms | 1.15% | -131276.699 us | -50.98% | FAIL |
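With row_width 32 and hit_rate 20, there is no improvement at 1953000 and 16777216 rows (about 3.5% to 3.8% slower); all other configurations are faster.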
10 targets, where the first chars of the targets are all different:
compared to no combination:
| api | row_width | num_rows | hit_rate | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|----------|-------------|------------|------------|------------|-------------|------------|-------------|----------------|---------|----------|
| contains | 32 | 260000 | 20 | 1.244 ms | 63.37% | 294.277 us | 75.53% | -950.209 us | -76.35% | FAIL |
| contains | 64 | 260000 | 20 | 3.193 ms | 35.95% | 559.460 us | 60.60% | -2633.538 us | -82.48% | FAIL |
| contains | 128 | 260000 | 20 | 2.819 ms | 36.12% | 1.825 ms | 21.90% | -993.638 us | -35.25% | FAIL |
| contains | 256 | 260000 | 20 | 3.915 ms | 32.72% | 3.005 ms | 20.61% | -909.553 us | -23.23% | FAIL |
| contains | 512 | 260000 | 20 | 6.185 ms | 22.04% | 5.163 ms | 18.13% | -1022.705 us | -16.53% | PASS |
| contains | 1024 | 260000 | 20 | 11.011 ms | 17.22% | 9.578 ms | 15.96% | -1432.585 us | -13.01% | PASS |
| contains | 32 | 1953000 | 20 | 6.046 ms | 21.89% | 1.636 ms | 25.88% | -4410.140 us | -72.94% | FAIL |
| contains | 64 | 1953000 | 20 | 19.721 ms | 16.33% | 3.545 ms | 19.97% | -16176.496 us | -82.02% | FAIL |
| contains | 128 | 1953000 | 20 | 19.356 ms | 12.36% | 14.552 ms | 13.23% | -4804.639 us | -24.82% | FAIL |
| contains | 256 | 1953000 | 20 | 28.436 ms | 11.44% | 23.423 ms | 12.67% | -5013.515 us | -17.63% | FAIL |
| contains | 512 | 1953000 | 20 | 46.872 ms | 8.88% | 40.680 ms | 9.66% | -6192.678 us | -13.21% | FAIL |
| contains | 1024 | 1953000 | 20 | 83.452 ms | 5.72% | 76.338 ms | 9.13% | -7113.844 us | -8.52% | FAIL |
| contains | 32 | 16777216 | 20 | 53.078 ms | 12.70% | 14.976 ms | 13.39% | -38102.392 us | -71.79% | FAIL |
| contains | 64 | 16777216 | 20 | 167.913 ms | 4.22% | 32.235 ms | 11.10% | -135677.562 us | -80.80% | FAIL |
| contains | 32 | 260000 | 80 | 1.155 ms | 60.31% | 277.738 us | 71.88% | -877.466 us | -75.96% | FAIL |
| contains | 64 | 260000 | 80 | 2.814 ms | 34.77% | 559.051 us | 64.44% | -2255.042 us | -80.13% | FAIL |
| contains | 128 | 260000 | 80 | 2.915 ms | 29.19% | 2.118 ms | 27.70% | -797.019 us | -27.34% | PASS |
| contains | 256 | 260000 | 80 | 4.104 ms | 27.16% | 3.163 ms | 20.73% | -940.996 us | -22.93% | FAIL |
| contains | 512 | 260000 | 80 | 6.470 ms | 20.55% | 5.179 ms | 14.62% | -1290.758 us | -19.95% | FAIL |
| contains | 1024 | 260000 | 80 | 11.254 ms | 16.07% | 9.507 ms | 15.62% | -1747.422 us | -15.53% | PASS |
| contains | 32 | 1953000 | 80 | 6.216 ms | 21.56% | 1.632 ms | 33.69% | -4584.014 us | -73.74% | FAIL |
| contains | 64 | 1953000 | 80 | 17.302 ms | 11.75% | 3.446 ms | 21.78% | -13856.063 us | -80.08% | FAIL |
| contains | 128 | 1953000 | 80 | 20.161 ms | 10.77% | 16.085 ms | 13.76% | -4075.838 us | -20.22% | FAIL |
| contains | 256 | 1953000 | 80 | 29.369 ms | 10.06% | 24.060 ms | 12.07% | -5309.288 us | -18.08% | FAIL |
| contains | 512 | 1953000 | 80 | 47.267 ms | 7.39% | 39.604 ms | 9.20% | -7663.106 us | -16.21% | FAIL |
| contains | 1024 | 1953000 | 80 | 83.477 ms | 5.60% | 71.080 ms | 6.14% | -12397.108 us | -14.85% | FAIL |
| contains | 32 | 16777216 | 80 | 51.649 ms | 7.60% | 13.768 ms | 11.78% | -37880.762 us | -73.34% | FAIL |
| contains | 64 | 16777216 | 80 | 144.361 ms | 3.88% | 29.533 ms | 10.26% | -114827.945 us | -79.54% | FAIL |
10 targets, where some of the targets share a first char:
compared to no combination:
| api | row_width | num_rows | hit_rate | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|----------|-------------|------------|------------|------------|-------------|------------|-------------|----------------|---------|----------|
| contains | 32 | 260000 | 20 | 1.244 ms | 63.37% | 324.706 us | 75.61% | -919.780 us | -73.91% | FAIL |
| contains | 64 | 260000 | 20 | 3.193 ms | 35.95% | 624.857 us | 48.86% | -2568.141 us | -80.43% | FAIL |
| contains | 128 | 260000 | 20 | 2.819 ms | 36.12% | 1.771 ms | 22.61% | -1047.634 us | -37.17% | FAIL |
| contains | 256 | 260000 | 20 | 3.915 ms | 32.72% | 2.786 ms | 18.25% | -1128.735 us | -28.83% | FAIL |
| contains | 512 | 260000 | 20 | 6.185 ms | 22.04% | 4.738 ms | 14.59% | -1447.159 us | -23.40% | FAIL |
| contains | 1024 | 260000 | 20 | 11.011 ms | 17.22% | 8.692 ms | 11.25% | -2318.846 us | -21.06% | FAIL |
| contains | 32 | 1953000 | 20 | 6.046 ms | 21.89% | 1.880 ms | 26.32% | -4166.389 us | -68.91% | FAIL |
| contains | 64 | 1953000 | 20 | 19.721 ms | 16.33% | 4.047 ms | 17.52% | -15674.781 us | -79.48% | FAIL |
| contains | 128 | 1953000 | 20 | 19.356 ms | 12.36% | 13.938 ms | 13.10% | -5418.027 us | -27.99% | FAIL |
| contains | 256 | 1953000 | 20 | 28.436 ms | 11.44% | 21.606 ms | 10.63% | -6830.425 us | -24.02% | FAIL |
| contains | 512 | 1953000 | 20 | 46.872 ms | 8.88% | 37.080 ms | 8.94% | -9792.225 us | -20.89% | FAIL |
| contains | 1024 | 1953000 | 20 | 83.452 ms | 5.72% | 67.938 ms | 6.43% | -15514.469 us | -18.59% | FAIL |
| contains | 32 | 16777216 | 20 | 53.078 ms | 12.70% | 16.512 ms | 11.00% | -36566.269 us | -68.89% | FAIL |
| contains | 64 | 16777216 | 20 | 167.913 ms | 4.22% | 35.248 ms | 8.00% | -132664.920 us | -79.01% | FAIL |
| contains | 32 | 260000 | 80 | 1.155 ms | 60.31% | 261.748 us | 68.52% | -893.456 us | -77.34% | FAIL |
| contains | 64 | 260000 | 80 | 2.814 ms | 34.77% | 553.510 us | 61.02% | -2260.583 us | -80.33% | FAIL |
| contains | 128 | 260000 | 80 | 2.915 ms | 29.19% | 2.036 ms | 22.44% | -878.853 us | -30.15% | FAIL |
| contains | 256 | 260000 | 80 | 4.104 ms | 27.16% | 3.046 ms | 19.13% | -1058.541 us | -25.79% | FAIL |
| contains | 512 | 260000 | 80 | 6.470 ms | 20.55% | 5.027 ms | 16.40% | -1443.053 us | -22.30% | FAIL |
| contains | 1024 | 260000 | 80 | 11.254 ms | 16.07% | 9.094 ms | 13.87% | -2159.828 us | -19.19% | FAIL |
| contains | 32 | 1953000 | 80 | 6.216 ms | 21.56% | 1.487 ms | 28.11% | -4729.836 us | -76.09% | FAIL |
| contains | 64 | 1953000 | 80 | 17.302 ms | 11.75% | 3.235 ms | 18.19% | -14067.958 us | -81.31% | FAIL |
| contains | 128 | 1953000 | 80 | 20.161 ms | 10.77% | 15.421 ms | 11.62% | -4739.859 us | -23.51% | FAIL |
| contains | 256 | 1953000 | 80 | 29.369 ms | 10.06% | 23.015 ms | 11.10% | -6354.241 us | -21.64% | FAIL |
| contains | 512 | 1953000 | 80 | 47.267 ms | 7.39% | 38.118 ms | 9.01% | -9148.888 us | -19.36% | FAIL |
| contains | 1024 | 1953000 | 80 | 83.477 ms | 5.60% | 68.878 ms | 5.94% | -14599.543 us | -17.49% | FAIL |
| contains | 32 | 16777216 | 80 | 51.649 ms | 7.60% | 12.519 ms | 11.04% | -39130.076 us | -75.76% | FAIL |
| contains | 64 | 16777216 | 80 | 144.361 ms | 3.88% | 27.425 ms | 10.84% | -116936.242 us | -81.00% | FAIL |
4 targets, where the first chars of the targets are all different:
compared to no combination:
| api | row_width | num_rows | hit_rate | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|----------|-------------|------------|------------|------------|-------------|------------|-------------|---------------|---------|----------|
| contains | 32 | 260000 | 20 | 503.839 us | 84.45% | 213.685 us | 94.82% | -290.154 us | -57.59% | PASS |
| contains | 64 | 260000 | 20 | 1.281 ms | 54.03% | 388.771 us | 65.15% | -891.907 us | -69.64% | FAIL |
| contains | 128 | 260000 | 20 | 1.113 ms | 50.36% | 843.211 us | 52.15% | -270.254 us | -24.27% | PASS |
| contains | 256 | 260000 | 20 | 1.577 ms | 33.38% | 1.339 ms | 44.40% | -238.296 us | -15.11% | PASS |
| contains | 512 | 260000 | 20 | 2.511 ms | 31.78% | 2.227 ms | 25.50% | -284.381 us | -11.32% | PASS |
| contains | 1024 | 260000 | 20 | 4.496 ms | 28.19% | 4.117 ms | 18.04% | -379.538 us | -8.44% | PASS |
| contains | 32 | 1953000 | 20 | 2.447 ms | 42.15% | 1.063 ms | 29.35% | -1383.647 us | -56.55% | FAIL |
| contains | 64 | 1953000 | 20 | 7.663 ms | 18.15% | 2.357 ms | 20.85% | -5305.013 us | -69.23% | FAIL |
| contains | 128 | 1953000 | 20 | 7.556 ms | 16.53% | 6.073 ms | 14.06% | -1482.334 us | -19.62% | FAIL |
| contains | 256 | 1953000 | 20 | 11.284 ms | 16.62% | 9.761 ms | 11.03% | -1523.614 us | -13.50% | FAIL |
| contains | 512 | 1953000 | 20 | 19.124 ms | 16.26% | 17.371 ms | 12.46% | -1753.140 us | -9.17% | PASS |
| contains | 1024 | 1953000 | 20 | 34.605 ms | 12.72% | 32.262 ms | 9.50% | -2342.325 us | -6.77% | PASS |
| contains | 32 | 16777216 | 20 | 20.473 ms | 12.14% | 9.471 ms | 14.16% | -11002.095 us | -53.74% | FAIL |
| contains | 64 | 16777216 | 20 | 67.360 ms | 5.87% | 20.640 ms | 10.20% | -46720.033 us | -69.36% | FAIL |
| contains | 32 | 260000 | 80 | 448.428 us | 64.31% | 182.099 us | 71.09% | -266.328 us | -59.39% | PASS |
| contains | 64 | 260000 | 80 | 1.114 ms | 49.31% | 366.487 us | 53.61% | -747.846 us | -67.11% | FAIL |
| contains | 128 | 260000 | 80 | 1.179 ms | 35.31% | 907.171 us | 33.73% | -272.086 us | -23.07% | PASS |
| contains | 256 | 260000 | 80 | 1.642 ms | 35.77% | 1.317 ms | 23.22% | -324.419 us | -19.76% | PASS |
| contains | 512 | 260000 | 80 | 2.580 ms | 28.13% | 2.230 ms | 28.56% | -350.436 us | -13.58% | PASS |
| contains | 1024 | 260000 | 80 | 4.392 ms | 20.20% | 4.059 ms | 20.67% | -333.550 us | -7.59% | PASS |
| contains | 32 | 1953000 | 80 | 2.374 ms | 33.82% | 957.400 us | 51.44% | -1417.018 us | -59.68% | FAIL |
| contains | 64 | 1953000 | 80 | 6.694 ms | 16.69% | 2.115 ms | 25.07% | -4578.272 us | -68.40% | FAIL |
| contains | 128 | 1953000 | 80 | 7.988 ms | 13.46% | 6.736 ms | 13.49% | -1251.784 us | -15.67% | FAIL |
| contains | 256 | 1953000 | 80 | 11.841 ms | 16.46% | 9.951 ms | 13.03% | -1889.500 us | -15.96% | FAIL |
| contains | 512 | 1953000 | 80 | 19.162 ms | 14.74% | 16.554 ms | 12.89% | -2608.800 us | -13.61% | FAIL |
| contains | 1024 | 1953000 | 80 | 32.791 ms | 10.75% | 30.253 ms | 10.50% | -2538.488 us | -7.74% | PASS |
| contains | 32 | 16777216 | 80 | 19.463 ms | 13.25% | 7.682 ms | 13.65% | -11781.065 us | -60.53% | FAIL |
| contains | 64 | 16777216 | 80 | 56.855 ms | 7.25% | 18.265 ms | 12.82% | -38589.662 us | -67.87% | FAIL |
4 targets, where some of the targets share a first char:
compared to no combination:
| api | row_width | num_rows | hit_rate | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|----------|-------------|------------|------------|------------|-------------|------------|-------------|---------------|---------|----------|
| contains | 32 | 260000 | 20 | 503.839 us | 84.45% | 199.724 us | 101.08% | -304.115 us | -60.36% | PASS |
| contains | 64 | 260000 | 20 | 1.281 ms | 54.03% | 364.286 us | 64.61% | -916.392 us | -71.56% | FAIL |
| contains | 128 | 260000 | 20 | 1.113 ms | 50.36% | 733.357 us | 44.71% | -380.108 us | -34.14% | PASS |
| contains | 256 | 260000 | 20 | 1.577 ms | 33.38% | 1.106 ms | 29.95% | -471.453 us | -29.89% | PASS |
| contains | 512 | 260000 | 20 | 2.511 ms | 31.78% | 1.869 ms | 27.68% | -642.373 us | -25.58% | PASS |
| contains | 1024 | 260000 | 20 | 4.496 ms | 28.19% | 3.549 ms | 25.47% | -947.100 us | -21.06% | PASS |
| contains | 32 | 1953000 | 20 | 2.447 ms | 42.15% | 971.573 us | 35.62% | -1475.034 us | -60.29% | FAIL |
| contains | 64 | 1953000 | 20 | 7.663 ms | 18.15% | 2.117 ms | 10.91% | -5545.202 us | -72.37% | FAIL |
| contains | 128 | 1953000 | 20 | 7.556 ms | 16.53% | 5.499 ms | 13.78% | -2056.349 us | -27.22% | FAIL |
| contains | 256 | 1953000 | 20 | 11.284 ms | 16.62% | 8.469 ms | 12.70% | -2815.805 us | -24.95% | FAIL |
| contains | 512 | 1953000 | 20 | 19.124 ms | 16.26% | 14.610 ms | 12.79% | -4513.816 us | -23.60% | FAIL |
| contains | 1024 | 1953000 | 20 | 34.605 ms | 12.72% | 27.296 ms | 12.87% | -7308.708 us | -21.12% | FAIL |
| contains | 32 | 16777216 | 20 | 20.473 ms | 12.14% | 8.431 ms | 12.77% | -12041.673 us | -58.82% | FAIL |
| contains | 64 | 16777216 | 20 | 67.360 ms | 5.87% | 19.317 ms | 13.87% | -48043.704 us | -71.32% | FAIL |
| contains | 32 | 260000 | 80 | 448.428 us | 64.31% | 166.669 us | 107.95% | -281.759 us | -62.83% | PASS |
| contains | 64 | 260000 | 80 | 1.114 ms | 49.31% | 343.925 us | 68.33% | -770.408 us | -69.14% | FAIL |
| contains | 128 | 260000 | 80 | 1.179 ms | 35.31% | 836.526 us | 35.20% | -342.731 us | -29.06% | PASS |
| contains | 256 | 260000 | 80 | 1.642 ms | 35.77% | 1.235 ms | 26.46% | -407.101 us | -24.80% | PASS |
| contains | 512 | 260000 | 80 | 2.580 ms | 28.13% | 2.036 ms | 20.89% | -544.289 us | -21.10% | FAIL |
| contains | 1024 | 260000 | 80 | 4.392 ms | 20.20% | 3.777 ms | 17.76% | -615.245 us | -14.01% | PASS |
| contains | 32 | 1953000 | 80 | 2.374 ms | 33.82% | 821.901 us | 42.79% | -1552.517 us | -65.39% | FAIL |
| contains | 64 | 1953000 | 80 | 6.694 ms | 16.69% | 1.935 ms | 26.53% | -4758.834 us | -71.09% | FAIL |
| contains | 128 | 1953000 | 80 | 7.988 ms | 13.46% | 6.191 ms | 14.31% | -1796.902 us | -22.49% | FAIL |
| contains | 256 | 1953000 | 80 | 11.841 ms | 16.46% | 9.332 ms | 13.13% | -2508.756 us | -21.19% | FAIL |
| contains | 512 | 1953000 | 80 | 19.162 ms | 14.74% | 15.624 ms | 11.71% | -3537.965 us | -18.46% | FAIL |
| contains | 1024 | 1953000 | 80 | 32.791 ms | 10.75% | 28.783 ms | 10.66% | -4007.760 us | -12.22% | FAIL |
| contains | 32 | 16777216 | 80 | 19.463 ms | 13.25% | 6.751 ms | 14.91% | -12712.142 us | -65.32% | FAIL |
| contains | 64 | 16777216 | 80 | 56.855 ms | 7.25% | 16.452 ms | 10.76% | -40402.235 us | -71.06% | FAIL |
/ok to test
If you could paste the performance benchmarks inside triple backticks in the comments, it would make the tables a bit easier to read in the browser. I've updated the ones you pasted before this comment.
/ok to test
After the commit "Optimize warp parallel",
there is a good performance improvement for long strings (which use the warp-parallel path):
| api | has_duplicated_targets | num_targets | row_width | num_rows | hit_rate | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|----------|--------------------------|---------------|-------------|------------|------------|------------|-------------|------------|-------------|----------------|---------|----------|
| contains | true | 4 | 32 | 260000 | 20 | 491.808 us | 39.94% | 177.894 us | 44.07% | -313.914 us | -63.83% | FAIL |
| contains | false | 4 | 32 | 260000 | 20 | 447.867 us | 39.09% | 187.448 us | 37.29% | -260.419 us | -58.15% | FAIL |
| contains | true | 10 | 32 | 260000 | 20 | 1.102 ms | 34.17% | 262.423 us | 22.23% | -840.015 us | -76.20% | FAIL |
| contains | false | 10 | 32 | 260000 | 20 | 1.162 ms | 74.40% | 250.545 us | 32.03% | -911.606 us | -78.44% | FAIL |
| contains | true | 4 | 64 | 260000 | 20 | 1.189 ms | 23.35% | 329.508 us | 21.28% | -859.115 us | -72.28% | FAIL |
| contains | false | 4 | 64 | 260000 | 20 | 1.196 ms | 27.48% | 351.100 us | 19.57% | -845.209 us | -70.65% | FAIL |
| contains | true | 10 | 64 | 260000 | 20 | 2.955 ms | 18.27% | 528.739 us | 19.73% | -2426.027 us | -82.11% | FAIL |
| contains | false | 10 | 64 | 260000 | 20 | 2.986 ms | 19.98% | 490.192 us | 23.77% | -2495.849 us | -83.58% | FAIL |
| contains | true | 4 | 128 | 260000 | 20 | 1.031 ms | 21.85% | 107.434 us | 48.58% | -923.617 us | -89.58% | FAIL |
| contains | false | 4 | 128 | 260000 | 20 | 1.082 ms | 21.28% | 109.176 us | 56.96% | -972.816 us | -89.91% | FAIL |
| contains | true | 10 | 128 | 260000 | 20 | 2.704 ms | 17.90% | 170.184 us | 46.46% | -2534.223 us | -93.71% | FAIL |
| contains | false | 10 | 128 | 260000 | 20 | 2.722 ms | 18.74% | 165.549 us | 48.78% | -2556.935 us | -93.92% | FAIL |
| contains | true | 4 | 256 | 260000 | 20 | 1.489 ms | 18.51% | 126.164 us | 51.05% | -1362.652 us | -91.53% | FAIL |
| contains | false | 4 | 256 | 260000 | 20 | 1.551 ms | 18.77% | 130.849 us | 44.42% | -1420.629 us | -91.57% | FAIL |
| contains | true | 10 | 256 | 260000 | 20 | 3.925 ms | 17.14% | 195.054 us | 40.52% | -3730.333 us | -95.03% | FAIL |
| contains | false | 10 | 256 | 260000 | 20 | 3.975 ms | 18.84% | 183.471 us | 46.25% | -3791.508 us | -95.38% | FAIL |
| contains | true | 4 | 512 | 260000 | 20 | 2.437 ms | 15.84% | 163.460 us | 50.12% | -2273.846 us | -93.29% | FAIL |
| contains | false | 4 | 512 | 260000 | 20 | 2.524 ms | 15.38% | 165.471 us | 50.28% | -2358.526 us | -93.44% | FAIL |
| contains | true | 10 | 512 | 260000 | 20 | 6.334 ms | 11.63% | 244.177 us | 37.41% | -6090.222 us | -96.15% | FAIL |
| contains | false | 10 | 512 | 260000 | 20 | 6.333 ms | 11.54% | 226.260 us | 38.18% | -6106.959 us | -96.43% | FAIL |
| contains | true | 4 | 1024 | 260000 | 20 | 4.224 ms | 10.77% | 237.020 us | 31.24% | -3987.161 us | -94.39% | FAIL |
| contains | false | 4 | 1024 | 260000 | 20 | 4.631 ms | 18.43% | 243.099 us | 27.34% | -4387.587 us | -94.75% | FAIL |
| contains | true | 10 | 1024 | 260000 | 20 | 11.229 ms | 7.87% | 353.633 us | 22.62% | -10875.385 us | -96.85% | FAIL |
| contains | false | 10 | 1024 | 260000 | 20 | 11.065 ms | 7.96% | 317.475 us | 31.48% | -10747.774 us | -97.13% | FAIL |
| contains | true | 4 | 32 | 1953000 | 20 | 2.516 ms | 17.03% | 908.590 us | 14.46% | -1607.368 us | -63.89% | FAIL |
| contains | false | 4 | 32 | 1953000 | 20 | 2.449 ms | 17.01% | 1.006 ms | 12.47% | -1442.907 us | -58.92% | FAIL |
| contains | true | 10 | 32 | 1953000 | 20 | 6.076 ms | 10.50% | 1.580 ms | 10.51% | -4496.088 us | -74.00% | FAIL |
| contains | false | 10 | 32 | 1953000 | 20 | 6.106 ms | 10.13% | 1.492 ms | 10.72% | -4614.225 us | -75.57% | FAIL |
| contains | true | 4 | 64 | 1953000 | 20 | 8.019 ms | 7.12% | 1.993 ms | 9.24% | -6026.227 us | -75.15% | FAIL |
| contains | false | 4 | 64 | 1953000 | 20 | 7.910 ms | 7.58% | 2.197 ms | 13.17% | -5713.501 us | -72.23% | FAIL |
| contains | true | 10 | 64 | 1953000 | 20 | 19.660 ms | 5.08% | 3.473 ms | 11.89% | -16187.161 us | -82.33% | FAIL |
| contains | false | 10 | 64 | 1953000 | 20 | 19.679 ms | 4.89% | 3.274 ms | 16.41% | -16405.000 us | -83.36% | FAIL |
| contains | true | 4 | 128 | 1953000 | 20 | 7.192 ms | 7.55% | 386.374 us | 29.06% | -6805.974 us | -94.63% | FAIL |
| contains | false | 4 | 128 | 1953000 | 20 | 7.505 ms | 5.81% | 386.765 us | 25.93% | -7118.219 us | -94.85% | FAIL |
| contains | true | 10 | 128 | 1953000 | 20 | 18.753 ms | 4.26% | 711.342 us | 15.99% | -18042.137 us | -96.21% | FAIL |
| contains | false | 10 | 128 | 1953000 | 20 | 18.701 ms | 4.12% | 681.406 us | 14.89% | -18020.067 us | -96.36% | FAIL |
| contains | true | 4 | 256 | 1953000 | 20 | 10.664 ms | 6.11% | 512.822 us | 24.27% | -10151.460 us | -95.19% | FAIL |
| contains | false | 4 | 256 | 1953000 | 20 | 11.089 ms | 5.81% | 521.197 us | 18.93% | -10567.751 us | -95.30% | FAIL |
| contains | true | 10 | 256 | 1953000 | 20 | 27.645 ms | 3.67% | 904.163 us | 18.61% | -26740.442 us | -96.73% | FAIL |
| contains | false | 10 | 256 | 1953000 | 20 | 27.474 ms | 3.80% | 840.781 us | 14.67% | -26633.283 us | -96.94% | FAIL |
| contains | true | 4 | 512 | 1953000 | 20 | 17.381 ms | 4.76% | 770.780 us | 14.85% | -16610.365 us | -95.57% | FAIL |
| contains | false | 4 | 512 | 1953000 | 20 | 18.266 ms | 4.63% | 799.207 us | 14.54% | -17466.740 us | -95.62% | FAIL |
| contains | true | 10 | 512 | 1953000 | 20 | 45.455 ms | 2.63% | 1.273 ms | 10.24% | -44182.276 us | -97.20% | FAIL |
| contains | false | 10 | 512 | 1953000 | 20 | 44.950 ms | 2.55% | 1.159 ms | 13.41% | -43791.057 us | -97.42% | FAIL |
| contains | true | 4 | 1024 | 1953000 | 20 | 30.730 ms | 3.34% | 1.285 ms | 11.07% | -29445.081 us | -95.82% | FAIL |
| contains | false | 4 | 1024 | 1953000 | 20 | 32.495 ms | 3.47% | 1.352 ms | 12.26% | -31143.256 us | -95.84% | FAIL |
| contains | true | 10 | 1024 | 1953000 | 20 | 81.010 ms | 1.21% | 2.019 ms | 8.70% | -78991.656 us | -97.51% | FAIL |
| contains | false | 10 | 1024 | 1953000 | 20 | 80.126 ms | 1.17% | 1.799 ms | 10.28% | -78326.454 us | -97.75% | FAIL |
| contains | true | 4 | 32 | 16777216 | 20 | 20.255 ms | 4.72% | 7.889 ms | 7.67% | -12366.231 us | -61.05% | FAIL |
| contains | false | 4 | 32 | 16777216 | 20 | 19.826 ms | 4.67% | 8.830 ms | 8.55% | -10996.830 us | -55.47% | FAIL |
| contains | true | 10 | 32 | 16777216 | 20 | 49.594 ms | 2.71% | 14.062 ms | 6.19% | -35531.640 us | -71.64% | FAIL |
| contains | false | 10 | 32 | 16777216 | 20 | 49.814 ms | 2.63% | 13.315 ms | 7.21% | -36499.052 us | -73.27% | FAIL |
| contains | true | 4 | 64 | 16777216 | 20 | 66.215 ms | 1.67% | 17.851 ms | 6.25% | -48364.244 us | -73.04% | FAIL |
| contains | false | 4 | 64 | 16777216 | 20 | 65.453 ms | 1.71% | 19.678 ms | 6.70% | -45774.805 us | -69.94% | FAIL |
| contains | true | 10 | 64 | 16777216 | 20 | 166.821 ms | 3.36% | 30.590 ms | 4.24% | -136230.946 us | -81.66% | FAIL |
| contains | false | 10 | 64 | 16777216 | 20 | 175.021 ms | 5.74% | 27.876 ms | 4.56% | -147145.080 us | -84.07% | FAIL |
| contains | true | 4 | 32 | 260000 | 80 | 465.476 us | 60.80% | 159.590 us | 43.75% | -305.886 us | -65.71% | FAIL |
| contains | false | 4 | 32 | 260000 | 80 | 439.163 us | 41.96% | 175.797 us | 35.37% | -263.366 us | -59.97% | FAIL |
| contains | true | 10 | 32 | 260000 | 80 | 1.068 ms | 36.62% | 246.616 us | 29.68% | -820.960 us | -76.90% | FAIL |
| contains | false | 10 | 32 | 260000 | 80 | 1.140 ms | 29.97% | 252.795 us | 20.87% | -886.970 us | -77.82% | FAIL |
| contains | true | 4 | 64 | 260000 | 80 | 1.104 ms | 24.93% | 327.742 us | 26.49% | -776.085 us | -70.31% | FAIL |
| contains | false | 4 | 64 | 260000 | 80 | 1.160 ms | 32.92% | 349.014 us | 24.68% | -811.371 us | -69.92% | FAIL |
| contains | true | 10 | 64 | 260000 | 80 | 2.868 ms | 25.95% | 500.693 us | 16.99% | -2367.279 us | -82.54% | FAIL |
| contains | false | 10 | 64 | 260000 | 80 | 2.952 ms | 16.49% | 495.460 us | 16.50% | -2456.074 us | -83.21% | FAIL |
| contains | true | 4 | 128 | 260000 | 80 | 1.127 ms | 24.82% | 117.624 us | 47.87% | -1009.086 us | -89.56% | FAIL |
| contains | false | 4 | 128 | 260000 | 80 | 1.166 ms | 24.40% | 113.397 us | 38.50% | -1052.938 us | -90.28% | FAIL |
| contains | true | 10 | 128 | 260000 | 80 | 2.925 ms | 18.74% | 177.758 us | 44.96% | -2747.692 us | -93.92% | FAIL |
| contains | false | 10 | 128 | 260000 | 80 | 2.924 ms | 17.64% | 172.309 us | 35.01% | -2752.006 us | -94.11% | FAIL |
| contains | true | 4 | 256 | 260000 | 80 | 1.628 ms | 24.45% | 138.432 us | 34.84% | -1489.300 us | -91.50% | FAIL |
| contains | false | 4 | 256 | 260000 | 80 | 1.640 ms | 19.86% | 132.031 us | 37.20% | -1508.445 us | -91.95% | FAIL |
| contains | true | 10 | 256 | 260000 | 80 | 4.051 ms | 15.03% | 202.121 us | 29.54% | -3848.950 us | -95.01% | FAIL |
| contains | false | 10 | 256 | 260000 | 80 | 4.072 ms | 15.23% | 193.882 us | 37.33% | -3878.175 us | -95.24% | FAIL |
| contains | true | 4 | 512 | 260000 | 80 | 2.503 ms | 14.63% | 182.269 us | 36.21% | -2320.971 us | -92.72% | FAIL |
| contains | false | 4 | 512 | 260000 | 80 | 2.537 ms | 13.85% | 172.554 us | 41.67% | -2364.006 us | -93.20% | FAIL |
| contains | true | 10 | 512 | 260000 | 80 | 6.327 ms | 11.10% | 250.446 us | 27.42% | -6076.351 us | -96.04% | FAIL |
| contains | false | 10 | 512 | 260000 | 80 | 6.425 ms | 11.42% | 236.335 us | 31.11% | -6188.394 us | -96.32% | FAIL |
| contains | true | 4 | 1024 | 260000 | 80 | 4.357 ms | 10.94% | 274.092 us | 22.29% | -4082.950 us | -93.71% | FAIL |
| contains | false | 4 | 1024 | 260000 | 80 | 4.388 ms | 11.18% | 252.370 us | 29.54% | -4136.034 us | -94.25% | FAIL |
| contains | true | 10 | 1024 | 260000 | 80 | 11.049 ms | 8.46% | 352.035 us | 22.85% | -10697.310 us | -96.81% | FAIL |
| contains | false | 10 | 1024 | 260000 | 80 | 11.158 ms | 8.27% | 329.846 us | 27.99% | -10828.209 us | -97.04% | FAIL |
| contains | true | 4 | 32 | 1953000 | 80 | 2.432 ms | 15.69% | 774.084 us | 13.86% | -1657.698 us | -68.17% | FAIL |
| contains | false | 4 | 32 | 1953000 | 80 | 2.381 ms | 17.02% | 876.659 us | 9.49% | -1504.832 us | -63.19% | FAIL |
| contains | true | 10 | 32 | 1953000 | 80 | 5.593 ms | 11.88% | 1.374 ms | 8.32% | -4219.759 us | -75.44% | FAIL |
| contains | false | 10 | 32 | 1953000 | 80 | 6.253 ms | 10.38% | 1.431 ms | 7.68% | -4821.960 us | -77.11% | FAIL |
| contains | true | 4 | 64 | 1953000 | 80 | 6.892 ms | 8.02% | 1.827 ms | 6.30% | -5064.806 us | -73.49% | FAIL |
| contains | false | 4 | 64 | 1953000 | 80 | 6.695 ms | 7.70% | 2.044 ms | 16.81% | -4650.699 us | -69.47% | FAIL |
| contains | true | 10 | 64 | 1953000 | 80 | 16.007 ms | 5.08% | 3.056 ms | 11.32% | -12951.351 us | -80.91% | FAIL |
| contains | false | 10 | 64 | 1953000 | 80 | 18.131 ms | 4.98% | 3.100 ms | 10.58% | -15031.497 us | -82.90% | FAIL |
| contains | true | 4 | 128 | 1953000 | 80 | 7.601 ms | 6.93% | 446.342 us | 16.90% | -7154.395 us | -94.13% | FAIL |
| contains | false | 4 | 128 | 1953000 | 80 | 8.001 ms | 6.53% | 422.714 us | 15.34% | -7578.128 us | -94.72% | FAIL |
| contains | true | 10 | 128 | 1953000 | 80 | 20.011 ms | 4.16% | 777.401 us | 10.50% | -19233.326 us | -96.12% | FAIL |
| contains | false | 10 | 128 | 1953000 | 80 | 19.947 ms | 3.56% | 738.433 us | 13.14% | -19209.037 us | -96.30% | FAIL |
| contains | true | 4 | 256 | 1953000 | 80 | 11.289 ms | 6.86% | 604.781 us | 11.85% | -10684.705 us | -94.64% | FAIL |
| contains | false | 4 | 256 | 1953000 | 80 | 11.653 ms | 6.44% | 563.360 us | 18.14% | -11089.333 us | -95.17% | FAIL |
| contains | true | 10 | 256 | 1953000 | 80 | 28.989 ms | 4.07% | 963.368 us | 11.19% | -28025.969 us | -96.68% | FAIL |
| contains | false | 10 | 256 | 1953000 | 80 | 29.227 ms | 4.28% | 894.524 us | 11.52% | -28332.314 us | -96.94% | FAIL |
| contains | true | 4 | 512 | 1953000 | 80 | 18.230 ms | 4.97% | 928.427 us | 8.98% | -17301.859 us | -94.91% | FAIL |
| contains | false | 4 | 512 | 1953000 | 80 | 19.118 ms | 7.26% | 842.596 us | 11.34% | -18275.059 us | -95.59% | FAIL |
| contains | true | 10 | 512 | 1953000 | 80 | 46.668 ms | 3.23% | 1.313 ms | 7.73% | -45355.073 us | -97.19% | FAIL |
| contains | false | 10 | 512 | 1953000 | 80 | 47.268 ms | 3.66% | 1.215 ms | 10.28% | -46052.467 us | -97.43% | FAIL |
| contains | true | 4 | 1024 | 1953000 | 80 | 32.666 ms | 5.16% | 1.593 ms | 7.88% | -31073.329 us | -95.12% | FAIL |
| contains | false | 4 | 1024 | 1953000 | 80 | 32.889 ms | 4.64% | 1.419 ms | 8.41% | -31469.794 us | -95.69% | FAIL |
| contains | true | 10 | 1024 | 1953000 | 80 | 81.290 ms | 1.51% | 2.049 ms | 7.26% | -79240.375 us | -97.48% | FAIL |
| contains | false | 10 | 1024 | 1953000 | 80 | 83.012 ms | 1.63% | 1.883 ms | 8.00% | -81129.182 us | -97.73% | FAIL |
| contains | true | 4 | 32 | 16777216 | 80 | 20.541 ms | 7.75% | 6.550 ms | 9.23% | -13991.403 us | -68.11% | FAIL |
| contains | false | 4 | 32 | 16777216 | 80 | 19.800 ms | 6.77% | 7.678 ms | 8.08% | -12122.169 us | -61.22% | FAIL |
| contains | true | 10 | 32 | 16777216 | 80 | 45.904 ms | 4.12% | 12.378 ms | 8.57% | -33526.749 us | -73.04% | FAIL |
| contains | false | 10 | 32 | 16777216 | 80 | 52.646 ms | 4.39% | 13.038 ms | 8.81% | -39608.033 us | -75.23% | FAIL |
| contains | true | 4 | 64 | 16777216 | 80 | 59.820 ms | 3.46% | 17.308 ms | 7.70% | -42511.449 us | -71.07% | FAIL |
| contains | false | 4 | 64 | 16777216 | 80 | 58.198 ms | 3.84% | 18.885 ms | 7.54% | -39312.954 us | -67.55% | FAIL |
| contains | true | 10 | 64 | 16777216 | 80 | 138.783 ms | 2.25% | 26.559 ms | 6.65% | -112224.207 us | -80.86% | FAIL |
| contains | false | 10 | 64 | 16777216 | 80 | 154.647 ms | 1.21% | 28.177 ms | 6.25% | -126470.330 us | -81.78% | FAIL |
TODO: The current warp-parallel implementation uses a large amount of shared memory when the number of targets is large. I will post a commit that splits the targets into small groups (16 targets per group) and executes each group separately.
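To make the grouping idea concrete, here is a minimal host-side sketch of splitting the target indexes into groups of at most 16. It is not the code in this PR: `targets_per_group` and `split_targets` are hypothetical names used only for illustration, and the sketch assumes each group would be handed to its own kernel pass so that the shared memory needed per block depends on the group size rather than on the total number of targets.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical constant: the TODO above mentions groups of 16 targets.
constexpr std::size_t targets_per_group = 16;

// Returns half-open [begin, end) ranges of target indexes.
// Every range holds at most `group_size` targets; the last one may be shorter.
inline std::vector<std::pair<std::size_t, std::size_t>> split_targets(
  std::size_t num_targets, std::size_t group_size = targets_per_group)
{
  std::vector<std::pair<std::size_t, std::size_t>> groups;
  if (group_size == 0) { return groups; }  // guard against an endless loop
  for (std::size_t begin = 0; begin < num_targets; begin += group_size) {
    groups.emplace_back(begin, std::min(begin + group_size, num_targets));
  }
  return groups;
}
```

With 17 targets this yields two groups, `[0, 16)` and `[16, 17)`, while 10 targets stay in a single group.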
/ok to test
> TODO: The current warp-parallel implementation uses a large amount of shared memory when the number of targets is large. I will post a commit that splits the targets into small groups (16 targets per group) and executes each group separately.

Done.
/ok to test
The benchmark run fails with combine=true when the warp_parallel kernel is enabled:
$ benchmarks/STRINGS_NVBENCH -d 0 -b find_string --axis api=contains
RMM memory resource = pool
CUIO host memory resource = pinned_pool
# Devices
## [0] `Quadro GV100`
* SM Version: 700 (PTX Version: 700)
* Number of SMs: 80
* SM Default Clock Rate: 1627 MHz
* Global Memory: 15872 MiB Free / 32491 MiB Total
* Global Memory Bus Peak: 870 GB/sec (4096-bit DDR @850MHz)
* Max Shared Memory: 96 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 32/SM
* Maximum Active Threads: 2048/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No
# Log
Run: [1/36] find_string [Device=0 api=contains row_width=32 num_rows=260000 hit_rate=20]
Pass: Cold: 0.370707ms GPU, 0.375213ms CPU, 0.67s total GPU, 0.71s total wall, 1808x
Run: [2/36] find_string [Device=0 api=contains row_width=64 num_rows=260000 hit_rate=20]
Pass: Cold: 0.549474ms GPU, 0.553811ms CPU, 0.98s total GPU, 1.02s total wall, 1792x
Run: [3/36] find_string [Device=0 api=contains row_width=128 num_rows=260000 hit_rate=20]
/cudf/cpp/build/_deps/nvbench-src/nvbench/blocking_kernel.cu:113: Cuda API call returned error: cudaErrorIllegalAddress: an illegal memory access was encountered
/ok to test
TODO: Rerun the performance benchmarks.
/ok to test
There are still bugs; I will fix them ASAP. [Done]
/ok to test
10 targets:
| api | row_width | num_rows | hit_rate | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|----------------|-------------|------------|------------|------------|-------------|------------|-------------|----------------|---------|----------|
| multi-contains | 32 | 260000 | 20 | 1.218 ms | 65.51% | 268.035 us | 78.93% | -950.452 us | -78.00% | FAIL |
| multi-contains | 64 | 260000 | 20 | 3.110 ms | 34.86% | 501.955 us | 55.99% | -2608.098 us | -83.86% | FAIL |
| multi-contains | 128 | 260000 | 20 | 2.732 ms | 34.97% | 2.278 ms | 24.60% | -454.906 us | -16.65% | PASS |
| multi-contains | 256 | 260000 | 20 | 3.858 ms | 30.00% | 2.935 ms | 19.55% | -922.921 us | -23.92% | FAIL |
| multi-contains | 512 | 260000 | 20 | 6.166 ms | 23.28% | 4.234 ms | 18.92% | -1931.388 us | -31.32% | FAIL |
| multi-contains | 1024 | 260000 | 20 | 10.958 ms | 17.84% | 7.056 ms | 16.60% | -3902.223 us | -35.61% | FAIL |
| multi-contains | 32 | 1953000 | 20 | 6.061 ms | 23.65% | 1.503 ms | 25.71% | -4558.334 us | -75.21% | FAIL |
| multi-contains | 64 | 1953000 | 20 | 19.317 ms | 12.36% | 3.178 ms | 18.26% | -16139.825 us | -83.55% | FAIL |
| multi-contains | 128 | 1953000 | 20 | 19.104 ms | 12.10% | 17.400 ms | 12.08% | -1704.042 us | -8.92% | PASS |
| multi-contains | 256 | 1953000 | 20 | 28.448 ms | 12.48% | 22.440 ms | 10.52% | -6008.533 us | -21.12% | FAIL |
| multi-contains | 512 | 1953000 | 20 | 46.219 ms | 9.19% | 32.867 ms | 10.45% | -13351.815 us | -28.89% | FAIL |
| multi-contains | 1024 | 1953000 | 20 | 82.983 ms | 6.03% | 54.032 ms | 8.60% | -28950.742 us | -34.89% | FAIL |
| multi-contains | 32 | 16777216 | 20 | 51.669 ms | 9.30% | 14.716 ms | 18.01% | -36952.819 us | -71.52% | FAIL |
| multi-contains | 64 | 16777216 | 20 | 173.188 ms | 3.82% | 32.558 ms | 12.99% | -140629.639 us | -81.20% | FAIL |
| multi-contains | 32 | 260000 | 80 | 1.172 ms | 57.81% | 289.641 us | 125.07% | -882.034 us | -75.28% | FAIL |
| multi-contains | 64 | 260000 | 80 | 2.927 ms | 35.41% | 561.831 us | 93.10% | -2365.275 us | -80.81% | FAIL |
| multi-contains | 128 | 260000 | 80 | 2.928 ms | 32.18% | 2.266 ms | 46.43% | -662.390 us | -22.62% | PASS |
| multi-contains | 256 | 260000 | 80 | 4.074 ms | 25.41% | 3.188 ms | 35.26% | -886.029 us | -21.75% | PASS |
| multi-contains | 512 | 260000 | 80 | 6.402 ms | 19.86% | 4.671 ms | 30.59% | -1731.671 us | -27.05% | FAIL |
| multi-contains | 1024 | 260000 | 80 | 11.119 ms | 14.54% | 8.237 ms | 24.30% | -2881.433 us | -25.92% | FAIL |
| multi-contains | 32 | 1953000 | 80 | 6.361 ms | 23.97% | 1.588 ms | 56.19% | -4773.543 us | -75.04% | FAIL |
| multi-contains | 64 | 1953000 | 80 | 18.283 ms | 12.31% | 3.546 ms | 37.89% | -14736.689 us | -80.60% | FAIL |
| multi-contains | 128 | 1953000 | 80 | 20.067 ms | 10.31% | 18.718 ms | 18.45% | -1349.610 us | -6.73% | PASS |
| multi-contains | 256 | 1953000 | 80 | 29.301 ms | 9.70% | 24.991 ms | 14.36% | -4310.135 us | -14.71% | FAIL |
| multi-contains | 512 | 1953000 | 80 | 50.853 ms | 16.78% | 34.222 ms | 16.97% | -16630.364 us | -32.70% | FAIL |
| multi-contains | 1024 | 1953000 | 80 | 86.514 ms | 11.20% | 53.475 ms | 9.21% | -33038.528 us | -38.19% | FAIL |
| multi-contains | 32 | 16777216 | 80 | 53.945 ms | 15.76% | 12.412 ms | 14.38% | -41532.866 us | -76.99% | FAIL |
| multi-contains | 64 | 16777216 | 80 | 157.562 ms | 5.01% | 26.559 ms | 12.01% | -131002.626 us | -83.14% | FAIL |
17 targets (this triggers splitting the targets into groups for long strings):
| api | row_width | num_rows | hit_rate | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|----------------|-------------|------------|------------|------------|-------------|------------|-------------|----------------|---------|----------|
| multi-contains | 32 | 260000 | 20 | 1.897 ms | 45.79% | 379.193 us | 80.83% | -1517.685 us | -80.01% | FAIL |
| multi-contains | 64 | 260000 | 20 | 5.116 ms | 26.24% | 670.106 us | 54.93% | -4445.432 us | -86.90% | FAIL |
| multi-contains | 128 | 260000 | 20 | 4.602 ms | 28.92% | 4.188 ms | 26.63% | -414.147 us | -9.00% | PASS |
| multi-contains | 256 | 260000 | 20 | 6.583 ms | 25.17% | 5.115 ms | 24.46% | -1468.011 us | -22.30% | PASS |
| multi-contains | 512 | 260000 | 20 | 10.581 ms | 18.20% | 7.341 ms | 22.93% | -3239.323 us | -30.62% | FAIL |
| multi-contains | 1024 | 260000 | 20 | 18.769 ms | 13.08% | 11.707 ms | 15.58% | -7062.500 us | -37.63% | FAIL |
| multi-contains | 32 | 1953000 | 20 | 10.407 ms | 18.16% | 2.102 ms | 33.86% | -8304.688 us | -79.80% | FAIL |
| multi-contains | 64 | 1953000 | 20 | 32.780 ms | 7.96% | 4.344 ms | 23.12% | -28436.560 us | -86.75% | FAIL |
| multi-contains | 128 | 1953000 | 20 | 32.429 ms | 8.71% | 30.392 ms | 11.18% | -2036.997 us | -6.28% | PASS |
| multi-contains | 256 | 1953000 | 20 | 47.826 ms | 7.99% | 38.908 ms | 10.68% | -8917.890 us | -18.65% | FAIL |
| multi-contains | 512 | 1953000 | 20 | 78.474 ms | 5.52% | 55.672 ms | 8.11% | -22801.487 us | -29.06% | FAIL |
| multi-contains | 1024 | 1953000 | 20 | 147.366 ms | 9.68% | 90.115 ms | 5.53% | -57250.488 us | -38.85% | FAIL |
| multi-contains | 32 | 16777216 | 20 | 87.389 ms | 6.52% | 18.063 ms | 12.45% | -69325.973 us | -79.33% | FAIL |
| multi-contains | 64 | 16777216 | 20 | 295.148 ms | 3.85% | 37.715 ms | 9.51% | -257432.763 us | -87.22% | FAIL |
| multi-contains | 32 | 260000 | 80 | 1.897 ms | 45.14% | 332.214 us | 100.03% | -1564.822 us | -82.49% | FAIL |
| multi-contains | 64 | 260000 | 80 | 4.984 ms | 35.32% | 656.064 us | 62.40% | -4328.034 us | -86.84% | FAIL |
| multi-contains | 128 | 260000 | 80 | 5.002 ms | 32.97% | 3.685 ms | 27.45% | -1316.759 us | -26.32% | PASS |
| multi-contains | 256 | 260000 | 80 | 7.020 ms | 27.36% | 4.751 ms | 25.48% | -2268.563 us | -32.32% | FAIL |
| multi-contains | 512 | 260000 | 80 | 11.013 ms | 23.13% | 6.907 ms | 22.24% | -4105.964 us | -37.28% | FAIL |
| multi-contains | 1024 | 260000 | 80 | 19.126 ms | 16.63% | 11.269 ms | 17.34% | -7857.560 us | -41.08% | FAIL |
| multi-contains | 32 | 1953000 | 80 | 10.463 ms | 21.13% | 1.944 ms | 33.11% | -8518.859 us | -81.42% | FAIL |
| multi-contains | 64 | 1953000 | 80 | 31.039 ms | 12.54% | 3.975 ms | 22.84% | -27063.687 us | -87.19% | FAIL |
| multi-contains | 128 | 1953000 | 80 | 34.141 ms | 10.37% | 26.625 ms | 10.75% | -7515.375 us | -22.01% | FAIL |
| multi-contains | 256 | 1953000 | 80 | 49.650 ms | 8.39% | 35.039 ms | 10.16% | -14611.786 us | -29.43% | FAIL |
| multi-contains | 512 | 1953000 | 80 | 79.993 ms | 5.78% | 51.117 ms | 7.35% | -28875.400 us | -36.10% | FAIL |
| multi-contains | 1024 | 1953000 | 80 | 141.384 ms | 2.82% | 84.648 ms | 5.97% | -56736.002 us | -40.13% | FAIL |
| multi-contains | 32 | 16777216 | 80 | 84.886 ms | 5.75% | 16.247 ms | 13.63% | -68639.457 us | -80.86% | FAIL |
| multi-contains | 64 | 16777216 | 80 | 262.663 ms | 2.30% | 33.554 ms | 9.06% | -229109.558 us | -87.23% | FAIL |
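For anyone reading the tables above, here is a small sketch of how the Diff, %Diff and Status columns appear to be derived. This is inferred from the rows themselves, not taken from the benchmark tooling: it assumes Diff = Cmp - Ref, %Diff = Diff / Ref, and that Status is PASS when |%Diff| is within the smaller of the two noise values and FAIL otherwise. Under that reading, FAIL means "the difference is larger than the noise", not that the run failed. `comparison` and `compare_times` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <string>

// Hypothetical helper type for one table row.
struct comparison {
  double diff_ms;      // Cmp - Ref (negative means the new code is faster)
  double pct_diff;     // diff relative to Ref, in percent
  double speedup;      // Ref / Cmp
  std::string status;  // "PASS" if the change is within noise, else "FAIL"
};

// Assumed rule, inferred from the tables: compare |%Diff| against the
// smaller of the two noise percentages.
inline comparison compare_times(double ref_ms,
                                double cmp_ms,
                                double ref_noise_pct,
                                double cmp_noise_pct)
{
  double const diff      = cmp_ms - ref_ms;
  double const pct       = diff / ref_ms * 100.0;
  double const min_noise = std::min(ref_noise_pct, cmp_noise_pct);
  return {diff, pct, ref_ms / cmp_ms, std::abs(pct) <= min_noise ? "PASS" : "FAIL"};
}
```

For example, the last row of the 17-target table (Ref 262.663 ms, Cmp 33.554 ms) gives a %Diff of about -87.2%, which is roughly a 7.8x speedup. The values computed from the rounded table entries can differ slightly from the printed %Diff, since the tables were presumably generated from unrounded timings.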
Superseded by https://github.com/rapidsai/cudf/pull/16900, so closing this.