cccl
Add support for large `num_items` to `device_select.cuh`
### Tasks
- [x] benchmark unsigned versus signed offset types in `DeviceSelect`
- [x] try to mitigate any perf degradation resulting from moving to an unsigned offset type
- [x] https://github.com/NVIDIA/cccl/issues/1584
- [x] add tests for inputs close to `std::numeric_limits<OffsetT>::max()` for 32-bit offset types to make sure we don't run into integer overflows during offset
- [ ] https://github.com/NVIDIA/cccl/issues/2311
- [ ] https://github.com/NVIDIA/cccl/issues/2312
- [ ] https://github.com/NVIDIA/cccl/issues/2238
- [x] Benchmark changes against main for small, medium, and very large number of items
- [ ] https://github.com/NVIDIA/cccl/issues/1486
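To illustrate the kind of overflow the near-`std::numeric_limits<OffsetT>::max()` tests are meant to catch, here is a minimal host-side sketch. The helper names `naive_num_tiles` and `safe_num_tiles` and the tile size of 512 are hypothetical and are not taken from CUB; the naive variant uses an unsigned type only so that the wraparound is well-defined and can be observed (with a signed type the same expression would be undefined behaviour).

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Naive round-up division: the intermediate sum `num_items + tile_size - 1`
// can exceed the range of a 32-bit offset when num_items is close to max().
// With an unsigned type the sum wraps around silently.
inline std::uint32_t naive_num_tiles(std::uint32_t num_items, std::uint32_t tile_size)
{
  assert(tile_size > 0);
  return (num_items + tile_size - 1) / tile_size;
}

// Overflow-free round-up division: never forms a value larger than num_items.
template <typename OffsetT>
OffsetT safe_num_tiles(OffsetT num_items, OffsetT tile_size)
{
  assert(tile_size > 0);
  return num_items / tile_size + (num_items % tile_size != 0 ? 1 : 0);
}
```

With `num_items` at the 32-bit unsigned maximum and a tile size of 512, the naive version wraps and reports zero tiles, while the safe version reports 8388608.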
cub.bench.select.if.base: signed versus unsigned offset types
[0] Tesla V100-SXM2-32GB
T{ct} | OffsetT{ct} | Elements{io} | Entropy | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
---|---|---|---|---|---|---|---|---|---|---|
I8 | I32 | 2^16 | 1 | 8.930 us | 6.37% | 8.984 us | 6.42% | 0.054 us | 0.61% | PASS |
I8 | I32 | 2^20 | 1 | 14.559 us | 3.37% | 14.431 us | 3.13% | -0.128 us | -0.88% | PASS |
I8 | I32 | 2^24 | 1 | 104.069 us | 0.95% | 99.623 us | 0.98% | -4.446 us | -4.27% | FAIL |
I8 | I32 | 2^28 | 1 | 1.540 ms | 0.50% | 1.474 ms | 0.50% | -66.427 us | -4.31% | FAIL |
I8 | I32 | 2^16 | 0.544 | 8.799 us | 5.77% | 8.847 us | 5.62% | 0.048 us | 0.55% | PASS |
I8 | I32 | 2^20 | 0.544 | 14.434 us | 3.36% | 14.151 us | 3.22% | -0.283 us | -1.96% | PASS |
I8 | I32 | 2^24 | 0.544 | 96.522 us | 0.84% | 91.525 us | 0.82% | -4.997 us | -5.18% | FAIL |
I8 | I32 | 2^28 | 0.544 | 1.414 ms | 0.50% | 1.334 ms | 0.50% | -79.674 us | -5.64% | FAIL |
I8 | I32 | 2^16 | 0 | 8.477 us | 5.46% | 8.453 us | 5.29% | -0.023 us | -0.28% | PASS |
I8 | I32 | 2^20 | 0 | 13.636 us | 3.62% | 13.511 us | 3.32% | -0.125 us | -0.92% | PASS |
I8 | I32 | 2^24 | 0 | 87.201 us | 0.68% | 82.173 us | 0.72% | -5.028 us | -5.77% | FAIL |
I8 | I32 | 2^28 | 0 | 1.245 ms | 0.45% | 1.161 ms | 0.49% | -83.970 us | -6.74% | FAIL |
I8 | I64 | 2^16 | 1 | 8.770 us | 5.82% | 8.824 us | 5.70% | 0.055 us | 0.62% | PASS |
I8 | I64 | 2^20 | 1 | 14.659 us | 3.37% | 14.847 us | 3.50% | 0.189 us | 1.29% | PASS |
I8 | I64 | 2^24 | 1 | 108.105 us | 0.73% | 108.920 us | 0.75% | 0.814 us | 0.75% | FAIL |
I8 | I64 | 2^28 | 1 | 1.607 ms | 0.50% | 1.617 ms | 0.50% | 10.795 us | 0.67% | FAIL |
I8 | I64 | 2^16 | 0.544 | 8.773 us | 5.82% | 8.813 us | 5.74% | 0.040 us | 0.46% | PASS |
I8 | I64 | 2^20 | 0.544 | 14.354 us | 3.24% | 14.406 us | 3.03% | 0.052 us | 0.36% | PASS |
I8 | I64 | 2^24 | 0.544 | 101.642 us | 0.68% | 102.324 us | 0.70% | 0.683 us | 0.67% | PASS |
I8 | I64 | 2^28 | 0.544 | 1.506 ms | 0.50% | 1.512 ms | 0.50% | 5.677 us | 0.38% | PASS |
I8 | I64 | 2^16 | 0 | 8.488 us | 5.48% | 8.546 us | 5.74% | 0.058 us | 0.69% | PASS |
I8 | I64 | 2^20 | 0 | 13.986 us | 3.81% | 14.147 us | 3.47% | 0.161 us | 1.15% | PASS |
I8 | I64 | 2^24 | 0 | 91.343 us | 0.59% | 92.670 us | 0.59% | 1.327 us | 1.45% | FAIL |
I8 | I64 | 2^28 | 0 | 1.308 ms | 0.50% | 1.326 ms | 0.50% | 17.942 us | 1.37% | FAIL |
I16 | I32 | 2^16 | 1 | 9.279 us | 4.37% | 8.820 us | 5.69% | -0.459 us | -4.94% | FAIL |
I16 | I32 | 2^20 | 1 | 15.774 us | 3.30% | 15.618 us | 3.03% | -0.157 us | -0.99% | PASS |
I16 | I32 | 2^24 | 1 | 122.989 us | 1.22% | 119.690 us | 1.18% | -3.299 us | -2.68% | FAIL |
I16 | I32 | 2^28 | 1 | 1.829 ms | 0.52% | 1.778 ms | 0.56% | -50.697 us | -2.77% | FAIL |
I16 | I32 | 2^16 | 0.544 | 9.018 us | 4.72% | 8.902 us | 5.44% | -0.117 us | -1.29% | PASS |
I16 | I32 | 2^20 | 0.544 | 15.707 us | 3.28% | 15.336 us | 2.72% | -0.371 us | -2.36% | PASS |
I16 | I32 | 2^24 | 0.544 | 112.747 us | 1.15% | 108.236 us | 1.21% | -4.512 us | -4.00% | FAIL |
I16 | I32 | 2^28 | 0.544 | 1.655 ms | 0.50% | 1.589 ms | 0.50% | -66.748 us | -4.03% | FAIL |
I16 | I32 | 2^16 | 0 | 8.669 us | 5.92% | 8.632 us | 5.90% | -0.036 us | -0.42% | PASS |
I16 | I32 | 2^20 | 0 | 15.141 us | 3.40% | 15.083 us | 3.30% | -0.058 us | -0.39% | PASS |
I16 | I32 | 2^24 | 0 | 92.163 us | 0.78% | 88.110 us | 0.79% | -4.053 us | -4.40% | FAIL |
I16 | I32 | 2^28 | 0 | 1.291 ms | 0.50% | 1.220 ms | 0.50% | -71.447 us | -5.53% | FAIL |
I16 | I64 | 2^16 | 1 | 9.331 us | 4.01% | 9.062 us | 4.54% | -0.269 us | -2.88% | PASS |
I16 | I64 | 2^20 | 1 | 16.097 us | 3.19% | 16.344 us | 2.91% | 0.247 us | 1.53% | PASS |
I16 | I64 | 2^24 | 1 | 125.353 us | 1.04% | 126.075 us | 1.01% | 0.722 us | 0.58% | PASS |
I16 | I64 | 2^28 | 1 | 1.863 ms | 0.50% | 1.872 ms | 0.50% | 8.985 us | 0.48% | PASS |
I16 | I64 | 2^16 | 0.544 | 8.977 us | 5.02% | 9.088 us | 4.37% | 0.111 us | 1.24% | PASS |
I16 | I64 | 2^20 | 0.544 | 16.029 us | 3.34% | 15.873 us | 3.28% | -0.156 us | -0.98% | PASS |
I16 | I64 | 2^24 | 0.544 | 115.299 us | 0.90% | 116.190 us | 0.88% | 0.891 us | 0.77% | PASS |
I16 | I64 | 2^28 | 0.544 | 1.699 ms | 0.50% | 1.713 ms | 0.50% | 13.120 us | 0.77% | FAIL |
I16 | I64 | 2^16 | 0 | 8.705 us | 5.94% | 8.801 us | 5.77% | 0.096 us | 1.10% | PASS |
I16 | I64 | 2^20 | 0 | 15.414 us | 3.20% | 15.579 us | 3.07% | 0.165 us | 1.07% | PASS |
I16 | I64 | 2^24 | 0 | 96.224 us | 0.69% | 97.288 us | 0.69% | 1.064 us | 1.11% | FAIL |
I16 | I64 | 2^28 | 0 | 1.356 ms | 0.50% | 1.371 ms | 0.50% | 15.056 us | 1.11% | FAIL |
I32 | I32 | 2^16 | 1 | 8.994 us | 4.89% | 9.424 us | 4.74% | 0.430 us | 4.78% | FAIL |
I32 | I32 | 2^20 | 1 | 18.774 us | 2.70% | 18.926 us | 2.83% | 0.152 us | 0.81% | PASS |
I32 | I32 | 2^24 | 1 | 182.142 us | 0.77% | 182.708 us | 0.80% | 0.567 us | 0.31% | PASS |
I32 | I32 | 2^28 | 1 | 2.796 ms | 0.58% | 2.800 ms | 0.58% | 4.156 us | 0.15% | PASS |
I32 | I32 | 2^16 | 0.544 | 9.109 us | 4.29% | 9.000 us | 4.99% | -0.109 us | -1.20% | PASS |
I32 | I32 | 2^20 | 0.544 | 18.963 us | 2.86% | 18.956 us | 2.84% | -0.007 us | -0.04% | PASS |
I32 | I32 | 2^24 | 0.544 | 151.466 us | 0.94% | 151.512 us | 0.96% | 0.046 us | 0.03% | PASS |
I32 | I32 | 2^28 | 0.544 | 2.268 ms | 0.50% | 2.271 ms | 0.50% | 2.676 us | 0.12% | PASS |
I32 | I32 | 2^16 | 0 | 8.807 us | 5.82% | 8.717 us | 5.94% | -0.090 us | -1.02% | PASS |
I32 | I32 | 2^20 | 0 | 18.128 us | 2.80% | 18.081 us | 2.87% | -0.046 us | -0.26% | PASS |
I32 | I32 | 2^24 | 0 | 108.646 us | 1.01% | 108.790 us | 1.03% | 0.144 us | 0.13% | PASS |
I32 | I32 | 2^28 | 0 | 1.496 ms | 1.19% | 1.497 ms | 1.19% | 1.699 us | 0.11% | PASS |
I32 | I64 | 2^16 | 1 | 9.131 us | 4.19% | 9.234 us | 3.16% | 0.103 us | 1.13% | PASS |
I32 | I64 | 2^20 | 1 | 19.115 us | 3.03% | 19.104 us | 2.75% | -0.011 us | -0.06% | PASS |
I32 | I64 | 2^24 | 1 | 184.830 us | 0.90% | 184.204 us | 0.86% | -0.626 us | -0.34% | PASS |
I32 | I64 | 2^28 | 1 | 2.833 ms | 0.59% | 2.824 ms | 0.57% | -8.893 us | -0.31% | PASS |
I32 | I64 | 2^16 | 0.544 | 9.129 us | 4.30% | 9.218 us | 3.37% | 0.089 us | 0.98% | PASS |
I32 | I64 | 2^20 | 0.544 | 19.130 us | 4.56% | 19.192 us | 2.78% | 0.062 us | 0.32% | PASS |
I32 | I64 | 2^24 | 0.544 | 154.892 us | 1.11% | 153.039 us | 0.99% | -1.852 us | -1.20% | FAIL |
I32 | I64 | 2^28 | 0.544 | 2.328 ms | 0.50% | 2.297 ms | 0.50% | -30.851 us | -1.33% | FAIL |
I32 | I64 | 2^16 | 0 | 8.831 us | 9.11% | 8.880 us | 5.58% | 0.049 us | 0.55% | PASS |
I32 | I64 | 2^20 | 0 | 18.127 us | 4.29% | 18.174 us | 2.76% | 0.046 us | 0.26% | PASS |
I32 | I64 | 2^24 | 0 | 114.131 us | 1.04% | 112.163 us | 0.98% | -1.969 us | -1.72% | FAIL |
I32 | I64 | 2^28 | 0 | 1.592 ms | 0.99% | 1.551 ms | 1.09% | -40.839 us | -2.57% | FAIL |
I64 | I32 | 2^16 | 1 | 9.868 us | 7.14% | 9.986 us | 4.67% | 0.118 us | 1.19% | PASS |
I64 | I32 | 2^20 | 1 | 29.292 us | 2.98% | 29.363 us | 2.40% | 0.071 us | 0.24% | PASS |
I64 | I32 | 2^24 | 1 | 348.962 us | 0.50% | 348.794 us | 0.50% | -0.168 us | -0.05% | PASS |
I64 | I32 | 2^28 | 1 | 5.458 ms | 0.50% | 5.456 ms | 0.50% | -1.461 us | -0.03% | PASS |
I64 | I32 | 2^16 | 0.544 | 10.521 us | 6.66% | 10.416 us | 4.33% | -0.106 us | -1.01% | PASS |
I64 | I32 | 2^20 | 0.544 | 27.335 us | 2.77% | 27.205 us | 2.22% | -0.130 us | -0.48% | PASS |
I64 | I32 | 2^24 | 0.544 | 279.669 us | 0.61% | 279.501 us | 0.60% | -0.167 us | -0.06% | PASS |
I64 | I32 | 2^28 | 0.544 | 4.311 ms | 0.50% | 4.311 ms | 0.50% | 0.102 us | 0.00% | PASS |
I64 | I32 | 2^16 | 0 | 9.751 us | 7.74% | 9.680 us | 5.36% | -0.070 us | -0.72% | PASS |
I64 | I32 | 2^20 | 0 | 26.911 us | 2.95% | 26.729 us | 2.39% | -0.182 us | -0.68% | PASS |
I64 | I32 | 2^24 | 0 | 189.068 us | 0.91% | 188.791 us | 0.86% | -0.277 us | -0.15% | PASS |
I64 | I32 | 2^28 | 0 | 2.765 ms | 0.95% | 2.763 ms | 0.96% | -1.896 us | -0.07% | PASS |
I64 | I64 | 2^16 | 1 | 10.448 us | 6.37% | 10.422 us | 4.26% | -0.026 us | -0.25% | PASS |
I64 | I64 | 2^20 | 1 | 29.609 us | 2.86% | 29.594 us | 2.39% | -0.014 us | -0.05% | PASS |
I64 | I64 | 2^24 | 1 | 350.438 us | 0.54% | 350.396 us | 0.50% | -0.041 us | -0.01% | PASS |
I64 | I64 | 2^28 | 1 | 5.473 ms | 0.50% | 5.476 ms | 0.50% | 2.974 us | 0.05% | PASS |
I64 | I64 | 2^16 | 0.544 | 10.174 us | 7.09% | 10.107 us | 4.57% | -0.066 us | -0.65% | PASS |
I64 | I64 | 2^20 | 0.544 | 27.675 us | 2.91% | 27.617 us | 2.14% | -0.059 us | -0.21% | PASS |
I64 | I64 | 2^24 | 0.544 | 281.827 us | 0.63% | 281.893 us | 0.60% | 0.067 us | 0.02% | PASS |
I64 | I64 | 2^28 | 0.544 | 4.341 ms | 0.50% | 4.343 ms | 0.50% | 1.952 us | 0.04% | PASS |
I64 | I64 | 2^16 | 0 | 10.170 us | 6.47% | 10.243 us | 4.08% | 0.074 us | 0.72% | PASS |
I64 | I64 | 2^20 | 0 | 27.191 us | 3.03% | 27.292 us | 2.39% | 0.101 us | 0.37% | PASS |
I64 | I64 | 2^24 | 0 | 192.592 us | 0.85% | 193.404 us | 0.83% | 0.812 us | 0.42% | PASS |
I64 | I64 | 2^28 | 0 | 2.823 ms | 0.91% | 2.839 ms | 0.90% | 15.763 us | 0.56% | PASS |
I128 | I32 | 2^16 | 1 | 12.180 us | 5.89% | 12.219 us | 3.21% | 0.039 us | 0.32% | PASS |
I128 | I32 | 2^20 | 1 | 39.499 us | 2.27% | 39.611 us | 1.50% | 0.112 us | 0.28% | PASS |
I128 | I32 | 2^24 | 1 | 361.077 us | 0.64% | 362.579 us | 0.61% | 1.502 us | 0.42% | PASS |
I128 | I32 | 2^28 | 1 | 5.524 ms | 0.65% | 5.550 ms | 0.64% | 25.811 us | 0.47% | PASS |
I128 | I32 | 2^16 | 0.544 | 12.135 us | 5.42% | 12.280 us | 3.33% | 0.145 us | 1.20% | PASS |
I128 | I32 | 2^20 | 0.544 | 39.472 us | 1.92% | 39.598 us | 1.51% | 0.127 us | 0.32% | PASS |
I128 | I32 | 2^24 | 0.544 | 361.046 us | 0.65% | 362.558 us | 0.60% | 1.512 us | 0.42% | PASS |
I128 | I32 | 2^28 | 0.544 | 5.524 ms | 0.65% | 5.550 ms | 0.64% | 25.909 us | 0.47% | PASS |
I128 | I32 | 2^16 | 0 | 12.166 us | 5.36% | 12.219 us | 3.33% | 0.054 us | 0.44% | PASS |
I128 | I32 | 2^20 | 0 | 39.407 us | 1.91% | 39.574 us | 1.49% | 0.167 us | 0.42% | PASS |
I128 | I32 | 2^24 | 0 | 360.967 us | 0.67% | 362.596 us | 0.59% | 1.629 us | 0.45% | PASS |
I128 | I32 | 2^28 | 0 | 5.524 ms | 0.65% | 5.550 ms | 0.64% | 25.893 us | 0.47% | PASS |
I128 | I64 | 2^16 | 1 | 11.785 us | 6.30% | 11.779 us | 4.43% | -0.006 us | -0.05% | PASS |
I128 | I64 | 2^20 | 1 | 40.619 us | 1.98% | 40.602 us | 1.62% | -0.016 us | -0.04% | PASS |
I128 | I64 | 2^24 | 1 | 401.233 us | 0.51% | 401.529 us | 0.50% | 0.296 us | 0.07% | PASS |
I128 | I64 | 2^28 | 1 | 6.193 ms | 0.50% | 6.202 ms | 0.50% | 9.428 us | 0.15% | PASS |
I128 | I64 | 2^16 | 0.544 | 11.804 us | 6.61% | 11.827 us | 4.41% | 0.024 us | 0.20% | PASS |
I128 | I64 | 2^20 | 0.544 | 40.619 us | 1.94% | 40.582 us | 1.60% | -0.037 us | -0.09% | PASS |
I128 | I64 | 2^24 | 0.544 | 401.259 us | 0.50% | 401.523 us | 0.50% | 0.264 us | 0.07% | PASS |
I128 | I64 | 2^28 | 0.544 | 6.193 ms | 0.50% | 6.202 ms | 0.50% | 9.147 us | 0.15% | PASS |
I128 | I64 | 2^16 | 0 | 11.730 us | 6.34% | 11.789 us | 4.45% | 0.059 us | 0.50% | PASS |
I128 | I64 | 2^20 | 0 | 40.562 us | 1.99% | 40.597 us | 1.61% | 0.035 us | 0.09% | PASS |
I128 | I64 | 2^24 | 0 | 401.268 us | 0.51% | 401.452 us | 0.50% | 0.184 us | 0.05% | PASS |
I128 | I64 | 2^28 | 0 | 6.193 ms | 0.50% | 6.202 ms | 0.50% | 9.374 us | 0.15% | PASS |
F32 | I32 | 2^16 | 1 | 9.134 us | 7.85% | 9.137 us | 4.31% | 0.003 us | 0.03% | PASS |
F32 | I32 | 2^20 | 1 | 18.950 us | 3.76% | 18.986 us | 2.81% | 0.036 us | 0.19% | PASS |
F32 | I32 | 2^24 | 1 | 182.801 us | 0.85% | 183.047 us | 0.82% | 0.247 us | 0.13% | PASS |
F32 | I32 | 2^28 | 1 | 2.940 ms | 0.67% | 2.944 ms | 0.67% | 3.611 us | 0.12% | PASS |
F32 | I32 | 2^16 | 0.544 | 8.929 us | 7.87% | 8.858 us | 5.62% | -0.072 us | -0.80% | PASS |
F32 | I32 | 2^20 | 0.544 | 18.370 us | 4.03% | 18.467 us | 2.77% | 0.097 us | 0.53% | PASS |
F32 | I32 | 2^24 | 0.544 | 125.099 us | 1.19% | 125.454 us | 1.08% | 0.356 us | 0.28% | PASS |
F32 | I32 | 2^28 | 0.544 | 1.808 ms | 0.69% | 1.813 ms | 0.69% | 4.483 us | 0.25% | PASS |
F32 | I32 | 2^16 | 0 | 8.788 us | 8.72% | 8.752 us | 5.89% | -0.036 us | -0.41% | PASS |
F32 | I32 | 2^20 | 0 | 18.181 us | 3.77% | 18.225 us | 2.71% | 0.044 us | 0.24% | PASS |
F32 | I32 | 2^24 | 0 | 108.643 us | 1.13% | 109.101 us | 1.02% | 0.458 us | 0.42% | PASS |
F32 | I32 | 2^28 | 0 | 1.496 ms | 1.19% | 1.504 ms | 1.17% | 7.968 us | 0.53% | PASS |
F32 | I64 | 2^16 | 1 | 9.351 us | 6.98% | 9.285 us | 3.20% | -0.065 us | -0.70% | PASS |
F32 | I64 | 2^20 | 1 | 19.256 us | 4.11% | 19.305 us | 2.79% | 0.050 us | 0.26% | PASS |
F32 | I64 | 2^24 | 1 | 185.170 us | 0.97% | 184.398 us | 0.86% | -0.772 us | -0.42% | PASS |
F32 | I64 | 2^28 | 1 | 2.957 ms | 0.66% | 2.953 ms | 0.67% | -4.099 us | -0.14% | PASS |
F32 | I64 | 2^16 | 0.544 | 9.046 us | 7.47% | 9.006 us | 5.02% | -0.040 us | -0.44% | PASS |
F32 | I64 | 2^20 | 0.544 | 18.534 us | 3.57% | 18.536 us | 2.63% | 0.001 us | 0.01% | PASS |
F32 | I64 | 2^24 | 0.544 | 129.131 us | 1.12% | 128.212 us | 1.01% | -0.918 us | -0.71% | PASS |
F32 | I64 | 2^28 | 0.544 | 1.872 ms | 0.72% | 1.863 ms | 0.59% | -8.846 us | -0.47% | PASS |
F32 | I64 | 2^16 | 0 | 8.874 us | 6.76% | 8.844 us | 5.66% | -0.029 us | -0.33% | PASS |
F32 | I64 | 2^20 | 0 | 18.164 us | 2.87% | 18.125 us | 2.82% | -0.039 us | -0.22% | PASS |
F32 | I64 | 2^24 | 0 | 114.475 us | 0.96% | 112.431 us | 0.99% | -2.044 us | -1.79% | FAIL |
F32 | I64 | 2^28 | 0 | 1.596 ms | 0.99% | 1.557 ms | 1.09% | -38.802 us | -2.43% | FAIL |
F64 | I32 | 2^16 | 1 | 10.226 us | 4.18% | 10.275 us | 4.19% | 0.049 us | 0.48% | PASS |
F64 | I32 | 2^20 | 1 | 29.350 us | 2.46% | 29.400 us | 2.46% | 0.050 us | 0.17% | PASS |
F64 | I32 | 2^24 | 1 | 348.778 us | 0.50% | 349.010 us | 0.50% | 0.232 us | 0.07% | PASS |
F64 | I32 | 2^28 | 1 | 5.456 ms | 0.50% | 5.459 ms | 0.50% | 2.836 us | 0.05% | PASS |
F64 | I32 | 2^16 | 0.544 | 9.717 us | 5.36% | 9.791 us | 5.36% | 0.074 us | 0.76% | PASS |
F64 | I32 | 2^20 | 0.544 | 26.550 us | 2.29% | 26.659 us | 2.25% | 0.109 us | 0.41% | PASS |
F64 | I32 | 2^24 | 0.544 | 222.605 us | 0.72% | 222.988 us | 0.73% | 0.383 us | 0.17% | PASS |
F64 | I32 | 2^28 | 0.544 | 3.348 ms | 0.51% | 3.353 ms | 0.50% | 4.932 us | 0.15% | PASS |
F64 | I32 | 2^16 | 0 | 9.652 us | 5.30% | 9.726 us | 5.36% | 0.074 us | 0.76% | PASS |
F64 | I32 | 2^20 | 0 | 26.766 us | 2.41% | 26.859 us | 2.38% | 0.093 us | 0.35% | PASS |
F64 | I32 | 2^24 | 0 | 188.788 us | 0.87% | 189.233 us | 0.89% | 0.445 us | 0.24% | PASS |
F64 | I32 | 2^28 | 0 | 2.761 ms | 0.96% | 2.769 ms | 0.95% | 7.647 us | 0.28% | PASS |
F64 | I64 | 2^16 | 1 | 10.611 us | 4.71% | 10.536 us | 4.47% | -0.076 us | -0.71% | PASS |
F64 | I64 | 2^20 | 1 | 29.736 us | 2.44% | 29.825 us | 2.47% | 0.089 us | 0.30% | PASS |
F64 | I64 | 2^24 | 1 | 350.297 us | 0.50% | 350.369 us | 0.50% | 0.072 us | 0.02% | PASS |
F64 | I64 | 2^28 | 1 | 5.472 ms | 0.50% | 5.473 ms | 0.50% | 0.601 us | 0.01% | PASS |
F64 | I64 | 2^16 | 0.544 | 10.331 us | 6.59% | 10.223 us | 4.05% | -0.109 us | -1.05% | PASS |
F64 | I64 | 2^20 | 0.544 | 26.938 us | 2.84% | 26.847 us | 2.22% | -0.091 us | -0.34% | PASS |
F64 | I64 | 2^24 | 0.544 | 224.604 us | 0.74% | 224.608 us | 0.71% | 0.004 us | 0.00% | PASS |
F64 | I64 | 2^28 | 0.544 | 3.378 ms | 0.50% | 3.380 ms | 0.50% | 1.763 us | 0.05% | PASS |
F64 | I64 | 2^16 | 0 | 10.189 us | 7.05% | 10.105 us | 4.44% | -0.084 us | -0.82% | PASS |
F64 | I64 | 2^20 | 0 | 27.264 us | 2.89% | 27.206 us | 2.36% | -0.058 us | -0.21% | PASS |
F64 | I64 | 2^24 | 0 | 192.524 us | 0.91% | 192.890 us | 0.83% | 0.366 us | 0.19% | PASS |
F64 | I64 | 2^28 | 0 | 2.820 ms | 0.90% | 2.829 ms | 0.90% | 8.648 us | 0.31% | PASS |
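For reading the table: the `Diff`, `%Diff`, and `Status` columns appear to follow a simple rule, sketched below. This is inferred from the rows themselves rather than from the benchmark tool's documentation, so treat it as an assumption: `Diff = Cmp - Ref`, `%Diff = Diff / Ref`, and a row is marked `FAIL` when the absolute `%Diff` exceeds the smaller of the two noise values. The struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Result of comparing one reference timing against one candidate timing.
struct Comparison
{
  double diff;         // cmp - ref, in the same unit as the inputs
  double percent_diff; // 100 * diff / ref
  bool pass;           // true if |percent_diff| is within the smaller noise value
};

// Assumed rule, inferred from the table rows: a difference counts as
// significant once it exceeds min(ref_noise, cmp_noise), regardless of sign.
inline Comparison compare_timings(double ref_time, double cmp_time,
                                  double ref_noise_pct, double cmp_noise_pct)
{
  assert(ref_time > 0.0);
  const double diff         = cmp_time - ref_time;
  const double percent_diff = 100.0 * diff / ref_time;
  const double min_noise    = std::min(ref_noise_pct, cmp_noise_pct);
  return {diff, percent_diff, std::abs(percent_diff) <= min_noise};
}
```

Under this reading, `FAIL` only means "difference larger than the noise"; rows with a negative `%Diff` are speed-ups of the compared build, not regressions.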
Seeing some noticeable performance drops for:
- i64 elements with 64-bit offset type
- f64 elements with 64-bit offset type
- i128 elements with 32-bit offset type
cub.bench.select.flagged.base: signed versus unsigned offset types
[0] Tesla V100-SXM2-32GB
T{ct} | OffsetT{ct} | Elements{io} | Entropy | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
---|---|---|---|---|---|---|---|---|---|---|
I8 | I32 | 2^16 | 1 | 8.820 us | 6.44% | 8.876 us | 6.24% | 0.056 us | 0.64% | PASS |
I8 | I32 | 2^20 | 1 | 15.357 us | 2.70% | 15.417 us | 2.69% | 0.059 us | 0.39% | PASS |
I8 | I32 | 2^24 | 1 | 111.282 us | 0.93% | 111.720 us | 0.91% | 0.438 us | 0.39% | PASS |
I8 | I32 | 2^28 | 1 | 1.655 ms | 0.50% | 1.659 ms | 0.50% | 3.890 us | 0.24% | PASS |
I8 | I32 | 2^16 | 0.544 | 8.710 us | 5.91% | 8.810 us | 5.78% | 0.099 us | 1.14% | PASS |
I8 | I32 | 2^20 | 0.544 | 15.212 us | 3.09% | 15.213 us | 3.01% | 0.001 us | 0.01% | PASS |
I8 | I32 | 2^24 | 0.544 | 108.060 us | 0.98% | 108.216 us | 0.98% | 0.156 us | 0.14% | PASS |
I8 | I32 | 2^28 | 0.544 | 1.576 ms | 0.64% | 1.576 ms | 0.64% | 0.018 us | 0.00% | PASS |
I8 | I32 | 2^16 | 0 | 8.417 us | 5.08% | 8.483 us | 5.46% | 0.066 us | 0.78% | PASS |
I8 | I32 | 2^20 | 0 | 14.724 us | 3.44% | 14.796 us | 3.49% | 0.071 us | 0.48% | PASS |
I8 | I32 | 2^24 | 0 | 93.078 us | 0.74% | 92.757 us | 0.72% | -0.321 us | -0.34% | PASS |
I8 | I32 | 2^28 | 0 | 1.314 ms | 0.10% | 1.310 ms | 0.11% | -3.783 us | -0.29% | FAIL |
I8 | I64 | 2^16 | 1 | 8.813 us | 5.74% | 8.809 us | 5.79% | -0.004 us | -0.05% | PASS |
I8 | I64 | 2^20 | 1 | 16.088 us | 3.21% | 16.115 us | 3.12% | 0.026 us | 0.16% | PASS |
I8 | I64 | 2^24 | 1 | 119.316 us | 0.75% | 120.151 us | 0.73% | 0.835 us | 0.70% | PASS |
I8 | I64 | 2^28 | 1 | 1.789 ms | 0.50% | 1.801 ms | 0.50% | 11.664 us | 0.65% | FAIL |
I8 | I64 | 2^16 | 0.544 | 8.819 us | 5.72% | 8.846 us | 5.62% | 0.027 us | 0.31% | PASS |
I8 | I64 | 2^20 | 0.544 | 16.069 us | 3.18% | 16.043 us | 3.16% | -0.026 us | -0.16% | PASS |
I8 | I64 | 2^24 | 0.544 | 116.079 us | 0.84% | 117.019 us | 0.82% | 0.940 us | 0.81% | PASS |
I8 | I64 | 2^28 | 0.544 | 1.711 ms | 0.50% | 1.723 ms | 0.50% | 11.844 us | 0.69% | FAIL |
I8 | I64 | 2^16 | 0 | 8.585 us | 5.82% | 8.603 us | 5.86% | 0.018 us | 0.21% | PASS |
I8 | I64 | 2^20 | 0 | 15.388 us | 2.98% | 15.441 us | 3.01% | 0.054 us | 0.35% | PASS |
I8 | I64 | 2^24 | 0 | 101.388 us | 0.58% | 102.082 us | 0.60% | 0.694 us | 0.68% | FAIL |
I8 | I64 | 2^28 | 0 | 1.449 ms | 0.09% | 1.461 ms | 0.09% | 11.505 us | 0.79% | FAIL |
I16 | I32 | 2^16 | 1 | 8.806 us | 5.79% | 8.805 us | 5.80% | -0.001 us | -0.02% | PASS |
I16 | I32 | 2^20 | 1 | 16.388 us | 2.82% | 16.433 us | 2.96% | 0.044 us | 0.27% | PASS |
I16 | I32 | 2^24 | 1 | 135.161 us | 0.96% | 135.245 us | 0.92% | 0.084 us | 0.06% | PASS |
I16 | I32 | 2^28 | 1 | 2.039 ms | 0.50% | 2.041 ms | 0.50% | 2.161 us | 0.11% | PASS |
I16 | I32 | 2^16 | 0.544 | 8.731 us | 5.92% | 8.854 us | 5.62% | 0.123 us | 1.41% | PASS |
I16 | I32 | 2^20 | 0.544 | 16.014 us | 3.30% | 15.970 us | 3.24% | -0.044 us | -0.28% | PASS |
I16 | I32 | 2^24 | 0.544 | 130.677 us | 1.07% | 130.215 us | 1.10% | -0.461 us | -0.35% | PASS |
I16 | I32 | 2^28 | 0.544 | 1.958 ms | 0.57% | 1.952 ms | 0.56% | -6.222 us | -0.32% | PASS |
I16 | I32 | 2^16 | 0 | 8.476 us | 5.43% | 8.560 us | 5.77% | 0.085 us | 1.00% | PASS |
I16 | I32 | 2^20 | 0 | 15.868 us | 3.29% | 15.895 us | 3.30% | 0.027 us | 0.17% | PASS |
I16 | I32 | 2^24 | 0 | 101.776 us | 0.81% | 101.793 us | 0.82% | 0.017 us | 0.02% | PASS |
I16 | I32 | 2^28 | 0 | 1.422 ms | 0.14% | 1.423 ms | 0.14% | 0.381 us | 0.03% | PASS |
I16 | I64 | 2^16 | 1 | 8.847 us | 5.62% | 8.997 us | 4.93% | 0.149 us | 1.69% | PASS |
I16 | I64 | 2^20 | 1 | 17.185 us | 3.16% | 16.984 us | 3.22% | -0.201 us | -1.17% | PASS |
I16 | I64 | 2^24 | 1 | 140.330 us | 0.81% | 137.644 us | 0.80% | -2.685 us | -1.91% | FAIL |
I16 | I64 | 2^28 | 1 | 2.120 ms | 0.50% | 2.078 ms | 0.50% | -41.668 us | -1.97% | FAIL |
I16 | I64 | 2^16 | 0.544 | 8.776 us | 5.84% | 8.942 us | 5.22% | 0.167 us | 1.90% | PASS |
I16 | I64 | 2^20 | 0.544 | 16.810 us | 3.29% | 16.431 us | 2.72% | -0.379 us | -2.25% | PASS |
I16 | I64 | 2^24 | 0.544 | 136.831 us | 0.90% | 134.197 us | 0.95% | -2.634 us | -1.93% | FAIL |
I16 | I64 | 2^28 | 0.544 | 2.045 ms | 0.52% | 2.000 ms | 0.54% | -44.590 us | -2.18% | FAIL |
I16 | I64 | 2^16 | 0 | 8.637 us | 5.91% | 8.665 us | 5.96% | 0.028 us | 0.32% | PASS |
I16 | I64 | 2^20 | 0 | 16.245 us | 3.05% | 16.169 us | 3.20% | -0.075 us | -0.46% | PASS |
I16 | I64 | 2^24 | 0 | 108.601 us | 0.73% | 105.566 us | 0.77% | -3.035 us | -2.79% | FAIL |
I16 | I64 | 2^28 | 0 | 1.538 ms | 0.10% | 1.486 ms | 0.12% | -51.870 us | -3.37% | FAIL |
I32 | I32 | 2^16 | 1 | 9.143 us | 4.47% | 9.240 us | 4.38% | 0.097 us | 1.06% | PASS |
I32 | I32 | 2^20 | 1 | 20.295 us | 3.35% | 20.258 us | 3.32% | -0.037 us | -0.18% | PASS |
I32 | I32 | 2^24 | 1 | 203.477 us | 0.63% | 203.377 us | 0.61% | -0.100 us | -0.05% | PASS |
I32 | I32 | 2^28 | 1 | 3.137 ms | 0.54% | 3.138 ms | 0.54% | 1.604 us | 0.05% | PASS |
I32 | I32 | 2^16 | 0.544 | 9.208 us | 4.74% | 9.293 us | 4.82% | 0.084 us | 0.92% | PASS |
I32 | I32 | 2^20 | 0.544 | 20.517 us | 3.75% | 20.583 us | 3.82% | 0.066 us | 0.32% | PASS |
I32 | I32 | 2^24 | 0.544 | 180.354 us | 0.89% | 180.218 us | 0.89% | -0.136 us | -0.08% | PASS |
I32 | I32 | 2^28 | 0.544 | 2.736 ms | 0.50% | 2.734 ms | 0.50% | -1.667 us | -0.06% | PASS |
I32 | I32 | 2^16 | 0 | 8.902 us | 5.67% | 8.897 us | 5.65% | -0.004 us | -0.05% | PASS |
I32 | I32 | 2^20 | 0 | 19.585 us | 2.80% | 19.577 us | 2.78% | -0.007 us | -0.04% | PASS |
I32 | I32 | 2^24 | 0 | 126.788 us | 0.72% | 126.793 us | 0.71% | 0.005 us | 0.00% | PASS |
I32 | I32 | 2^28 | 0 | 1.784 ms | 0.13% | 1.785 ms | 0.13% | 0.470 us | 0.03% | PASS |
I32 | I64 | 2^16 | 1 | 9.177 us | 4.60% | 9.261 us | 4.32% | 0.084 us | 0.92% | PASS |
I32 | I64 | 2^20 | 1 | 20.962 us | 4.07% | 20.697 us | 3.31% | -0.265 us | -1.26% | PASS |
I32 | I64 | 2^24 | 1 | 205.537 us | 0.61% | 204.144 us | 0.59% | -1.393 us | -0.68% | FAIL |
I32 | I64 | 2^28 | 1 | 3.176 ms | 0.52% | 3.149 ms | 0.53% | -27.667 us | -0.87% | FAIL |
I32 | I64 | 2^16 | 0.544 | 9.108 us | 4.98% | 10.090 us | 4.55% | 0.982 us | 10.78% | FAIL |
I32 | I64 | 2^20 | 0.544 | 21.098 us | 4.47% | 21.086 us | 3.71% | -0.012 us | -0.05% | PASS |
I32 | I64 | 2^24 | 0.544 | 183.903 us | 0.86% | 181.918 us | 0.81% | -1.985 us | -1.08% | FAIL |
I32 | I64 | 2^28 | 0.544 | 2.796 ms | 0.50% | 2.762 ms | 0.50% | -34.490 us | -1.23% | FAIL |
I32 | I64 | 2^16 | 0 | 8.927 us | 5.71% | 8.970 us | 5.54% | 0.042 us | 0.47% | PASS |
I32 | I64 | 2^20 | 0 | 19.952 us | 3.08% | 19.882 us | 2.93% | -0.069 us | -0.35% | PASS |
I32 | I64 | 2^24 | 0 | 131.903 us | 0.67% | 129.572 us | 0.69% | -2.330 us | -1.77% | FAIL |
I32 | I64 | 2^28 | 0 | 1.869 ms | 0.12% | 1.829 ms | 0.11% | -39.822 us | -2.13% | FAIL |
I64 | I32 | 2^16 | 1 | 10.117 us | 5.11% | 9.943 us | 4.91% | -0.174 us | -1.72% | PASS |
I64 | I32 | 2^20 | 1 | 31.058 us | 2.49% | 31.013 us | 2.50% | -0.045 us | -0.14% | PASS |
I64 | I32 | 2^24 | 1 | 370.652 us | 0.48% | 370.409 us | 0.50% | -0.243 us | -0.07% | PASS |
I64 | I32 | 2^28 | 1 | 5.805 ms | 0.50% | 5.804 ms | 0.50% | -1.308 us | -0.02% | PASS |
I64 | I32 | 2^16 | 0.544 | 9.993 us | 4.96% | 9.940 us | 4.98% | -0.053 us | -0.53% | PASS |
I64 | I32 | 2^20 | 0.544 | 28.720 us | 2.21% | 28.642 us | 2.13% | -0.078 us | -0.27% | PASS |
I64 | I32 | 2^24 | 0.544 | 306.502 us | 0.65% | 306.278 us | 0.62% | -0.224 us | -0.07% | PASS |
I64 | I32 | 2^28 | 0.544 | 4.752 ms | 0.50% | 4.752 ms | 0.50% | 0.217 us | 0.00% | PASS |
I64 | I32 | 2^16 | 0 | 9.705 us | 5.35% | 9.727 us | 5.34% | 0.022 us | 0.23% | PASS |
I64 | I32 | 2^20 | 0 | 28.156 us | 2.17% | 28.144 us | 2.13% | -0.011 us | -0.04% | PASS |
I64 | I32 | 2^24 | 0 | 207.893 us | 0.53% | 207.850 us | 0.52% | -0.043 us | -0.02% | PASS |
I64 | I32 | 2^28 | 0 | 3.068 ms | 0.13% | 3.070 ms | 0.11% | 1.659 us | 0.05% | PASS |
I64 | I64 | 2^16 | 1 | 10.227 us | 4.04% | 10.403 us | 4.00% | 0.176 us | 1.72% | PASS |
I64 | I64 | 2^20 | 1 | 31.023 us | 2.35% | 31.654 us | 2.12% | 0.630 us | 2.03% | PASS |
I64 | I64 | 2^24 | 1 | 372.213 us | 0.48% | 379.513 us | 0.54% | 7.299 us | 1.96% | FAIL |
I64 | I64 | 2^28 | 1 | 5.828 ms | 0.50% | 5.918 ms | 0.50% | 90.110 us | 1.55% | FAIL |
I64 | I64 | 2^16 | 0.544 | 10.179 us | 4.23% | 10.324 us | 4.19% | 0.146 us | 1.43% | PASS |
I64 | I64 | 2^20 | 0.544 | 28.675 us | 2.16% | 29.474 us | 3.23% | 0.799 us | 2.79% | FAIL |
I64 | I64 | 2^24 | 0.544 | 308.244 us | 0.62% | 316.340 us | 0.71% | 8.096 us | 2.63% | FAIL |
I64 | I64 | 2^28 | 0.544 | 4.782 ms | 0.50% | 4.888 ms | 0.50% | 105.948 us | 2.22% | FAIL |
I64 | I64 | 2^16 | 0 | 10.014 us | 4.63% | 10.098 us | 6.74% | 0.084 us | 0.84% | PASS |
I64 | I64 | 2^20 | 0 | 28.304 us | 2.15% | 28.871 us | 2.62% | 0.567 us | 2.00% | PASS |
I64 | I64 | 2^24 | 0 | 209.709 us | 0.45% | 218.592 us | 0.55% | 8.883 us | 4.24% | FAIL |
I64 | I64 | 2^28 | 0 | 3.098 ms | 0.10% | 3.246 ms | 0.12% | 147.642 us | 4.77% | FAIL |
I128 | I32 | 2^16 | 1 | 12.393 us | 3.27% | 12.385 us | 5.18% | -0.009 us | -0.07% | PASS |
I128 | I32 | 2^20 | 1 | 52.571 us | 1.33% | 54.041 us | 1.58% | 1.469 us | 2.80% | FAIL |
I128 | I32 | 2^24 | 1 | 718.611 us | 0.33% | 747.554 us | 0.40% | 28.943 us | 4.03% | FAIL |
I128 | I32 | 2^28 | 1 | 11.388 ms | 0.50% | 11.864 ms | 0.50% | 475.777 us | 4.18% | FAIL |
I128 | I32 | 2^16 | 0.544 | 12.433 us | 3.81% | 12.629 us | 6.03% | 0.196 us | 1.58% | PASS |
I128 | I32 | 2^20 | 0.544 | 46.625 us | 1.79% | 47.986 us | 2.15% | 1.361 us | 2.92% | FAIL |
I128 | I32 | 2^24 | 0.544 | 581.132 us | 0.55% | 619.256 us | 0.60% | 38.124 us | 6.56% | FAIL |
I128 | I32 | 2^28 | 0.544 | 9.150 ms | 0.62% | 9.781 ms | 0.50% | 631.109 us | 6.90% | FAIL |
I128 | I32 | 2^16 | 0 | 12.118 us | 3.82% | 12.048 us | 6.14% | -0.070 us | -0.58% | PASS |
I128 | I32 | 2^20 | 0 | 41.273 us | 1.47% | 41.965 us | 1.94% | 0.692 us | 1.68% | FAIL |
I128 | I32 | 2^24 | 0 | 382.251 us | 0.32% | 428.911 us | 0.36% | 46.660 us | 12.21% | FAIL |
I128 | I32 | 2^28 | 0 | 5.846 ms | 0.09% | 6.611 ms | 0.09% | 765.571 us | 13.10% | FAIL |
I128 | I64 | 2^16 | 1 | 11.925 us | 4.32% | 12.051 us | 5.92% | 0.127 us | 1.06% | PASS |
I128 | I64 | 2^20 | 1 | 53.803 us | 1.27% | 53.840 us | 1.65% | 0.037 us | 0.07% | PASS |
I128 | I64 | 2^24 | 1 | 738.681 us | 0.40% | 738.197 us | 0.38% | -0.484 us | -0.07% | PASS |
I128 | I64 | 2^28 | 1 | 11.703 ms | 0.50% | 11.698 ms | 0.50% | -4.917 us | -0.04% | PASS |
I128 | I64 | 2^16 | 0.544 | 12.058 us | 4.33% | 12.176 us | 6.14% | 0.117 us | 0.97% | PASS |
I128 | I64 | 2^20 | 0.544 | 47.704 us | 1.75% | 47.740 us | 2.10% | 0.036 us | 0.07% | PASS |
I128 | I64 | 2^24 | 0.544 | 607.297 us | 0.64% | 606.423 us | 0.64% | -0.874 us | -0.14% | PASS |
I128 | I64 | 2^28 | 0.544 | 9.583 ms | 0.50% | 9.576 ms | 0.50% | -7.073 us | -0.07% | PASS |
I128 | I64 | 2^16 | 0 | 11.783 us | 4.46% | 11.698 us | 7.18% | -0.085 us | -0.72% | PASS |
I128 | I64 | 2^20 | 0 | 42.010 us | 1.54% | 41.931 us | 1.98% | -0.079 us | -0.19% | PASS |
I128 | I64 | 2^24 | 0 | 418.861 us | 0.33% | 418.295 us | 0.36% | -0.566 us | -0.14% | PASS |
I128 | I64 | 2^28 | 0 | 6.457 ms | 0.08% | 6.451 ms | 0.08% | -6.630 us | -0.10% | FAIL |
F32 | I32 | 2^16 | 1 | 9.203 us | 4.37% | 9.199 us | 8.07% | -0.004 us | -0.05% | PASS |
F32 | I32 | 2^20 | 1 | 20.471 us | 3.35% | 20.450 us | 4.37% | -0.021 us | -0.10% | PASS |
F32 | I32 | 2^24 | 1 | 203.478 us | 0.63% | 203.665 us | 0.75% | 0.187 us | 0.09% | PASS |
F32 | I32 | 2^28 | 1 | 3.137 ms | 0.54% | 3.138 ms | 0.54% | 1.635 us | 0.05% | PASS |
F32 | I32 | 2^16 | 0.544 | 9.167 us | 4.75% | 9.313 us | 9.39% | 0.146 us | 1.60% | PASS |
F32 | I32 | 2^20 | 0.544 | 20.520 us | 3.77% | 20.558 us | 4.95% | 0.038 us | 0.19% | PASS |
F32 | I32 | 2^24 | 0.544 | 180.348 us | 0.87% | 180.288 us | 0.94% | -0.060 us | -0.03% | PASS |
F32 | I32 | 2^28 | 0.544 | 2.736 ms | 0.50% | 2.734 ms | 0.50% | -1.678 us | -0.06% | PASS |
F32 | I32 | 2^16 | 0 | 8.871 us | 5.96% | 8.968 us | 9.46% | 0.097 us | 1.09% | PASS |
F32 | I32 | 2^20 | 0 | 19.514 us | 2.73% | 19.576 us | 4.06% | 0.061 us | 0.31% | PASS |
F32 | I32 | 2^24 | 0 | 126.795 us | 0.71% | 126.776 us | 0.93% | -0.019 us | -0.01% | PASS |
F32 | I32 | 2^28 | 0 | 1.784 ms | 0.13% | 1.785 ms | 0.13% | 0.769 us | 0.04% | PASS |
F32 | I64 | 2^16 | 1 | 9.175 us | 4.59% | 9.424 us | 8.57% | 0.249 us | 2.72% | PASS |
F32 | I64 | 2^20 | 1 | 20.943 us | 4.00% | 21.025 us | 4.57% | 0.082 us | 0.39% | PASS |
F32 | I64 | 2^24 | 1 | 205.801 us | 0.62% | 204.511 us | 0.67% | -1.290 us | -0.63% | FAIL |
F32 | I64 | 2^28 | 1 | 3.176 ms | 0.52% | 3.149 ms | 0.53% | -27.577 us | -0.87% | FAIL |
F32 | I64 | 2^16 | 0.544 | 9.157 us | 4.83% | 9.431 us | 8.67% | 0.274 us | 2.99% | PASS |
F32 | I64 | 2^20 | 0.544 | 21.073 us | 4.45% | 21.065 us | 4.87% | -0.007 us | -0.03% | PASS |
F32 | I64 | 2^24 | 0.544 | 184.207 us | 0.85% | 182.129 us | 0.97% | -2.077 us | -1.13% | FAIL |
F32 | I64 | 2^28 | 0.544 | 2.796 ms | 0.50% | 2.761 ms | 0.50% | -34.517 us | -1.23% | FAIL |
F32 | I64 | 2^16 | 0 | 8.998 us | 5.44% | 9.086 us | 9.59% | 0.087 us | 0.97% | PASS |
F32 | I64 | 2^20 | 0 | 20.073 us | 3.05% | 19.984 us | 4.29% | -0.089 us | -0.44% | PASS |
F32 | I64 | 2^24 | 0 | 131.995 us | 0.69% | 129.567 us | 0.89% | -2.428 us | -1.84% | FAIL |
F32 | I64 | 2^28 | 0 | 1.869 ms | 0.12% | 1.829 ms | 0.13% | -39.656 us | -2.12% | FAIL |
F64 | I32 | 2^16 | 1 | 10.246 us | 4.19% | 10.128 us | 7.23% | -0.118 us | -1.15% | PASS |
F64 | I32 | 2^20 | 1 | 31.172 us | 2.48% | 31.335 us | 3.38% | 0.163 us | 0.52% | PASS |
F64 | I32 | 2^24 | 1 | 370.658 us | 0.49% | 370.547 us | 0.50% | -0.111 us | -0.03% | PASS |
F64 | I32 | 2^28 | 1 | 5.805 ms | 0.50% | 5.804 ms | 0.50% | -1.364 us | -0.02% | PASS |
F64 | I32 | 2^16 | 0.544 | 10.071 us | 5.15% | 10.145 us | 7.97% | 0.074 us | 0.74% | PASS |
F64 | I32 | 2^20 | 0.544 | 28.678 us | 2.15% | 28.797 us | 3.05% | 0.119 us | 0.42% | PASS |
F64 | I32 | 2^24 | 0.544 | 306.363 us | 0.63% | 306.311 us | 0.64% | -0.053 us | -0.02% | PASS |
F64 | I32 | 2^28 | 0.544 | 4.752 ms | 0.50% | 4.752 ms | 0.50% | 0.597 us | 0.01% | PASS |
F64 | I32 | 2^16 | 0 | 9.730 us | 5.38% | 9.759 us | 8.32% | 0.029 us | 0.29% | PASS |
F64 | I32 | 2^20 | 0 | 28.192 us | 2.19% | 28.126 us | 2.93% | -0.066 us | -0.23% | PASS |
F64 | I32 | 2^24 | 0 | 207.864 us | 0.52% | 207.919 us | 0.65% | 0.056 us | 0.03% | PASS |
F64 | I32 | 2^28 | 0 | 3.068 ms | 0.14% | 3.069 ms | 0.12% | 1.558 us | 0.05% | PASS |
F64 | I64 | 2^16 | 1 | 10.393 us | 3.97% | 10.634 us | 7.92% | 0.240 us | 2.31% | PASS |
F64 | I64 | 2^20 | 1 | 31.128 us | 2.36% | 31.797 us | 2.74% | 0.669 us | 2.15% | PASS |
F64 | I64 | 2^24 | 1 | 372.297 us | 0.47% | 379.746 us | 0.61% | 7.449 us | 2.00% | FAIL |
F64 | I64 | 2^28 | 1 | 5.828 ms | 0.50% | 5.918 ms | 0.50% | 89.889 us | 1.54% | FAIL |
F64 | I64 | 2^16 | 0.544 | 10.372 us | 4.66% | 10.526 us | 7.99% | 0.154 us | 1.48% | PASS |
F64 | I64 | 2^20 | 0.544 | 28.620 us | 2.10% | 29.502 us | 3.09% | 0.882 us | 3.08% | FAIL |
F64 | I64 | 2^24 | 0.544 | 308.197 us | 0.63% | 316.426 us | 0.70% | 8.230 us | 2.67% | FAIL |
F64 | I64 | 2^28 | 0.544 | 4.782 ms | 0.50% | 4.887 ms | 0.50% | 105.565 us | 2.21% | FAIL |
F64 | I64 | 2^16 | 0 | 9.998 us | 4.70% | 10.204 us | 7.74% | 0.205 us | 2.05% | PASS |
F64 | I64 | 2^20 | 0 | 28.309 us | 2.10% | 28.931 us | 3.21% | 0.622 us | 2.20% | FAIL |
F64 | I64 | 2^24 | 0 | 209.813 us | 0.48% | 218.718 us | 0.57% | 8.905 us | 4.24% | FAIL |
F64 | I64 | 2^28 | 0 | 3.098 ms | 0.09% | 3.245 ms | 0.13% | 147.587 us | 4.76% | FAIL |
Currently blocked by https://github.com/NVIDIA/cccl/issues/1454.
It turns out that there's some performance degradation from simply moving `DeviceSelect` to use `choose_offset_t` (see results above). Similarly, there's a performance downside of as much as 50% if we were using `i32` and `i64` offset types here.
Given there's no easy choice for the offset type here, we want to revisit https://github.com/NVIDIA/cccl/issues/1454 and come to a conclusion for a broader approach of offset type handling first, before continuing on this endeavour.
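To make the offset-type question concrete, here is a minimal sketch of deriving an unsigned offset type from the type of `num_items`. The names `pick_offset_t` and `offset_bytes` are hypothetical and exist only for this illustration; the sketch assumes the selection rule is "32-bit unsigned for inputs up to 4 bytes wide, 64-bit unsigned otherwise" and is not a copy of CUB's `choose_offset_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper (not CUB's code): pick an unsigned offset type that is
// just wide enough for the caller's num_items type.
template <typename NumItemsT>
using pick_offset_t =
    typename std::conditional<sizeof(NumItemsT) <= 4, std::uint32_t, std::uint64_t>::type;

// A 32-bit num_items maps to a 32-bit unsigned offset type ...
static_assert(std::is_same<pick_offset_t<std::int32_t>, std::uint32_t>::value,
              "32-bit num_items should use a 32-bit offset type");
// ... and a 64-bit num_items maps to a 64-bit unsigned offset type.
static_assert(std::is_same<pick_offset_t<std::int64_t>, std::uint64_t>::value,
              "64-bit num_items should use a 64-bit offset type");

// Hypothetical entry point: the offset type is deduced from the argument type,
// so the width used by the kernel follows the width the caller passes in.
template <typename NumItemsT>
constexpr std::size_t offset_bytes(NumItemsT /*num_items*/)
{
  return sizeof(pick_offset_t<NumItemsT>);
}
```

With a rule like this, a caller passing a 32-bit count keeps a 32-bit offset type, while a caller passing a 64-bit count pays for the wider type, which is where the measured slowdowns for `I64` come from.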
We have some tickets potentially related to this in PyTorch, such as: https://github.com/pytorch/pytorch/issues/51871
Do you have an ETA for this?
Hey @bhack, this is something we're actively working on. Are there other specific algorithms that you're interested in?
Personally, I hit this specific one through the underlying `nonzero` implementation, as have other PyTorch users/developers.
I don't know if @ezyang has a more complete overview of other related priorities in the PyTorch context.
For `nonzero`, as `sum` seems already covered in the table at https://github.com/NVIDIA/cccl/issues/50#issuecomment-1956325564, `cub::DeviceSelect::Flagged` is the only one still needed for large `N`:
```cpp
cub::DeviceSelect::Flagged(nullptr, temp_storage_bytes, counting_itr, itr,
    out_temp.mutable_data_ptr<int64_t>(), (int*)num_nonzeros.get(), N, stream);
temp_storage = allocator.allocate(temp_storage_bytes);
cub::DeviceSelect::Flagged(temp_storage.get(), temp_storage_bytes, counting_itr, itr,
    out_temp.mutable_data_ptr<int64_t>(), (int*)num_nonzeros.get(), N, stream);
```
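Until large `N` is supported, a caller can at least detect the problematic case before launching. The following is a minimal sketch, assuming (as this thread implies) that the `num_items` parameter is a 32-bit `int`; the helper name `fits_in_int` is hypothetical and not part of CUB or PyTorch.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: returns true if a 64-bit item count can be passed
// as an `int` num_items argument without being truncated.
inline bool fits_in_int(std::int64_t n)
{
  return n >= 0 && n <= static_cast<std::int64_t>(std::numeric_limits<int>::max());
}
```

A caller would check `fits_in_int(N)` and fall back to an error or a chunked code path when it returns `false`, rather than silently truncating `N`.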
@elstehle We are having another problem related to this with the just-released (but popular) model by Meta, SAM2: https://github.com/facebookresearch/segment-anything-2/issues/44
Any progress on this?
Basically, PyTorch's `nonzero` ops rely on this.
Thank you for letting us know that this came up again in another, very recent model, @bhack. We understand that this is of great importance to the community.
Unfortunately, there's no straightforward solution that would not see significant slowdowns (in some cases 50% performance drops) when moving from 32-bit to 64-bit offset types. We are currently investigating options that mitigate performance drops when using 64-bit offset types. One such option is tracked here: https://github.com/NVIDIA/cccl/issues/2136
Hi @elstehle
I noticed `num_items` in `cub::DeviceSelect::UniqueByKey` is a template parameter so the API can handle inputs larger than `INT_MAX`. Could we do something similar to `cub::DeviceSelect::If` as well?
This will unblock https://github.com/NVIDIA/cuCollections/issues/576 and https://github.com/rapidsai/cudf/issues/16526.
> I noticed `num_items` in `cub::DeviceSelect::UniqueByKey` is a template parameter so the API can handle inputs larger than `INT_MAX`. Could we do something similar to `cub::DeviceSelect::If` as well?
In theory, yes, we could just make `num_items` a template parameter, but the algorithm's performance is very susceptible to changes in the offset type used in the kernel template instantiation. We see a worst-case slowdown of `2.7x` when simply switching to 64-bit offset types (see benchmark data).
So, we're trying various ways to mitigate these performance drops that come from using a wider offset type.
With a more sophisticated approach, we were able to mitigate this slowdown to only `1.3x`. However, this is still more than we would like to tolerate, if possible.
We will likely pursue a streaming approach for `DeviceSelect` and `DevicePartition`, similar to this experimental PR here that showed some promising results of a worst-case slowdown of only `4%` for `2^24` items and more, compared to `main`.
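To illustrate the general idea behind a streaming approach, here is a host-side C++ sketch. It is not the experimental PR's implementation and not CUB code: the function name `chunked_select` and its `max_chunk` parameter are hypothetical. It assumes the input is split into chunks small enough for 32-bit offsets, while only the running output position is tracked in 64 bits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Host-side illustration only. Copies every item satisfying `pred` to `out`,
// processing the input in chunks of at most `max_chunk` items.
// `max_chunk` must be positive and must not exceed INT32_MAX.
template <typename T, typename Pred>
std::int64_t chunked_select(
  const std::vector<T>& in,
  std::vector<T>& out,
  Pred pred,
  std::int64_t max_chunk = std::numeric_limits<std::int32_t>::max())
{
  const std::int64_t num_items = static_cast<std::int64_t>(in.size());
  out.clear();
  std::int64_t num_selected = 0; // 64-bit running total across all chunks

  for (std::int64_t base = 0; base < num_items; base += max_chunk)
  {
    // The per-chunk size always fits in 32 bits.
    const std::int32_t chunk =
      static_cast<std::int32_t>(std::min(max_chunk, num_items - base));

    // Inside a chunk, only 32-bit offsets are used.
    for (std::int32_t i = 0; i < chunk; ++i)
    {
      const T& item = in[static_cast<std::size_t>(base + i)];
      if (pred(item))
      {
        out.push_back(item);
        ++num_selected;
      }
    }
  }
  return num_selected;
}
```

The design point is that the per-chunk work keeps the narrow offset type that performs well, and the 64-bit arithmetic is confined to the chunk base and the running count of selected items.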
> We will likely pursue a streaming approach for `DeviceSelect` and `DevicePartition`, similar to this experimental PR here that showed some promising results of a worst-case slowdown of only `4%` for `2^24` items and more, compared to `main`.
I like the streaming idea. The performance degradation with small inputs is IMO negligible since the overall runtime is no more than one millisecond. Thanks for the great work!