cuCollections
Use invoke_one when possible
This PR updates the new open addressing implementations to use `cg::invoke_one` when possible.
It doesn't change legacy implementations like multimap or dynamic map, etc.
Looks good. Shall I run some benchmarks comparing perf on H100?
Please do. I expect a maximum difference of 0.5% to 1%.
🙅 Something's not right:
nvbench_compare.py dev_baseline.json invoke_one.json
['dev_baseline.json', 'invoke_one.json']
# static_set_contains_unique_occupancy
## [0] NVIDIA H100 PCIe
| Key | Distribution | Occupancy | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|-------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 | UNIQUE | 0.1 | 6.664 ms | 0.02% | 6.861 ms | 0.01% | 197.133 us | 2.96% | FAIL |
| I32 | UNIQUE | 0.2 | 6.654 ms | 0.01% | 7.008 ms | 1.45% | 354.352 us | 5.33% | FAIL |
| I32 | UNIQUE | 0.3 | 6.712 ms | 0.01% | 7.108 ms | 0.93% | 395.841 us | 5.90% | FAIL |
| I32 | UNIQUE | 0.4 | 6.900 ms | 0.22% | 7.386 ms | 1.29% | 486.485 us | 7.05% | FAIL |
| I32 | UNIQUE | 0.5 | 7.246 ms | 0.30% | 7.728 ms | 0.76% | 482.108 us | 6.65% | FAIL |
| I32 | UNIQUE | 0.6 | 7.842 ms | 0.36% | 8.487 ms | 1.21% | 644.551 us | 8.22% | FAIL |
| I32 | UNIQUE | 0.7 | 8.845 ms | 0.32% | 9.441 ms | 0.65% | 596.367 us | 6.74% | FAIL |
| I32 | UNIQUE | 0.8 | 10.837 ms | 0.24% | 11.674 ms | 0.91% | 837.088 us | 7.72% | FAIL |
| I32 | UNIQUE | 0.9 | 16.634 ms | 0.02% | 17.461 ms | 0.78% | 826.831 us | 4.97% | FAIL |
| I64 | UNIQUE | 0.1 | 7.186 ms | 0.01% | 7.452 ms | 0.57% | 266.888 us | 3.71% | FAIL |
| I64 | UNIQUE | 0.2 | 7.276 ms | 2.25% | 7.549 ms | 1.54% | 273.354 us | 3.76% | FAIL |
| I64 | UNIQUE | 0.3 | 7.271 ms | 0.11% | 7.614 ms | 0.81% | 343.303 us | 4.72% | FAIL |
| I64 | UNIQUE | 0.4 | 7.470 ms | 0.30% | 7.866 ms | 1.05% | 396.046 us | 5.30% | FAIL |
| I64 | UNIQUE | 0.5 | 7.832 ms | 0.36% | 8.298 ms | 0.94% | 465.416 us | 5.94% | FAIL |
| I64 | UNIQUE | 0.6 | 8.467 ms | 0.40% | 9.011 ms | 1.13% | 544.171 us | 6.43% | FAIL |
| I64 | UNIQUE | 0.7 | 9.544 ms | 0.27% | 10.124 ms | 0.79% | 579.527 us | 6.07% | FAIL |
| I64 | UNIQUE | 0.8 | 11.669 ms | 0.02% | 12.375 ms | 0.90% | 706.240 us | 6.05% | FAIL |
| I64 | UNIQUE | 0.9 | 17.891 ms | 0.01% | 18.606 ms | 0.52% | 714.458 us | 3.99% | FAIL |
# static_set_contains_unique_matching_rate
## [0] NVIDIA H100 PCIe
| Key | Distribution | MatchingRate | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 | UNIQUE | 0.1 | 7.620 ms | 0.01% | 8.306 ms | 1.20% | 685.944 us | 9.00% | FAIL |
| I32 | UNIQUE | 0.2 | 7.602 ms | 0.65% | 8.172 ms | 0.94% | 569.350 us | 7.49% | FAIL |
| I32 | UNIQUE | 0.3 | 7.451 ms | 0.30% | 8.103 ms | 1.08% | 652.094 us | 8.75% | FAIL |
| I32 | UNIQUE | 0.4 | 7.398 ms | 0.59% | 7.958 ms | 1.08% | 560.744 us | 7.58% | FAIL |
| I32 | UNIQUE | 0.5 | 7.257 ms | 0.44% | 7.833 ms | 0.90% | 576.493 us | 7.94% | FAIL |
| I32 | UNIQUE | 0.6 | 7.173 ms | 0.49% | 7.678 ms | 1.04% | 504.596 us | 7.03% | FAIL |
| I32 | UNIQUE | 0.7 | 7.041 ms | 0.40% | 7.538 ms | 0.85% | 497.576 us | 7.07% | FAIL |
| I32 | UNIQUE | 0.8 | 6.933 ms | 0.37% | 7.373 ms | 0.92% | 439.525 us | 6.34% | FAIL |
| I32 | UNIQUE | 0.9 | 6.793 ms | 0.24% | 7.176 ms | 0.74% | 383.007 us | 5.64% | FAIL |
| I32 | UNIQUE | 1 | 6.671 ms | 0.16% | 6.982 ms | 0.70% | 310.772 us | 4.66% | FAIL |
| I64 | UNIQUE | 0.1 | 8.265 ms | 0.57% | 8.853 ms | 1.08% | 587.467 us | 7.11% | FAIL |
| I64 | UNIQUE | 0.2 | 8.174 ms | 0.51% | 8.752 ms | 1.08% | 578.102 us | 7.07% | FAIL |
| I64 | UNIQUE | 0.3 | 8.068 ms | 0.54% | 8.648 ms | 1.06% | 580.303 us | 7.19% | FAIL |
| I64 | UNIQUE | 0.4 | 7.967 ms | 0.45% | 8.502 ms | 1.16% | 534.691 us | 6.71% | FAIL |
| I64 | UNIQUE | 0.5 | 7.852 ms | 0.43% | 8.377 ms | 1.10% | 525.209 us | 6.69% | FAIL |
| I64 | UNIQUE | 0.6 | 7.748 ms | 0.38% | 8.231 ms | 1.23% | 482.689 us | 6.23% | FAIL |
| I64 | UNIQUE | 0.7 | 7.630 ms | 0.34% | 8.055 ms | 0.97% | 424.666 us | 5.57% | FAIL |
| I64 | UNIQUE | 0.8 | 7.513 ms | 0.27% | 7.897 ms | 0.97% | 383.988 us | 5.11% | FAIL |
| I64 | UNIQUE | 0.9 | 7.393 ms | 0.20% | 7.691 ms | 0.69% | 297.481 us | 4.02% | FAIL |
| I64 | UNIQUE | 1 | 7.280 ms | 0.11% | 7.502 ms | 0.61% | 222.233 us | 3.05% | FAIL |
# static_set_constains_unique_capacity
## [0] NVIDIA H100 PCIe
| Key | Distribution | NumInputs | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|-------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 | UNIQUE | 8000 | 14.185 us | 3.03% | 14.986 us | 2.69% | 0.801 us | 5.65% | FAIL |
| I32 | UNIQUE | 80000 | 17.819 us | 2.32% | 18.827 us | 2.12% | 1.008 us | 5.66% | FAIL |
| I32 | UNIQUE | 800000 | 55.041 us | 0.78% | 60.526 us | 0.67% | 5.485 us | 9.97% | FAIL |
| I32 | UNIQUE | 8000000 | 507.201 us | 0.15% | 577.336 us | 2.67% | 70.135 us | 13.83% | FAIL |
| I32 | UNIQUE | 80000000 | 5.816 ms | 0.50% | 7.297 ms | 9.96% | 1.481 ms | 25.46% | FAIL |
| I64 | UNIQUE | 8000 | 13.554 us | 2.92% | 15.237 us | 5.60% | 1.684 us | 12.42% | FAIL |
| I64 | UNIQUE | 80000 | 18.511 us | 2.28% | 19.172 us | 2.08% | 0.662 us | 3.57% | FAIL |
| I64 | UNIQUE | 800000 | 60.189 us | 0.79% | 70.294 us | 2.65% | 10.105 us | 16.79% | FAIL |
| I64 | UNIQUE | 8000000 | 582.640 us | 0.08% | 734.002 us | 1.92% | 151.362 us | 25.98% | FAIL |
| I64 | UNIQUE | 80000000 | 6.292 ms | 0.50% | 9.601 ms | 4.59% | 3.309 ms | 52.59% | FAIL |
# static_set_find_unique_occupancy
## [0] NVIDIA H100 PCIe
| Key | Distribution | Occupancy | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|-------------|------------|-------------|------------|-------------|----------|---------|----------|
| I32 | UNIQUE | 0.1 | 7.095 ms | 0.09% | 11.024 ms | 2.16% | 3.929 ms | 55.38% | FAIL |
| I32 | UNIQUE | 0.2 | 7.133 ms | 0.69% | 11.091 ms | 2.76% | 3.958 ms | 55.49% | FAIL |
| I32 | UNIQUE | 0.3 | 7.242 ms | 0.45% | 11.406 ms | 2.24% | 4.164 ms | 57.50% | FAIL |
| I32 | UNIQUE | 0.4 | 7.506 ms | 0.68% | 11.936 ms | 2.21% | 4.430 ms | 59.02% | FAIL |
| I32 | UNIQUE | 0.5 | 7.914 ms | 0.48% | 12.732 ms | 2.25% | 4.818 ms | 60.89% | FAIL |
| I32 | UNIQUE | 0.6 | 8.584 ms | 0.59% | 13.417 ms | 1.40% | 4.833 ms | 56.30% | FAIL |
| I32 | UNIQUE | 0.7 | 9.676 ms | 0.49% | 15.218 ms | 1.60% | 5.543 ms | 57.29% | FAIL |
| I32 | UNIQUE | 0.8 | 11.817 ms | 0.40% | 18.288 ms | 0.91% | 6.471 ms | 54.76% | FAIL |
| I32 | UNIQUE | 0.9 | 17.999 ms | 0.10% | 27.307 ms | 0.05% | 9.308 ms | 51.71% | FAIL |
| I64 | UNIQUE | 0.1 | 7.741 ms | 0.31% | 12.284 ms | 3.79% | 4.543 ms | 58.68% | FAIL |
| I64 | UNIQUE | 0.2 | 7.825 ms | 1.44% | 12.573 ms | 3.70% | 4.748 ms | 60.68% | FAIL |
| I64 | UNIQUE | 0.3 | 7.890 ms | 0.70% | 12.697 ms | 4.35% | 4.807 ms | 60.93% | FAIL |
| I64 | UNIQUE | 0.4 | 8.166 ms | 0.85% | 13.026 ms | 3.04% | 4.860 ms | 59.51% | FAIL |
| I64 | UNIQUE | 0.5 | 8.676 ms | 0.80% | 13.747 ms | 2.64% | 5.071 ms | 58.44% | FAIL |
| I64 | UNIQUE | 0.6 | 9.442 ms | 0.66% | 14.046 ms | 1.22% | 4.604 ms | 48.76% | FAIL |
| I64 | UNIQUE | 0.7 | 10.714 ms | 0.58% | 15.509 ms | 1.53% | 4.795 ms | 44.75% | FAIL |
| I64 | UNIQUE | 0.8 | 13.057 ms | 0.23% | 18.114 ms | 1.41% | 5.057 ms | 38.73% | FAIL |
| I64 | UNIQUE | 0.9 | 19.831 ms | 0.13% | 26.428 ms | 0.22% | 6.597 ms | 33.27% | FAIL |
# static_set_find_unique_matching_rate
## [0] NVIDIA H100 PCIe
| Key | Distribution | MatchingRate | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|----------------|------------|-------------|------------|-------------|----------|---------|----------|
| I32 | UNIQUE | 0.1 | 8.227 ms | 0.71% | 11.612 ms | 1.30% | 3.385 ms | 41.14% | FAIL |
| I32 | UNIQUE | 0.2 | 8.210 ms | 0.40% | 11.556 ms | 0.78% | 3.346 ms | 40.75% | FAIL |
| I32 | UNIQUE | 0.3 | 8.116 ms | 0.74% | 11.663 ms | 1.44% | 3.546 ms | 43.69% | FAIL |
| I32 | UNIQUE | 0.4 | 8.051 ms | 0.48% | 11.506 ms | 1.54% | 3.456 ms | 42.92% | FAIL |
| I32 | UNIQUE | 0.5 | 7.948 ms | 0.65% | 11.341 ms | 1.17% | 3.393 ms | 42.69% | FAIL |
| I32 | UNIQUE | 0.6 | 7.855 ms | 0.51% | 11.233 ms | 1.17% | 3.378 ms | 43.00% | FAIL |
| I32 | UNIQUE | 0.7 | 7.741 ms | 0.70% | 11.007 ms | 1.63% | 3.266 ms | 42.19% | FAIL |
| I32 | UNIQUE | 0.8 | 7.625 ms | 0.65% | 11.398 ms | 2.33% | 3.774 ms | 49.49% | FAIL |
| I32 | UNIQUE | 0.9 | 7.466 ms | 0.71% | 11.046 ms | 2.49% | 3.580 ms | 47.95% | FAIL |
| I32 | UNIQUE | 1 | 7.309 ms | 0.70% | 10.643 ms | 1.80% | 3.334 ms | 45.61% | FAIL |
| I64 | UNIQUE | 0.1 | 9.057 ms | 0.78% | 12.964 ms | 1.37% | 3.907 ms | 43.14% | FAIL |
| I64 | UNIQUE | 0.2 | 9.011 ms | 0.86% | 12.940 ms | 1.64% | 3.929 ms | 43.60% | FAIL |
| I64 | UNIQUE | 0.3 | 8.932 ms | 0.87% | 12.810 ms | 1.75% | 3.878 ms | 43.41% | FAIL |
| I64 | UNIQUE | 0.4 | 8.845 ms | 0.88% | 12.661 ms | 1.58% | 3.816 ms | 43.14% | FAIL |
| I64 | UNIQUE | 0.5 | 8.734 ms | 0.94% | 12.754 ms | 2.01% | 4.020 ms | 46.03% | FAIL |
| I64 | UNIQUE | 0.6 | 8.602 ms | 0.88% | 12.500 ms | 2.29% | 3.898 ms | 45.31% | FAIL |
| I64 | UNIQUE | 0.7 | 8.469 ms | 0.91% | 12.383 ms | 2.46% | 3.914 ms | 46.21% | FAIL |
| I64 | UNIQUE | 0.8 | 8.311 ms | 0.68% | 12.076 ms | 2.42% | 3.765 ms | 45.30% | FAIL |
| I64 | UNIQUE | 0.9 | 8.137 ms | 0.62% | 12.083 ms | 5.55% | 3.946 ms | 48.49% | FAIL |
| I64 | UNIQUE | 1 | 7.972 ms | 0.57% | 12.332 ms | 3.25% | 4.361 ms | 54.70% | FAIL |
# static_set_find_unique_capacity
## [0] NVIDIA H100 PCIe
| Key | Distribution | NumInputs | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|-------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 | UNIQUE | 8000 | 14.053 us | 3.02% | 14.828 us | 6.93% | 0.775 us | 5.51% | FAIL |
| I32 | UNIQUE | 80000 | 18.155 us | 2.32% | 19.197 us | 1.94% | 1.041 us | 5.74% | FAIL |
| I32 | UNIQUE | 800000 | 60.050 us | 0.74% | 70.011 us | 2.03% | 9.961 us | 16.59% | FAIL |
| I32 | UNIQUE | 8000000 | 554.572 us | 0.18% | 757.652 us | 1.95% | 203.080 us | 36.62% | FAIL |
| I32 | UNIQUE | 80000000 | 6.406 ms | 0.64% | 8.986 ms | 4.27% | 2.580 ms | 40.27% | FAIL |
| I64 | UNIQUE | 8000 | 14.037 us | 2.43% | 15.018 us | 5.74% | 0.981 us | 6.99% | FAIL |
| I64 | UNIQUE | 80000 | 19.135 us | 2.04% | 20.040 us | 2.08% | 0.905 us | 4.73% | FAIL |
| I64 | UNIQUE | 800000 | 64.984 us | 0.66% | 80.014 us | 3.08% | 15.030 us | 23.13% | FAIL |
| I64 | UNIQUE | 8000000 | 656.400 us | 0.16% | 864.700 us | 1.42% | 208.299 us | 31.73% | FAIL |
| I64 | UNIQUE | 80000000 | 7.084 ms | 0.69% | 11.365 ms | 5.43% | 4.281 ms | 60.44% | FAIL |
# static_set_insert_uniform_multiplicity
## [0] NVIDIA H100 PCIe
| Key | Distribution | Multiplicity | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|----------------|------------|-------------|------------|-------------|----------|---------|----------|
| I32 | UNIFORM | 1 | 14.208 ms | 0.08% | 17.293 ms | 2.01% | 3.086 ms | 21.72% | FAIL |
| I32 | UNIFORM | 2 | 11.030 ms | 0.04% | 14.260 ms | 1.02% | 3.230 ms | 29.28% | FAIL |
| I32 | UNIFORM | 4 | 9.134 ms | 0.07% | 12.766 ms | 0.86% | 3.632 ms | 39.76% | FAIL |
| I32 | UNIFORM | 8 | 8.527 ms | 0.22% | 11.326 ms | 0.59% | 2.799 ms | 32.82% | FAIL |
| I32 | UNIFORM | 16 | 7.808 ms | 0.13% | 10.478 ms | 1.69% | 2.669 ms | 34.18% | FAIL |
| I64 | UNIFORM | 1 | 15.489 ms | 0.05% | 17.663 ms | 0.37% | 2.175 ms | 14.04% | FAIL |
| I64 | UNIFORM | 2 | 12.099 ms | 0.06% | 14.200 ms | 0.44% | 2.101 ms | 17.37% | FAIL |
| I64 | UNIFORM | 4 | 9.933 ms | 0.13% | 12.724 ms | 1.53% | 2.792 ms | 28.11% | FAIL |
| I64 | UNIFORM | 8 | 9.158 ms | 0.13% | 12.108 ms | 0.96% | 2.950 ms | 32.22% | FAIL |
| I64 | UNIFORM | 16 | 8.635 ms | 0.16% | 11.404 ms | 1.11% | 2.768 ms | 32.06% | FAIL |
# static_set_insert_unique_occupancy
## [0] NVIDIA H100 PCIe
| Key | Distribution | Occupancy | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|-------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 | UNIQUE | 0.1 | 16.724 ms | 0.06% | 18.636 ms | 0.80% | 1.912 ms | 11.43% | FAIL |
| I32 | UNIQUE | 0.2 | 15.823 ms | 0.04% | 17.637 ms | 0.51% | 1.814 ms | 11.47% | FAIL |
| I32 | UNIQUE | 0.3 | 15.474 ms | 0.08% | 17.114 ms | 0.12% | 1.639 ms | 10.59% | FAIL |
| I32 | UNIQUE | 0.4 | 15.284 ms | 0.04% | 16.717 ms | 0.44% | 1.433 ms | 9.38% | FAIL |
| I32 | UNIQUE | 0.5 | 15.224 ms | 0.05% | 16.751 ms | 0.27% | 1.527 ms | 10.03% | FAIL |
| I32 | UNIQUE | 0.6 | 15.311 ms | 0.04% | 17.107 ms | 0.22% | 1.797 ms | 11.73% | FAIL |
| I32 | UNIQUE | 0.7 | 15.589 ms | 0.04% | 17.884 ms | 0.31% | 2.295 ms | 14.72% | FAIL |
| I32 | UNIQUE | 0.8 | 16.241 ms | 0.26% | 19.113 ms | 0.38% | 2.872 ms | 17.68% | FAIL |
| I32 | UNIQUE | 0.9 | 17.800 ms | 0.13% | 21.516 ms | 0.61% | 3.716 ms | 20.87% | FAIL |
| I64 | UNIQUE | 0.1 | 16.292 ms | 1.99% | 16.504 ms | 2.01% | 212.216 us | 1.30% | PASS |
| I64 | UNIQUE | 0.2 | 17.457 ms | 8.43% | 17.531 ms | 8.31% | 73.924 us | 0.42% | PASS |
| I64 | UNIQUE | 0.3 | 16.937 ms | 0.06% | 17.048 ms | 0.23% | 110.982 us | 0.66% | FAIL |
| I64 | UNIQUE | 0.4 | 16.684 ms | 0.03% | 16.928 ms | 0.37% | 243.595 us | 1.46% | FAIL |
| I64 | UNIQUE | 0.5 | 16.596 ms | 0.03% | 17.089 ms | 0.54% | 493.461 us | 2.97% | FAIL |
| I64 | UNIQUE | 0.6 | 16.674 ms | 0.03% | 17.337 ms | 0.27% | 663.562 us | 3.98% | FAIL |
| I64 | UNIQUE | 0.7 | 16.961 ms | 0.03% | 18.082 ms | 0.74% | 1.121 ms | 6.61% | FAIL |
| I64 | UNIQUE | 0.8 | 17.626 ms | 0.11% | 19.294 ms | 0.27% | 1.668 ms | 9.46% | FAIL |
| I64 | UNIQUE | 0.9 | 19.194 ms | 0.36% | 21.339 ms | 0.43% | 2.145 ms | 11.17% | FAIL |
# static_set_insert_gaussian_skew
## [0] NVIDIA H100 PCIe
| Key | Distribution | Skew | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|--------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 | GAUSSIAN | 0.1 | 10.696 ms | 0.13% | 11.765 ms | 0.55% | 1.069 ms | 10.00% | FAIL |
| I32 | GAUSSIAN | 0.2 | 12.761 ms | 0.06% | 13.536 ms | 0.20% | 775.444 us | 6.08% | FAIL |
| I32 | GAUSSIAN | 0.3 | 13.603 ms | 0.22% | 14.444 ms | 0.79% | 840.478 us | 6.18% | FAIL |
| I32 | GAUSSIAN | 0.4 | 13.480 ms | 0.03% | 14.524 ms | 0.04% | 1.044 ms | 7.74% | FAIL |
| I32 | GAUSSIAN | 0.5 | 13.302 ms | 0.04% | 14.345 ms | 0.03% | 1.043 ms | 7.84% | FAIL |
| I32 | GAUSSIAN | 0.6 | 13.223 ms | 0.43% | 14.141 ms | 0.09% | 918.302 us | 6.94% | FAIL |
| I32 | GAUSSIAN | 0.7 | 13.228 ms | 0.06% | 14.147 ms | 0.17% | 919.431 us | 6.95% | FAIL |
| I32 | GAUSSIAN | 0.8 | 13.097 ms | 0.04% | 14.023 ms | 0.18% | 926.003 us | 7.07% | FAIL |
| I32 | GAUSSIAN | 0.9 | 12.904 ms | 0.05% | 13.951 ms | 1.25% | 1.047 ms | 8.11% | FAIL |
| I32 | GAUSSIAN | 1 | 12.836 ms | 0.04% | 14.116 ms | 0.15% | 1.280 ms | 9.97% | FAIL |
| I64 | GAUSSIAN | 0.1 | 11.884 ms | 0.70% | 13.078 ms | 1.38% | 1.194 ms | 10.05% | FAIL |
| I64 | GAUSSIAN | 0.2 | 14.225 ms | 0.25% | 15.534 ms | 1.41% | 1.309 ms | 9.20% | FAIL |
| I64 | GAUSSIAN | 0.3 | 15.099 ms | 0.14% | 16.640 ms | 0.70% | 1.542 ms | 10.21% | FAIL |
| I64 | GAUSSIAN | 0.4 | 14.926 ms | 0.15% | 16.668 ms | 0.43% | 1.743 ms | 11.68% | FAIL |
| I64 | GAUSSIAN | 0.5 | 14.725 ms | 0.11% | 16.525 ms | 0.52% | 1.800 ms | 12.22% | FAIL |
| I64 | GAUSSIAN | 0.6 | 14.518 ms | 0.07% | 16.290 ms | 0.58% | 1.772 ms | 12.21% | FAIL |
| I64 | GAUSSIAN | 0.7 | 14.504 ms | 0.15% | 16.715 ms | 1.78% | 2.211 ms | 15.24% | FAIL |
| I64 | GAUSSIAN | 0.8 | 14.392 ms | 0.31% | 16.794 ms | 1.03% | 2.402 ms | 16.69% | FAIL |
| I64 | GAUSSIAN | 0.9 | 14.222 ms | 0.13% | 16.510 ms | 1.03% | 2.288 ms | 16.09% | FAIL |
| I64 | GAUSSIAN | 1 | 14.182 ms | 0.45% | 16.510 ms | 1.00% | 2.328 ms | 16.42% | FAIL |
# static_set_retrieve_all_unique_occupancy
## [0] NVIDIA H100 PCIe
| Key | Distribution | Occupancy | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|-------------|------------|-------------|------------|-------------|-------------|---------|----------|
| I32 | UNIQUE | 0.1 | 3.934 ms | 0.95% | 4.400 ms | 1.16% | 466.340 us | 11.86% | FAIL |
| I32 | UNIQUE | 0.2 | 2.158 ms | 0.41% | 2.467 ms | 0.72% | 309.386 us | 14.34% | FAIL |
| I32 | UNIQUE | 0.3 | 1.541 ms | 0.33% | 1.772 ms | 0.95% | 231.461 us | 15.02% | FAIL |
| I32 | UNIQUE | 0.4 | 1.218 ms | 0.42% | 1.424 ms | 0.46% | 206.088 us | 16.92% | FAIL |
| I32 | UNIQUE | 0.5 | 1.045 ms | 0.63% | 1.229 ms | 0.71% | 184.511 us | 17.66% | FAIL |
| I32 | UNIQUE | 0.6 | 926.631 us | 0.58% | 1.136 ms | 1.24% | 209.750 us | 22.64% | FAIL |
| I32 | UNIQUE | 0.7 | 860.071 us | 0.76% | 1.028 ms | 1.01% | 167.593 us | 19.49% | FAIL |
| I32 | UNIQUE | 0.8 | 798.632 us | 1.00% | 944.341 us | 1.13% | 145.709 us | 18.24% | FAIL |
| I32 | UNIQUE | 0.9 | 779.046 us | 1.32% | 896.476 us | 1.97% | 117.430 us | 15.07% | FAIL |
| I64 | UNIQUE | 0.1 | 5.836 ms | 0.24% | 6.243 ms | 0.43% | 406.339 us | 6.96% | FAIL |
| I64 | UNIQUE | 0.2 | 3.262 ms | 5.29% | 3.473 ms | 5.18% | 210.734 us | 6.46% | FAIL |
| I64 | UNIQUE | 0.3 | 2.378 ms | 0.80% | 2.519 ms | 0.47% | 140.819 us | 5.92% | FAIL |
| I64 | UNIQUE | 0.4 | 2.021 ms | 1.51% | 2.058 ms | 0.58% | 37.018 us | 1.83% | FAIL |
| I64 | UNIQUE | 0.5 | 1.835 ms | 2.18% | 1.783 ms | 0.66% | -51.480 us | -2.81% | FAIL |
| I64 | UNIQUE | 0.6 | 1.689 ms | 2.23% | 1.593 ms | 0.83% | -96.192 us | -5.69% | FAIL |
| I64 | UNIQUE | 0.7 | 1.606 ms | 0.75% | 1.479 ms | 0.82% | -127.319 us | -7.93% | FAIL |
| I64 | UNIQUE | 0.8 | 1.472 ms | 0.83% | 1.377 ms | 0.93% | -94.950 us | -6.45% | FAIL |
| I64 | UNIQUE | 0.9 | 1.381 ms | 1.00% | 1.304 ms | 0.95% | -77.287 us | -5.60% | FAIL |
# static_set_size_unique_occupancy
## [0] NVIDIA H100 PCIe
| Key | Distribution | Occupancy | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|-------------|------------|-------------|------------|-------------|-------------|---------|----------|
| I32 | UNIQUE | 0.1 | 12.613 ms | 3.43% | 12.014 ms | 2.30% | -599.767 us | -4.75% | FAIL |
| I32 | UNIQUE | 0.2 | 5.748 ms | 3.63% | 5.741 ms | 2.66% | -6.662 us | -0.12% | PASS |
| I32 | UNIQUE | 0.3 | 3.567 ms | 2.40% | 3.611 ms | 2.87% | 43.668 us | 1.22% | PASS |
| I32 | UNIQUE | 0.4 | 2.594 ms | 1.03% | 2.619 ms | 0.91% | 24.665 us | 0.95% | FAIL |
| I32 | UNIQUE | 0.5 | 2.101 ms | 1.00% | 2.125 ms | 0.77% | 24.366 us | 1.16% | FAIL |
| I32 | UNIQUE | 0.6 | 1.773 ms | 1.39% | 1.825 ms | 1.15% | 52.042 us | 2.93% | FAIL |
| I32 | UNIQUE | 0.7 | 1.526 ms | 1.33% | 1.572 ms | 1.29% | 46.180 us | 3.03% | FAIL |
| I32 | UNIQUE | 0.8 | 1.342 ms | 1.15% | 1.391 ms | 1.38% | 48.933 us | 3.65% | FAIL |
| I32 | UNIQUE | 0.9 | 1.202 ms | 1.79% | 1.236 ms | 1.86% | 33.991 us | 2.83% | FAIL |
| I64 | UNIQUE | 0.1 | 12.018 ms | 0.99% | 12.427 ms | 0.91% | 409.239 us | 3.41% | FAIL |
| I64 | UNIQUE | 0.2 | 6.069 ms | 4.42% | 6.190 ms | 4.13% | 121.385 us | 2.00% | PASS |
| I64 | UNIQUE | 0.3 | 4.095 ms | 0.29% | 4.226 ms | 0.41% | 131.192 us | 3.20% | FAIL |
| I64 | UNIQUE | 0.4 | 3.114 ms | 0.78% | 3.215 ms | 1.13% | 101.358 us | 3.26% | FAIL |
| I64 | UNIQUE | 0.5 | 2.548 ms | 1.03% | 2.603 ms | 1.16% | 55.490 us | 2.18% | FAIL |
| I64 | UNIQUE | 0.6 | 2.167 ms | 1.50% | 2.194 ms | 0.91% | 26.927 us | 1.24% | FAIL |
| I64 | UNIQUE | 0.7 | 1.867 ms | 1.42% | 1.956 ms | 1.07% | 88.928 us | 4.76% | FAIL |
| I64 | UNIQUE | 0.8 | 1.648 ms | 1.26% | 1.730 ms | 1.64% | 82.023 us | 4.98% | FAIL |
| I64 | UNIQUE | 0.9 | 1.475 ms | 1.78% | 1.541 ms | 1.70% | 65.659 us | 4.45% | FAIL |
# static_set_rehash_unique_occupancy
## [0] NVIDIA H100 PCIe
| Key | Distribution | Occupancy | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|-------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 | UNIQUE | 0.1 | 6.762 ms | 0.98% | 6.864 ms | 2.11% | 102.812 us | 1.52% | FAIL |
| I32 | UNIQUE | 0.2 | 7.956 ms | 0.48% | 8.097 ms | 0.51% | 140.946 us | 1.77% | FAIL |
| I32 | UNIQUE | 0.3 | 10.891 ms | 0.92% | 11.111 ms | 1.05% | 219.960 us | 2.02% | FAIL |
| I32 | UNIQUE | 0.4 | 13.595 ms | 1.34% | 13.860 ms | 1.06% | 265.411 us | 1.95% | FAIL |
| I32 | UNIQUE | 0.5 | 15.988 ms | 1.27% | 16.596 ms | 1.47% | 608.188 us | 3.80% | FAIL |
| I32 | UNIQUE | 0.6 | 19.677 ms | 2.06% | 20.418 ms | 2.25% | 741.516 us | 3.77% | FAIL |
| I32 | UNIQUE | 0.7 | 23.233 ms | 2.42% | 24.000 ms | 2.31% | 766.825 us | 3.30% | FAIL |
| I32 | UNIQUE | 0.8 | 28.444 ms | 2.56% | 29.421 ms | 2.35% | 976.883 us | 3.43% | FAIL |
| I32 | UNIQUE | 0.9 | 36.253 ms | 2.12% | 37.452 ms | 1.95% | 1.200 ms | 3.31% | FAIL |
| I64 | UNIQUE | 0.1 | 7.466 ms | 0.30% | 7.569 ms | 0.75% | 102.416 us | 1.37% | FAIL |
| I64 | UNIQUE | 0.2 | 8.749 ms | 0.39% | 8.916 ms | 0.49% | 167.402 us | 1.91% | FAIL |
| I64 | UNIQUE | 0.3 | 11.827 ms | 0.84% | 12.086 ms | 1.25% | 259.724 us | 2.20% | FAIL |
| I64 | UNIQUE | 0.4 | 14.539 ms | 1.51% | 15.006 ms | 0.96% | 466.828 us | 3.21% | FAIL |
| I64 | UNIQUE | 0.5 | 17.157 ms | 1.61% | 17.646 ms | 1.58% | 488.846 us | 2.85% | FAIL |
| I64 | UNIQUE | 0.6 | 20.882 ms | 1.93% | 21.546 ms | 1.61% | 663.449 us | 3.18% | FAIL |
| I64 | UNIQUE | 0.7 | 24.540 ms | 1.97% | 25.327 ms | 2.28% | 787.724 us | 3.21% | FAIL |
| I64 | UNIQUE | 0.8 | 29.950 ms | 2.37% | 30.745 ms | 2.25% | 794.890 us | 2.65% | FAIL |
| I64 | UNIQUE | 0.9 | 37.942 ms | 1.81% | 38.965 ms | 2.12% | 1.022 ms | 2.69% | FAIL |
# Summary
- Total Matches: 198
- Pass (diff <= min_noise): 5
- Unknown (infinite noise): 0
- Failure (diff > min_noise): 193
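For readers puzzling over the Status column: the summary labels suggest that a row passes only when the relative time difference is no larger than the smaller of the two noise values. The sketch below is host-only C++ (it also compiles under nvcc) and assumes exactly that rule. `classify` is a hypothetical helper, not part of nvbench, and the "Unknown (infinite noise)" case is not modeled.

```cpp
#include <algorithm>
#include <cmath>
#include <string>

// Hypothetical helper mirroring the summary labels (an assumption, not nvbench code):
// PASS when |relative diff| <= min(ref_noise, cmp_noise), otherwise FAIL.
// Times may be in any consistent unit; noise values are fractions (0.01 == 1%).
std::string classify(double ref_time, double cmp_time, double ref_noise, double cmp_noise)
{
  double const frac_diff = (cmp_time - ref_time) / ref_time;
  double const min_noise = std::min(ref_noise, cmp_noise);
  return std::abs(frac_diff) <= min_noise ? "PASS" : "FAIL";
}
```

Under this rule, the I64 insert row at occupancy 0.1 (16.292 ms vs 16.504 ms, noise 1.99% and 2.01%) classifies as PASS, while a row with very low noise is flagged as FAIL even for a tiny difference.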
About 50% slower for find
:scream:
I cannot reproduce the performance regression with my local RTX8000:
# static_set_find_unique_occupancy
## [0] Quadro RTX 8000
| Key | Distribution | Occupancy | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|-------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 | UNIQUE | 0.1 | 31.941 ms | 0.28% | 31.925 ms | 0.14% | -16.024 us | -0.05% | PASS |
| I32 | UNIQUE | 0.2 | 31.934 ms | 0.01% | 31.939 ms | 0.02% | 4.440 us | 0.01% | FAIL |
| I32 | UNIQUE | 0.3 | 32.245 ms | 0.01% | 32.242 ms | 0.01% | -2.355 us | -0.01% | PASS |
| I32 | UNIQUE | 0.4 | 33.012 ms | 0.02% | 33.002 ms | 0.01% | -10.054 us | -0.03% | FAIL |
| I32 | UNIQUE | 0.5 | 34.432 ms | 0.02% | 34.449 ms | 0.01% | 16.904 us | 0.05% | FAIL |
| I32 | UNIQUE | 0.6 | 36.910 ms | 0.01% | 36.915 ms | 0.01% | 4.985 us | 0.01% | FAIL |
| I32 | UNIQUE | 0.7 | 41.218 ms | 0.01% | 41.223 ms | 0.02% | 5.061 us | 0.01% | FAIL |
| I32 | UNIQUE | 0.8 | 49.906 ms | 0.01% | 49.917 ms | 0.01% | 11.477 us | 0.02% | FAIL |
| I32 | UNIQUE | 0.9 | 75.340 ms | 0.02% | 75.365 ms | 0.04% | 24.811 us | 0.03% | FAIL |
| I64 | UNIQUE | 0.1 | 33.937 ms | 0.01% | 33.950 ms | 0.01% | 12.970 us | 0.04% | FAIL |
| I64 | UNIQUE | 0.2 | 34.052 ms | 0.02% | 34.045 ms | 0.01% | -7.124 us | -0.02% | FAIL |
| I64 | UNIQUE | 0.3 | 34.408 ms | 0.02% | 34.429 ms | 0.01% | 21.481 us | 0.06% | FAIL |
| I64 | UNIQUE | 0.4 | 35.258 ms | 0.02% | 35.275 ms | 0.02% | 17.199 us | 0.05% | FAIL |
| I64 | UNIQUE | 0.5 | 36.791 ms | 0.01% | 36.802 ms | 0.01% | 11.845 us | 0.03% | FAIL |
| I64 | UNIQUE | 0.6 | 39.389 ms | 0.02% | 39.397 ms | 0.01% | 8.795 us | 0.02% | FAIL |
| I64 | UNIQUE | 0.7 | 43.888 ms | 0.01% | 43.890 ms | 0.01% | 2.224 us | 0.01% | PASS |
| I64 | UNIQUE | 0.8 | 52.894 ms | 0.02% | 52.914 ms | 0.01% | 20.027 us | 0.04% | FAIL |
| I64 | UNIQUE | 0.9 | 79.321 ms | 0.02% | 79.380 ms | 0.04% | 59.164 us | 0.07% | FAIL |
Could this issue be H100-specific?
That's expected on a pre-sm_90 architecture. From sm_90 onward, the function uses a different code path that leverages the new `ELECT` instruction. Let me collect some profiles on H100 so we can investigate what's going on.
Update: I ran the exact same benchmarks on another H100 node (CTK 12.3) today and wasn't able to reproduce the regression I initially reported. I will run another test on a different node and report back.
@sleeepyjack Any updates on H100 perf results?
I tested another H100 HBM3 node with CTK 12.3, and the performance regression is still present, although much less pronounced (around 5-6% less throughput).
The instruction diff between the baseline and this version is straightforward. The baseline `if (tile.thread_rank() == 0) {...}` compiles to `ISETP.NE.AND P0, PT, R7, RZ, PT`, which sets `P0=true` for the leader thread. The `cg::invoke_one` version translates to:
IMAD.MOV.U32 R15, RZ, RZ, 0xf
BSSY B0, 0x7f8f8ec57d60
SHF.L.U32 R15, R15, R8, RZ
WARPSYNC.EXCLUSIVE R15
ELECT P0, URZ, ~URZ
BSYNC B0
Compared to the baseline, the new version additionally computes the mask of participating threads (`SHF`), which is then used to synchronize these threads (`WARPSYNC`) before electing a leader (`ELECT`).
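For reference, the two source-level patterns being compared can be reproduced in isolation. The following is a minimal sketch, not the cuco code: the kernel names, the tile size of 4, and the atomic counter are illustrative assumptions, and it requires a CUDA toolkit recent enough to ship `cg::invoke_one`.

```cuda
#include <cooperative_groups.h>
#include <cstdio>

namespace cg = cooperative_groups;

// Tile size 4 is an illustrative assumption, not taken from the PR.
constexpr int tile_size = 4;

// Baseline pattern: the thread with rank 0 in each tile is the fixed leader.
__global__ void count_tiles_rank0(int* counter)
{
  auto const tile = cg::tiled_partition<tile_size>(cg::this_thread_block());
  if (tile.thread_rank() == 0) { atomicAdd(counter, 1); }
}

// invoke_one pattern: the group selects one thread to run the callable.
__global__ void count_tiles_invoke_one(int* counter)
{
  auto const tile = cg::tiled_partition<tile_size>(cg::this_thread_block());
  cg::invoke_one(tile, [counter] { atomicAdd(counter, 1); });
}

int main()
{
  int* counter = nullptr;
  cudaMallocManaged(&counter, sizeof(int));

  constexpr int blocks   = 2;
  constexpr int threads  = 128;
  constexpr int expected = blocks * threads / tile_size;  // one increment per tile

  *counter = 0;
  count_tiles_rank0<<<blocks, threads>>>(counter);
  cudaDeviceSynchronize();
  int const rank0_count = *counter;

  *counter = 0;
  count_tiles_invoke_one<<<blocks, threads>>>(counter);
  cudaDeviceSynchronize();
  int const invoke_one_count = *counter;

  std::printf("rank0: %d, invoke_one: %d, expected: %d\n",
              rank0_count, invoke_one_count, expected);

  cudaFree(counter);
  return (rank0_count == expected && invoke_one_count == expected) ? 0 : 1;
}
```

Both kernels perform one increment per tile, so a small reproducer like this could be compiled for sm_90 and inspected with `cuobjdump -sass` to compare the generated instructions outside of the hash table code.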
I have some profiles which I will share via Slack since I had to use an internal toolchain.