cuCollections

Use invoke_one when possible

Open PointKernel opened this issue 11 months ago • 8 comments

This PR updates new open addressing implementations to use cg::invoke_one when possible.

It doesn't change legacy implementations such as the multimap or the dynamic map.

PointKernel avatar Mar 21 '24 21:03 PointKernel

Looks good. Shall I run some benchmarks comparing perf on H100?

Please do. I expect a maximum difference of 0.5% to 1%.

PointKernel avatar Mar 22 '24 04:03 PointKernel

🙅 Something's not right:

nvbench_compare.py dev_baseline.json invoke_one.json
['dev_baseline.json', 'invoke_one.json']
# static_set_contains_unique_occupancy

## [0] NVIDIA H100 PCIe

|  Key  |  Distribution  |  Occupancy  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|----------------|-------------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |     UNIQUE     |     0.1     |   6.664 ms |       0.02% |   6.861 ms |       0.01% | 197.133 us |   2.96% |   FAIL   |
|  I32  |     UNIQUE     |     0.2     |   6.654 ms |       0.01% |   7.008 ms |       1.45% | 354.352 us |   5.33% |   FAIL   |
|  I32  |     UNIQUE     |     0.3     |   6.712 ms |       0.01% |   7.108 ms |       0.93% | 395.841 us |   5.90% |   FAIL   |
|  I32  |     UNIQUE     |     0.4     |   6.900 ms |       0.22% |   7.386 ms |       1.29% | 486.485 us |   7.05% |   FAIL   |
|  I32  |     UNIQUE     |     0.5     |   7.246 ms |       0.30% |   7.728 ms |       0.76% | 482.108 us |   6.65% |   FAIL   |
|  I32  |     UNIQUE     |     0.6     |   7.842 ms |       0.36% |   8.487 ms |       1.21% | 644.551 us |   8.22% |   FAIL   |
|  I32  |     UNIQUE     |     0.7     |   8.845 ms |       0.32% |   9.441 ms |       0.65% | 596.367 us |   6.74% |   FAIL   |
|  I32  |     UNIQUE     |     0.8     |  10.837 ms |       0.24% |  11.674 ms |       0.91% | 837.088 us |   7.72% |   FAIL   |
|  I32  |     UNIQUE     |     0.9     |  16.634 ms |       0.02% |  17.461 ms |       0.78% | 826.831 us |   4.97% |   FAIL   |
|  I64  |     UNIQUE     |     0.1     |   7.186 ms |       0.01% |   7.452 ms |       0.57% | 266.888 us |   3.71% |   FAIL   |
|  I64  |     UNIQUE     |     0.2     |   7.276 ms |       2.25% |   7.549 ms |       1.54% | 273.354 us |   3.76% |   FAIL   |
|  I64  |     UNIQUE     |     0.3     |   7.271 ms |       0.11% |   7.614 ms |       0.81% | 343.303 us |   4.72% |   FAIL   |
|  I64  |     UNIQUE     |     0.4     |   7.470 ms |       0.30% |   7.866 ms |       1.05% | 396.046 us |   5.30% |   FAIL   |
|  I64  |     UNIQUE     |     0.5     |   7.832 ms |       0.36% |   8.298 ms |       0.94% | 465.416 us |   5.94% |   FAIL   |
|  I64  |     UNIQUE     |     0.6     |   8.467 ms |       0.40% |   9.011 ms |       1.13% | 544.171 us |   6.43% |   FAIL   |
|  I64  |     UNIQUE     |     0.7     |   9.544 ms |       0.27% |  10.124 ms |       0.79% | 579.527 us |   6.07% |   FAIL   |
|  I64  |     UNIQUE     |     0.8     |  11.669 ms |       0.02% |  12.375 ms |       0.90% | 706.240 us |   6.05% |   FAIL   |
|  I64  |     UNIQUE     |     0.9     |  17.891 ms |       0.01% |  18.606 ms |       0.52% | 714.458 us |   3.99% |   FAIL   |

# static_set_contains_unique_matching_rate

## [0] NVIDIA H100 PCIe

|  Key  |  Distribution  |  MatchingRate  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|----------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |     UNIQUE     |      0.1       |   7.620 ms |       0.01% |   8.306 ms |       1.20% | 685.944 us |   9.00% |   FAIL   |
|  I32  |     UNIQUE     |      0.2       |   7.602 ms |       0.65% |   8.172 ms |       0.94% | 569.350 us |   7.49% |   FAIL   |
|  I32  |     UNIQUE     |      0.3       |   7.451 ms |       0.30% |   8.103 ms |       1.08% | 652.094 us |   8.75% |   FAIL   |
|  I32  |     UNIQUE     |      0.4       |   7.398 ms |       0.59% |   7.958 ms |       1.08% | 560.744 us |   7.58% |   FAIL   |
|  I32  |     UNIQUE     |      0.5       |   7.257 ms |       0.44% |   7.833 ms |       0.90% | 576.493 us |   7.94% |   FAIL   |
|  I32  |     UNIQUE     |      0.6       |   7.173 ms |       0.49% |   7.678 ms |       1.04% | 504.596 us |   7.03% |   FAIL   |
|  I32  |     UNIQUE     |      0.7       |   7.041 ms |       0.40% |   7.538 ms |       0.85% | 497.576 us |   7.07% |   FAIL   |
|  I32  |     UNIQUE     |      0.8       |   6.933 ms |       0.37% |   7.373 ms |       0.92% | 439.525 us |   6.34% |   FAIL   |
|  I32  |     UNIQUE     |      0.9       |   6.793 ms |       0.24% |   7.176 ms |       0.74% | 383.007 us |   5.64% |   FAIL   |
|  I32  |     UNIQUE     |       1        |   6.671 ms |       0.16% |   6.982 ms |       0.70% | 310.772 us |   4.66% |   FAIL   |
|  I64  |     UNIQUE     |      0.1       |   8.265 ms |       0.57% |   8.853 ms |       1.08% | 587.467 us |   7.11% |   FAIL   |
|  I64  |     UNIQUE     |      0.2       |   8.174 ms |       0.51% |   8.752 ms |       1.08% | 578.102 us |   7.07% |   FAIL   |
|  I64  |     UNIQUE     |      0.3       |   8.068 ms |       0.54% |   8.648 ms |       1.06% | 580.303 us |   7.19% |   FAIL   |
|  I64  |     UNIQUE     |      0.4       |   7.967 ms |       0.45% |   8.502 ms |       1.16% | 534.691 us |   6.71% |   FAIL   |
|  I64  |     UNIQUE     |      0.5       |   7.852 ms |       0.43% |   8.377 ms |       1.10% | 525.209 us |   6.69% |   FAIL   |
|  I64  |     UNIQUE     |      0.6       |   7.748 ms |       0.38% |   8.231 ms |       1.23% | 482.689 us |   6.23% |   FAIL   |
|  I64  |     UNIQUE     |      0.7       |   7.630 ms |       0.34% |   8.055 ms |       0.97% | 424.666 us |   5.57% |   FAIL   |
|  I64  |     UNIQUE     |      0.8       |   7.513 ms |       0.27% |   7.897 ms |       0.97% | 383.988 us |   5.11% |   FAIL   |
|  I64  |     UNIQUE     |      0.9       |   7.393 ms |       0.20% |   7.691 ms |       0.69% | 297.481 us |   4.02% |   FAIL   |
|  I64  |     UNIQUE     |       1        |   7.280 ms |       0.11% |   7.502 ms |       0.61% | 222.233 us |   3.05% |   FAIL   |

# static_set_constains_unique_capacity

## [0] NVIDIA H100 PCIe

|  Key  |  Distribution  |  NumInputs  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|----------------|-------------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |     UNIQUE     |    8000     |  14.185 us |       3.03% |  14.986 us |       2.69% |   0.801 us |   5.65% |   FAIL   |
|  I32  |     UNIQUE     |    80000    |  17.819 us |       2.32% |  18.827 us |       2.12% |   1.008 us |   5.66% |   FAIL   |
|  I32  |     UNIQUE     |   800000    |  55.041 us |       0.78% |  60.526 us |       0.67% |   5.485 us |   9.97% |   FAIL   |
|  I32  |     UNIQUE     |   8000000   | 507.201 us |       0.15% | 577.336 us |       2.67% |  70.135 us |  13.83% |   FAIL   |
|  I32  |     UNIQUE     |  80000000   |   5.816 ms |       0.50% |   7.297 ms |       9.96% |   1.481 ms |  25.46% |   FAIL   |
|  I64  |     UNIQUE     |    8000     |  13.554 us |       2.92% |  15.237 us |       5.60% |   1.684 us |  12.42% |   FAIL   |
|  I64  |     UNIQUE     |    80000    |  18.511 us |       2.28% |  19.172 us |       2.08% |   0.662 us |   3.57% |   FAIL   |
|  I64  |     UNIQUE     |   800000    |  60.189 us |       0.79% |  70.294 us |       2.65% |  10.105 us |  16.79% |   FAIL   |
|  I64  |     UNIQUE     |   8000000   | 582.640 us |       0.08% | 734.002 us |       1.92% | 151.362 us |  25.98% |   FAIL   |
|  I64  |     UNIQUE     |  80000000   |   6.292 ms |       0.50% |   9.601 ms |       4.59% |   3.309 ms |  52.59% |   FAIL   |

# static_set_find_unique_occupancy

## [0] NVIDIA H100 PCIe

|  Key  |  Distribution  |  Occupancy  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |     Diff |   %Diff |  Status  |
|-------|----------------|-------------|------------|-------------|------------|-------------|----------|---------|----------|
|  I32  |     UNIQUE     |     0.1     |   7.095 ms |       0.09% |  11.024 ms |       2.16% | 3.929 ms |  55.38% |   FAIL   |
|  I32  |     UNIQUE     |     0.2     |   7.133 ms |       0.69% |  11.091 ms |       2.76% | 3.958 ms |  55.49% |   FAIL   |
|  I32  |     UNIQUE     |     0.3     |   7.242 ms |       0.45% |  11.406 ms |       2.24% | 4.164 ms |  57.50% |   FAIL   |
|  I32  |     UNIQUE     |     0.4     |   7.506 ms |       0.68% |  11.936 ms |       2.21% | 4.430 ms |  59.02% |   FAIL   |
|  I32  |     UNIQUE     |     0.5     |   7.914 ms |       0.48% |  12.732 ms |       2.25% | 4.818 ms |  60.89% |   FAIL   |
|  I32  |     UNIQUE     |     0.6     |   8.584 ms |       0.59% |  13.417 ms |       1.40% | 4.833 ms |  56.30% |   FAIL   |
|  I32  |     UNIQUE     |     0.7     |   9.676 ms |       0.49% |  15.218 ms |       1.60% | 5.543 ms |  57.29% |   FAIL   |
|  I32  |     UNIQUE     |     0.8     |  11.817 ms |       0.40% |  18.288 ms |       0.91% | 6.471 ms |  54.76% |   FAIL   |
|  I32  |     UNIQUE     |     0.9     |  17.999 ms |       0.10% |  27.307 ms |       0.05% | 9.308 ms |  51.71% |   FAIL   |
|  I64  |     UNIQUE     |     0.1     |   7.741 ms |       0.31% |  12.284 ms |       3.79% | 4.543 ms |  58.68% |   FAIL   |
|  I64  |     UNIQUE     |     0.2     |   7.825 ms |       1.44% |  12.573 ms |       3.70% | 4.748 ms |  60.68% |   FAIL   |
|  I64  |     UNIQUE     |     0.3     |   7.890 ms |       0.70% |  12.697 ms |       4.35% | 4.807 ms |  60.93% |   FAIL   |
|  I64  |     UNIQUE     |     0.4     |   8.166 ms |       0.85% |  13.026 ms |       3.04% | 4.860 ms |  59.51% |   FAIL   |
|  I64  |     UNIQUE     |     0.5     |   8.676 ms |       0.80% |  13.747 ms |       2.64% | 5.071 ms |  58.44% |   FAIL   |
|  I64  |     UNIQUE     |     0.6     |   9.442 ms |       0.66% |  14.046 ms |       1.22% | 4.604 ms |  48.76% |   FAIL   |
|  I64  |     UNIQUE     |     0.7     |  10.714 ms |       0.58% |  15.509 ms |       1.53% | 4.795 ms |  44.75% |   FAIL   |
|  I64  |     UNIQUE     |     0.8     |  13.057 ms |       0.23% |  18.114 ms |       1.41% | 5.057 ms |  38.73% |   FAIL   |
|  I64  |     UNIQUE     |     0.9     |  19.831 ms |       0.13% |  26.428 ms |       0.22% | 6.597 ms |  33.27% |   FAIL   |

# static_set_find_unique_matching_rate

## [0] NVIDIA H100 PCIe

|  Key  |  Distribution  |  MatchingRate  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |     Diff |   %Diff |  Status  |
|-------|----------------|----------------|------------|-------------|------------|-------------|----------|---------|----------|
|  I32  |     UNIQUE     |      0.1       |   8.227 ms |       0.71% |  11.612 ms |       1.30% | 3.385 ms |  41.14% |   FAIL   |
|  I32  |     UNIQUE     |      0.2       |   8.210 ms |       0.40% |  11.556 ms |       0.78% | 3.346 ms |  40.75% |   FAIL   |
|  I32  |     UNIQUE     |      0.3       |   8.116 ms |       0.74% |  11.663 ms |       1.44% | 3.546 ms |  43.69% |   FAIL   |
|  I32  |     UNIQUE     |      0.4       |   8.051 ms |       0.48% |  11.506 ms |       1.54% | 3.456 ms |  42.92% |   FAIL   |
|  I32  |     UNIQUE     |      0.5       |   7.948 ms |       0.65% |  11.341 ms |       1.17% | 3.393 ms |  42.69% |   FAIL   |
|  I32  |     UNIQUE     |      0.6       |   7.855 ms |       0.51% |  11.233 ms |       1.17% | 3.378 ms |  43.00% |   FAIL   |
|  I32  |     UNIQUE     |      0.7       |   7.741 ms |       0.70% |  11.007 ms |       1.63% | 3.266 ms |  42.19% |   FAIL   |
|  I32  |     UNIQUE     |      0.8       |   7.625 ms |       0.65% |  11.398 ms |       2.33% | 3.774 ms |  49.49% |   FAIL   |
|  I32  |     UNIQUE     |      0.9       |   7.466 ms |       0.71% |  11.046 ms |       2.49% | 3.580 ms |  47.95% |   FAIL   |
|  I32  |     UNIQUE     |       1        |   7.309 ms |       0.70% |  10.643 ms |       1.80% | 3.334 ms |  45.61% |   FAIL   |
|  I64  |     UNIQUE     |      0.1       |   9.057 ms |       0.78% |  12.964 ms |       1.37% | 3.907 ms |  43.14% |   FAIL   |
|  I64  |     UNIQUE     |      0.2       |   9.011 ms |       0.86% |  12.940 ms |       1.64% | 3.929 ms |  43.60% |   FAIL   |
|  I64  |     UNIQUE     |      0.3       |   8.932 ms |       0.87% |  12.810 ms |       1.75% | 3.878 ms |  43.41% |   FAIL   |
|  I64  |     UNIQUE     |      0.4       |   8.845 ms |       0.88% |  12.661 ms |       1.58% | 3.816 ms |  43.14% |   FAIL   |
|  I64  |     UNIQUE     |      0.5       |   8.734 ms |       0.94% |  12.754 ms |       2.01% | 4.020 ms |  46.03% |   FAIL   |
|  I64  |     UNIQUE     |      0.6       |   8.602 ms |       0.88% |  12.500 ms |       2.29% | 3.898 ms |  45.31% |   FAIL   |
|  I64  |     UNIQUE     |      0.7       |   8.469 ms |       0.91% |  12.383 ms |       2.46% | 3.914 ms |  46.21% |   FAIL   |
|  I64  |     UNIQUE     |      0.8       |   8.311 ms |       0.68% |  12.076 ms |       2.42% | 3.765 ms |  45.30% |   FAIL   |
|  I64  |     UNIQUE     |      0.9       |   8.137 ms |       0.62% |  12.083 ms |       5.55% | 3.946 ms |  48.49% |   FAIL   |
|  I64  |     UNIQUE     |       1        |   7.972 ms |       0.57% |  12.332 ms |       3.25% | 4.361 ms |  54.70% |   FAIL   |

# static_set_find_unique_capacity

## [0] NVIDIA H100 PCIe

|  Key  |  Distribution  |  NumInputs  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|----------------|-------------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |     UNIQUE     |    8000     |  14.053 us |       3.02% |  14.828 us |       6.93% |   0.775 us |   5.51% |   FAIL   |
|  I32  |     UNIQUE     |    80000    |  18.155 us |       2.32% |  19.197 us |       1.94% |   1.041 us |   5.74% |   FAIL   |
|  I32  |     UNIQUE     |   800000    |  60.050 us |       0.74% |  70.011 us |       2.03% |   9.961 us |  16.59% |   FAIL   |
|  I32  |     UNIQUE     |   8000000   | 554.572 us |       0.18% | 757.652 us |       1.95% | 203.080 us |  36.62% |   FAIL   |
|  I32  |     UNIQUE     |  80000000   |   6.406 ms |       0.64% |   8.986 ms |       4.27% |   2.580 ms |  40.27% |   FAIL   |
|  I64  |     UNIQUE     |    8000     |  14.037 us |       2.43% |  15.018 us |       5.74% |   0.981 us |   6.99% |   FAIL   |
|  I64  |     UNIQUE     |    80000    |  19.135 us |       2.04% |  20.040 us |       2.08% |   0.905 us |   4.73% |   FAIL   |
|  I64  |     UNIQUE     |   800000    |  64.984 us |       0.66% |  80.014 us |       3.08% |  15.030 us |  23.13% |   FAIL   |
|  I64  |     UNIQUE     |   8000000   | 656.400 us |       0.16% | 864.700 us |       1.42% | 208.299 us |  31.73% |   FAIL   |
|  I64  |     UNIQUE     |  80000000   |   7.084 ms |       0.69% |  11.365 ms |       5.43% |   4.281 ms |  60.44% |   FAIL   |

# static_set_insert_uniform_multiplicity

## [0] NVIDIA H100 PCIe

|  Key  |  Distribution  |  Multiplicity  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |     Diff |   %Diff |  Status  |
|-------|----------------|----------------|------------|-------------|------------|-------------|----------|---------|----------|
|  I32  |    UNIFORM     |       1        |  14.208 ms |       0.08% |  17.293 ms |       2.01% | 3.086 ms |  21.72% |   FAIL   |
|  I32  |    UNIFORM     |       2        |  11.030 ms |       0.04% |  14.260 ms |       1.02% | 3.230 ms |  29.28% |   FAIL   |
|  I32  |    UNIFORM     |       4        |   9.134 ms |       0.07% |  12.766 ms |       0.86% | 3.632 ms |  39.76% |   FAIL   |
|  I32  |    UNIFORM     |       8        |   8.527 ms |       0.22% |  11.326 ms |       0.59% | 2.799 ms |  32.82% |   FAIL   |
|  I32  |    UNIFORM     |       16       |   7.808 ms |       0.13% |  10.478 ms |       1.69% | 2.669 ms |  34.18% |   FAIL   |
|  I64  |    UNIFORM     |       1        |  15.489 ms |       0.05% |  17.663 ms |       0.37% | 2.175 ms |  14.04% |   FAIL   |
|  I64  |    UNIFORM     |       2        |  12.099 ms |       0.06% |  14.200 ms |       0.44% | 2.101 ms |  17.37% |   FAIL   |
|  I64  |    UNIFORM     |       4        |   9.933 ms |       0.13% |  12.724 ms |       1.53% | 2.792 ms |  28.11% |   FAIL   |
|  I64  |    UNIFORM     |       8        |   9.158 ms |       0.13% |  12.108 ms |       0.96% | 2.950 ms |  32.22% |   FAIL   |
|  I64  |    UNIFORM     |       16       |   8.635 ms |       0.16% |  11.404 ms |       1.11% | 2.768 ms |  32.06% |   FAIL   |

# static_set_insert_unique_occupancy

## [0] NVIDIA H100 PCIe

|  Key  |  Distribution  |  Occupancy  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|----------------|-------------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |     UNIQUE     |     0.1     |  16.724 ms |       0.06% |  18.636 ms |       0.80% |   1.912 ms |  11.43% |   FAIL   |
|  I32  |     UNIQUE     |     0.2     |  15.823 ms |       0.04% |  17.637 ms |       0.51% |   1.814 ms |  11.47% |   FAIL   |
|  I32  |     UNIQUE     |     0.3     |  15.474 ms |       0.08% |  17.114 ms |       0.12% |   1.639 ms |  10.59% |   FAIL   |
|  I32  |     UNIQUE     |     0.4     |  15.284 ms |       0.04% |  16.717 ms |       0.44% |   1.433 ms |   9.38% |   FAIL   |
|  I32  |     UNIQUE     |     0.5     |  15.224 ms |       0.05% |  16.751 ms |       0.27% |   1.527 ms |  10.03% |   FAIL   |
|  I32  |     UNIQUE     |     0.6     |  15.311 ms |       0.04% |  17.107 ms |       0.22% |   1.797 ms |  11.73% |   FAIL   |
|  I32  |     UNIQUE     |     0.7     |  15.589 ms |       0.04% |  17.884 ms |       0.31% |   2.295 ms |  14.72% |   FAIL   |
|  I32  |     UNIQUE     |     0.8     |  16.241 ms |       0.26% |  19.113 ms |       0.38% |   2.872 ms |  17.68% |   FAIL   |
|  I32  |     UNIQUE     |     0.9     |  17.800 ms |       0.13% |  21.516 ms |       0.61% |   3.716 ms |  20.87% |   FAIL   |
|  I64  |     UNIQUE     |     0.1     |  16.292 ms |       1.99% |  16.504 ms |       2.01% | 212.216 us |   1.30% |   PASS   |
|  I64  |     UNIQUE     |     0.2     |  17.457 ms |       8.43% |  17.531 ms |       8.31% |  73.924 us |   0.42% |   PASS   |
|  I64  |     UNIQUE     |     0.3     |  16.937 ms |       0.06% |  17.048 ms |       0.23% | 110.982 us |   0.66% |   FAIL   |
|  I64  |     UNIQUE     |     0.4     |  16.684 ms |       0.03% |  16.928 ms |       0.37% | 243.595 us |   1.46% |   FAIL   |
|  I64  |     UNIQUE     |     0.5     |  16.596 ms |       0.03% |  17.089 ms |       0.54% | 493.461 us |   2.97% |   FAIL   |
|  I64  |     UNIQUE     |     0.6     |  16.674 ms |       0.03% |  17.337 ms |       0.27% | 663.562 us |   3.98% |   FAIL   |
|  I64  |     UNIQUE     |     0.7     |  16.961 ms |       0.03% |  18.082 ms |       0.74% |   1.121 ms |   6.61% |   FAIL   |
|  I64  |     UNIQUE     |     0.8     |  17.626 ms |       0.11% |  19.294 ms |       0.27% |   1.668 ms |   9.46% |   FAIL   |
|  I64  |     UNIQUE     |     0.9     |  19.194 ms |       0.36% |  21.339 ms |       0.43% |   2.145 ms |  11.17% |   FAIL   |

# static_set_insert_gaussian_skew

## [0] NVIDIA H100 PCIe

|  Key  |  Distribution  |  Skew  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|----------------|--------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |    GAUSSIAN    |  0.1   |  10.696 ms |       0.13% |  11.765 ms |       0.55% |   1.069 ms |  10.00% |   FAIL   |
|  I32  |    GAUSSIAN    |  0.2   |  12.761 ms |       0.06% |  13.536 ms |       0.20% | 775.444 us |   6.08% |   FAIL   |
|  I32  |    GAUSSIAN    |  0.3   |  13.603 ms |       0.22% |  14.444 ms |       0.79% | 840.478 us |   6.18% |   FAIL   |
|  I32  |    GAUSSIAN    |  0.4   |  13.480 ms |       0.03% |  14.524 ms |       0.04% |   1.044 ms |   7.74% |   FAIL   |
|  I32  |    GAUSSIAN    |  0.5   |  13.302 ms |       0.04% |  14.345 ms |       0.03% |   1.043 ms |   7.84% |   FAIL   |
|  I32  |    GAUSSIAN    |  0.6   |  13.223 ms |       0.43% |  14.141 ms |       0.09% | 918.302 us |   6.94% |   FAIL   |
|  I32  |    GAUSSIAN    |  0.7   |  13.228 ms |       0.06% |  14.147 ms |       0.17% | 919.431 us |   6.95% |   FAIL   |
|  I32  |    GAUSSIAN    |  0.8   |  13.097 ms |       0.04% |  14.023 ms |       0.18% | 926.003 us |   7.07% |   FAIL   |
|  I32  |    GAUSSIAN    |  0.9   |  12.904 ms |       0.05% |  13.951 ms |       1.25% |   1.047 ms |   8.11% |   FAIL   |
|  I32  |    GAUSSIAN    |   1    |  12.836 ms |       0.04% |  14.116 ms |       0.15% |   1.280 ms |   9.97% |   FAIL   |
|  I64  |    GAUSSIAN    |  0.1   |  11.884 ms |       0.70% |  13.078 ms |       1.38% |   1.194 ms |  10.05% |   FAIL   |
|  I64  |    GAUSSIAN    |  0.2   |  14.225 ms |       0.25% |  15.534 ms |       1.41% |   1.309 ms |   9.20% |   FAIL   |
|  I64  |    GAUSSIAN    |  0.3   |  15.099 ms |       0.14% |  16.640 ms |       0.70% |   1.542 ms |  10.21% |   FAIL   |
|  I64  |    GAUSSIAN    |  0.4   |  14.926 ms |       0.15% |  16.668 ms |       0.43% |   1.743 ms |  11.68% |   FAIL   |
|  I64  |    GAUSSIAN    |  0.5   |  14.725 ms |       0.11% |  16.525 ms |       0.52% |   1.800 ms |  12.22% |   FAIL   |
|  I64  |    GAUSSIAN    |  0.6   |  14.518 ms |       0.07% |  16.290 ms |       0.58% |   1.772 ms |  12.21% |   FAIL   |
|  I64  |    GAUSSIAN    |  0.7   |  14.504 ms |       0.15% |  16.715 ms |       1.78% |   2.211 ms |  15.24% |   FAIL   |
|  I64  |    GAUSSIAN    |  0.8   |  14.392 ms |       0.31% |  16.794 ms |       1.03% |   2.402 ms |  16.69% |   FAIL   |
|  I64  |    GAUSSIAN    |  0.9   |  14.222 ms |       0.13% |  16.510 ms |       1.03% |   2.288 ms |  16.09% |   FAIL   |
|  I64  |    GAUSSIAN    |   1    |  14.182 ms |       0.45% |  16.510 ms |       1.00% |   2.328 ms |  16.42% |   FAIL   |

# static_set_retrieve_all_unique_occupancy

## [0] NVIDIA H100 PCIe

|  Key  |  Distribution  |  Occupancy  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|-------|----------------|-------------|------------|-------------|------------|-------------|-------------|---------|----------|
|  I32  |     UNIQUE     |     0.1     |   3.934 ms |       0.95% |   4.400 ms |       1.16% |  466.340 us |  11.86% |   FAIL   |
|  I32  |     UNIQUE     |     0.2     |   2.158 ms |       0.41% |   2.467 ms |       0.72% |  309.386 us |  14.34% |   FAIL   |
|  I32  |     UNIQUE     |     0.3     |   1.541 ms |       0.33% |   1.772 ms |       0.95% |  231.461 us |  15.02% |   FAIL   |
|  I32  |     UNIQUE     |     0.4     |   1.218 ms |       0.42% |   1.424 ms |       0.46% |  206.088 us |  16.92% |   FAIL   |
|  I32  |     UNIQUE     |     0.5     |   1.045 ms |       0.63% |   1.229 ms |       0.71% |  184.511 us |  17.66% |   FAIL   |
|  I32  |     UNIQUE     |     0.6     | 926.631 us |       0.58% |   1.136 ms |       1.24% |  209.750 us |  22.64% |   FAIL   |
|  I32  |     UNIQUE     |     0.7     | 860.071 us |       0.76% |   1.028 ms |       1.01% |  167.593 us |  19.49% |   FAIL   |
|  I32  |     UNIQUE     |     0.8     | 798.632 us |       1.00% | 944.341 us |       1.13% |  145.709 us |  18.24% |   FAIL   |
|  I32  |     UNIQUE     |     0.9     | 779.046 us |       1.32% | 896.476 us |       1.97% |  117.430 us |  15.07% |   FAIL   |
|  I64  |     UNIQUE     |     0.1     |   5.836 ms |       0.24% |   6.243 ms |       0.43% |  406.339 us |   6.96% |   FAIL   |
|  I64  |     UNIQUE     |     0.2     |   3.262 ms |       5.29% |   3.473 ms |       5.18% |  210.734 us |   6.46% |   FAIL   |
|  I64  |     UNIQUE     |     0.3     |   2.378 ms |       0.80% |   2.519 ms |       0.47% |  140.819 us |   5.92% |   FAIL   |
|  I64  |     UNIQUE     |     0.4     |   2.021 ms |       1.51% |   2.058 ms |       0.58% |   37.018 us |   1.83% |   FAIL   |
|  I64  |     UNIQUE     |     0.5     |   1.835 ms |       2.18% |   1.783 ms |       0.66% |  -51.480 us |  -2.81% |   FAIL   |
|  I64  |     UNIQUE     |     0.6     |   1.689 ms |       2.23% |   1.593 ms |       0.83% |  -96.192 us |  -5.69% |   FAIL   |
|  I64  |     UNIQUE     |     0.7     |   1.606 ms |       0.75% |   1.479 ms |       0.82% | -127.319 us |  -7.93% |   FAIL   |
|  I64  |     UNIQUE     |     0.8     |   1.472 ms |       0.83% |   1.377 ms |       0.93% |  -94.950 us |  -6.45% |   FAIL   |
|  I64  |     UNIQUE     |     0.9     |   1.381 ms |       1.00% |   1.304 ms |       0.95% |  -77.287 us |  -5.60% |   FAIL   |

# static_set_size_unique_occupancy

## [0] NVIDIA H100 PCIe

|  Key  |  Distribution  |  Occupancy  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|-------|----------------|-------------|------------|-------------|------------|-------------|-------------|---------|----------|
|  I32  |     UNIQUE     |     0.1     |  12.613 ms |       3.43% |  12.014 ms |       2.30% | -599.767 us |  -4.75% |   FAIL   |
|  I32  |     UNIQUE     |     0.2     |   5.748 ms |       3.63% |   5.741 ms |       2.66% |   -6.662 us |  -0.12% |   PASS   |
|  I32  |     UNIQUE     |     0.3     |   3.567 ms |       2.40% |   3.611 ms |       2.87% |   43.668 us |   1.22% |   PASS   |
|  I32  |     UNIQUE     |     0.4     |   2.594 ms |       1.03% |   2.619 ms |       0.91% |   24.665 us |   0.95% |   FAIL   |
|  I32  |     UNIQUE     |     0.5     |   2.101 ms |       1.00% |   2.125 ms |       0.77% |   24.366 us |   1.16% |   FAIL   |
|  I32  |     UNIQUE     |     0.6     |   1.773 ms |       1.39% |   1.825 ms |       1.15% |   52.042 us |   2.93% |   FAIL   |
|  I32  |     UNIQUE     |     0.7     |   1.526 ms |       1.33% |   1.572 ms |       1.29% |   46.180 us |   3.03% |   FAIL   |
|  I32  |     UNIQUE     |     0.8     |   1.342 ms |       1.15% |   1.391 ms |       1.38% |   48.933 us |   3.65% |   FAIL   |
|  I32  |     UNIQUE     |     0.9     |   1.202 ms |       1.79% |   1.236 ms |       1.86% |   33.991 us |   2.83% |   FAIL   |
|  I64  |     UNIQUE     |     0.1     |  12.018 ms |       0.99% |  12.427 ms |       0.91% |  409.239 us |   3.41% |   FAIL   |
|  I64  |     UNIQUE     |     0.2     |   6.069 ms |       4.42% |   6.190 ms |       4.13% |  121.385 us |   2.00% |   PASS   |
|  I64  |     UNIQUE     |     0.3     |   4.095 ms |       0.29% |   4.226 ms |       0.41% |  131.192 us |   3.20% |   FAIL   |
|  I64  |     UNIQUE     |     0.4     |   3.114 ms |       0.78% |   3.215 ms |       1.13% |  101.358 us |   3.26% |   FAIL   |
|  I64  |     UNIQUE     |     0.5     |   2.548 ms |       1.03% |   2.603 ms |       1.16% |   55.490 us |   2.18% |   FAIL   |
|  I64  |     UNIQUE     |     0.6     |   2.167 ms |       1.50% |   2.194 ms |       0.91% |   26.927 us |   1.24% |   FAIL   |
|  I64  |     UNIQUE     |     0.7     |   1.867 ms |       1.42% |   1.956 ms |       1.07% |   88.928 us |   4.76% |   FAIL   |
|  I64  |     UNIQUE     |     0.8     |   1.648 ms |       1.26% |   1.730 ms |       1.64% |   82.023 us |   4.98% |   FAIL   |
|  I64  |     UNIQUE     |     0.9     |   1.475 ms |       1.78% |   1.541 ms |       1.70% |   65.659 us |   4.45% |   FAIL   |

# static_set_rehash_unique_occupancy

## [0] NVIDIA H100 PCIe

|  Key  |  Distribution  |  Occupancy  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|----------------|-------------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |     UNIQUE     |     0.1     |   6.762 ms |       0.98% |   6.864 ms |       2.11% | 102.812 us |   1.52% |   FAIL   |
|  I32  |     UNIQUE     |     0.2     |   7.956 ms |       0.48% |   8.097 ms |       0.51% | 140.946 us |   1.77% |   FAIL   |
|  I32  |     UNIQUE     |     0.3     |  10.891 ms |       0.92% |  11.111 ms |       1.05% | 219.960 us |   2.02% |   FAIL   |
|  I32  |     UNIQUE     |     0.4     |  13.595 ms |       1.34% |  13.860 ms |       1.06% | 265.411 us |   1.95% |   FAIL   |
|  I32  |     UNIQUE     |     0.5     |  15.988 ms |       1.27% |  16.596 ms |       1.47% | 608.188 us |   3.80% |   FAIL   |
|  I32  |     UNIQUE     |     0.6     |  19.677 ms |       2.06% |  20.418 ms |       2.25% | 741.516 us |   3.77% |   FAIL   |
|  I32  |     UNIQUE     |     0.7     |  23.233 ms |       2.42% |  24.000 ms |       2.31% | 766.825 us |   3.30% |   FAIL   |
|  I32  |     UNIQUE     |     0.8     |  28.444 ms |       2.56% |  29.421 ms |       2.35% | 976.883 us |   3.43% |   FAIL   |
|  I32  |     UNIQUE     |     0.9     |  36.253 ms |       2.12% |  37.452 ms |       1.95% |   1.200 ms |   3.31% |   FAIL   |
|  I64  |     UNIQUE     |     0.1     |   7.466 ms |       0.30% |   7.569 ms |       0.75% | 102.416 us |   1.37% |   FAIL   |
|  I64  |     UNIQUE     |     0.2     |   8.749 ms |       0.39% |   8.916 ms |       0.49% | 167.402 us |   1.91% |   FAIL   |
|  I64  |     UNIQUE     |     0.3     |  11.827 ms |       0.84% |  12.086 ms |       1.25% | 259.724 us |   2.20% |   FAIL   |
|  I64  |     UNIQUE     |     0.4     |  14.539 ms |       1.51% |  15.006 ms |       0.96% | 466.828 us |   3.21% |   FAIL   |
|  I64  |     UNIQUE     |     0.5     |  17.157 ms |       1.61% |  17.646 ms |       1.58% | 488.846 us |   2.85% |   FAIL   |
|  I64  |     UNIQUE     |     0.6     |  20.882 ms |       1.93% |  21.546 ms |       1.61% | 663.449 us |   3.18% |   FAIL   |
|  I64  |     UNIQUE     |     0.7     |  24.540 ms |       1.97% |  25.327 ms |       2.28% | 787.724 us |   3.21% |   FAIL   |
|  I64  |     UNIQUE     |     0.8     |  29.950 ms |       2.37% |  30.745 ms |       2.25% | 794.890 us |   2.65% |   FAIL   |
|  I64  |     UNIQUE     |     0.9     |  37.942 ms |       1.81% |  38.965 ms |       2.12% |   1.022 ms |   2.69% |   FAIL   |

# Summary

- Total Matches: 198
  - Pass    (diff <= min_noise): 5
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  193

sleeepyjack avatar Mar 22 '24 14:03 sleeepyjack

About 50% slower for find :scream:

PointKernel avatar Mar 22 '24 17:03 PointKernel

I cannot reproduce the performance regression on my local RTX 8000:

# static_set_find_unique_occupancy

## [0] Quadro RTX 8000

|  Key  |  Distribution  |  Occupancy  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|----------------|-------------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |     UNIQUE     |     0.1     |  31.941 ms |       0.28% |  31.925 ms |       0.14% | -16.024 us |  -0.05% |   PASS   |
|  I32  |     UNIQUE     |     0.2     |  31.934 ms |       0.01% |  31.939 ms |       0.02% |   4.440 us |   0.01% |   FAIL   |
|  I32  |     UNIQUE     |     0.3     |  32.245 ms |       0.01% |  32.242 ms |       0.01% |  -2.355 us |  -0.01% |   PASS   |
|  I32  |     UNIQUE     |     0.4     |  33.012 ms |       0.02% |  33.002 ms |       0.01% | -10.054 us |  -0.03% |   FAIL   |
|  I32  |     UNIQUE     |     0.5     |  34.432 ms |       0.02% |  34.449 ms |       0.01% |  16.904 us |   0.05% |   FAIL   |
|  I32  |     UNIQUE     |     0.6     |  36.910 ms |       0.01% |  36.915 ms |       0.01% |   4.985 us |   0.01% |   FAIL   |
|  I32  |     UNIQUE     |     0.7     |  41.218 ms |       0.01% |  41.223 ms |       0.02% |   5.061 us |   0.01% |   FAIL   |
|  I32  |     UNIQUE     |     0.8     |  49.906 ms |       0.01% |  49.917 ms |       0.01% |  11.477 us |   0.02% |   FAIL   |
|  I32  |     UNIQUE     |     0.9     |  75.340 ms |       0.02% |  75.365 ms |       0.04% |  24.811 us |   0.03% |   FAIL   |
|  I64  |     UNIQUE     |     0.1     |  33.937 ms |       0.01% |  33.950 ms |       0.01% |  12.970 us |   0.04% |   FAIL   |
|  I64  |     UNIQUE     |     0.2     |  34.052 ms |       0.02% |  34.045 ms |       0.01% |  -7.124 us |  -0.02% |   FAIL   |
|  I64  |     UNIQUE     |     0.3     |  34.408 ms |       0.02% |  34.429 ms |       0.01% |  21.481 us |   0.06% |   FAIL   |
|  I64  |     UNIQUE     |     0.4     |  35.258 ms |       0.02% |  35.275 ms |       0.02% |  17.199 us |   0.05% |   FAIL   |
|  I64  |     UNIQUE     |     0.5     |  36.791 ms |       0.01% |  36.802 ms |       0.01% |  11.845 us |   0.03% |   FAIL   |
|  I64  |     UNIQUE     |     0.6     |  39.389 ms |       0.02% |  39.397 ms |       0.01% |   8.795 us |   0.02% |   FAIL   |
|  I64  |     UNIQUE     |     0.7     |  43.888 ms |       0.01% |  43.890 ms |       0.01% |   2.224 us |   0.01% |   PASS   |
|  I64  |     UNIQUE     |     0.8     |  52.894 ms |       0.02% |  52.914 ms |       0.01% |  20.027 us |   0.04% |   FAIL   |
|  I64  |     UNIQUE     |     0.9     |  79.321 ms |       0.02% |  79.380 ms |       0.04% |  59.164 us |   0.07% |   FAIL   |

Could this issue be H100-specific?

PointKernel avatar Mar 25 '24 19:03 PointKernel

That's expected on architectures older than sm_90. From sm_90 onward, the function takes a different code path that uses the new ELECT instruction. Let me collect some profiles on H100 so we can investigate what's going on.

sleeepyjack avatar Mar 26 '24 03:03 sleeepyjack

Update: I ran the same exact benchmarks on another H100 node (CTK 12.3) today and wasn't able to reproduce the regression I initially reported. Will run another test on a different node and report back.

sleeepyjack avatar May 16 '24 00:05 sleeepyjack

@sleeepyjack Any updates on H100 perf results?

PointKernel avatar May 28 '24 22:05 PointKernel

I tested another H100 HBM3 node with CTK 12.3, and the performance regression is still present, although much less pronounced (around 5-6% less throughput).

The instruction diff between the baseline and this version is straightforward: The baseline

if (tile.thread_rank() == 0) {...}

compiles to

ISETP.NE.AND P0, PT, R7, RZ, PT

which sets P0=true for the leader thread.

The cg::invoke_one version translates to

IMAD.MOV.U32 R15, RZ, RZ, 0xf
BSSY B0, 0x7f8f8ec57d60
SHF.L.U32 R15, R15, R8, RZ
WARPSYNC.EXCLUSIVE R15
ELECT P0, URZ, ~URZ
BSYNC B0

Compared to the baseline, the new version additionally computes the mask of participating threads (SHF) which is then used to synchronize these threads (WARPSYNC) before electing a leader (ELECT).

I have some profiles which I will share via Slack since I had to use an internal toolchain.

sleeepyjack avatar May 29 '24 16:05 sleeepyjack