Add host-bulk `insert_or_apply` using shared_memory
This PR adds a host-bulk `insert_or_apply` that uses shared memory, which can improve performance for inputs with low cardinality and very high multiplicity.
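To illustrate why low cardinality and high multiplicity are the favorable case, here is a minimal CPU-side sketch in C++. It is an analogy, not the kernel in this PR: each fixed-size chunk of the input stands in for a thread block, a small local map stands in for the shared-memory map, and the reduction is assumed to be a sum. The names `insert_or_apply_global`, `insert_or_apply_two_level`, `Result`, and `chunk_size` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Key   = std::int32_t;
using Value = std::int32_t;
using Map   = std::unordered_map<Key, Value>;
using Pairs = std::vector<std::pair<Key, Value>>;

struct Result {
  Map table;                   // final key -> aggregated value
  std::size_t global_updates;  // number of updates applied to the "global" table
};

// Baseline: every input pair updates the global table directly.
Result insert_or_apply_global(const Pairs& input) {
  Result r{{}, 0};
  for (const auto& kv : input) {
    r.table[kv.first] += kv.second;  // insert if absent, otherwise apply the sum
    ++r.global_updates;
  }
  return r;
}

// Two-level variant: aggregate each chunk in a local map first (the stand-in
// for a per-block shared-memory map), then flush one entry per distinct key.
Result insert_or_apply_two_level(const Pairs& input, std::size_t chunk_size) {
  assert(chunk_size > 0 && "chunk_size must be positive");
  Result r{{}, 0};
  for (std::size_t begin = 0; begin < input.size(); begin += chunk_size) {
    const std::size_t end = std::min(begin + chunk_size, input.size());
    Map local;
    for (std::size_t i = begin; i < end; ++i) {
      local[input[i].first] += input[i].second;
    }
    for (const auto& kv : local) {
      r.table[kv.first] += kv.second;
      ++r.global_updates;
    }
  }
  return r;
}
```

In this sketch, 1000 inputs spread over 4 distinct keys and processed in chunks of 100 need 1000 global updates in the baseline but only 40 in the two-level variant (10 chunks times 4 keys). When every key is distinct, the local pass saves nothing, which fits the motivation stated above.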
/ok to test
Benchmarks:
Ref Time = shared memory implementation (this PR) [after]; Cmp Time = global memory implementation [before]. Diff = Cmp Time - Ref Time, so a positive Diff means the shared memory implementation is faster.
['./shmem_h100.json', './global_h100.json']
# static_map_insert_or_apply_uniform_multiplicity
## [0] NVIDIA H100 80GB HBM3
| Key | Value | Distribution | Cardinality | NumInputs | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|---------|----------------|---------------|-------------|------------|-------------|------------|-------------|--------------|----------|----------|
| I32 | I32 | UNIFORM | 1 | 1 | 37.307 us | 2.21% | 36.785 us | 3.37% | -0.522 us | -1.40% | PASS |
| I32 | I32 | UNIFORM | 1 | 128 | 37.070 us | 3.87% | 36.509 us | 1.84% | -0.561 us | -1.51% | PASS |
| I32 | I32 | UNIFORM | 128 | 128 | 37.009 us | 2.08% | 36.674 us | 3.10% | -0.335 us | -0.91% | PASS |
| I32 | I32 | UNIFORM | 1 | 256 | 36.768 us | 2.52% | 36.496 us | 2.83% | -0.272 us | -0.74% | PASS |
| I32 | I32 | UNIFORM | 128 | 256 | 36.806 us | 2.99% | 36.533 us | 5.51% | -0.273 us | -0.74% | PASS |
| I32 | I32 | UNIFORM | 256 | 256 | 36.721 us | 2.39% | 36.544 us | 2.20% | -0.177 us | -0.48% | PASS |
| I32 | I32 | UNIFORM | 1 | 512 | 36.696 us | 3.22% | 36.249 us | 2.04% | -0.447 us | -1.22% | PASS |
| I32 | I32 | UNIFORM | 128 | 512 | 36.663 us | 2.43% | 36.321 us | 1.74% | -0.342 us | -0.93% | PASS |
| I32 | I32 | UNIFORM | 256 | 512 | 36.743 us | 6.61% | 36.546 us | 5.13% | -0.197 us | -0.54% | PASS |
| I32 | I32 | UNIFORM | 512 | 512 | 36.805 us | 2.31% | 36.466 us | 3.55% | -0.339 us | -0.92% | PASS |
| I32 | I32 | UNIFORM | 1 | 1000 | 36.985 us | 2.81% | 36.676 us | 2.76% | -0.309 us | -0.84% | PASS |
| I32 | I32 | UNIFORM | 128 | 1000 | 37.197 us | 2.32% | 36.760 us | 1.77% | -0.437 us | -1.17% | PASS |
| I32 | I32 | UNIFORM | 256 | 1000 | 37.080 us | 2.59% | 36.822 us | 1.85% | -0.257 us | -0.69% | PASS |
| I32 | I32 | UNIFORM | 512 | 1000 | 37.184 us | 2.69% | 36.692 us | 2.27% | -0.492 us | -1.32% | PASS |
| I32 | I32 | UNIFORM | 1000 | 1000 | 37.198 us | 3.92% | 38.413 us | 9.73% | 1.215 us | 3.27% | PASS |
| I32 | I32 | UNIFORM | 1 | 10000 | 52.425 us | 4.91% | 51.852 us | 5.11% | -0.572 us | -1.09% | PASS |
| I32 | I32 | UNIFORM | 128 | 10000 | 36.888 us | 1.83% | 36.601 us | 1.87% | -0.287 us | -0.78% | PASS |
| I32 | I32 | UNIFORM | 256 | 10000 | 37.065 us | 2.17% | 36.566 us | 1.80% | -0.499 us | -1.35% | PASS |
| I32 | I32 | UNIFORM | 512 | 10000 | 36.885 us | 1.90% | 36.536 us | 1.75% | -0.349 us | -0.95% | PASS |
| I32 | I32 | UNIFORM | 1000 | 10000 | 37.044 us | 2.77% | 36.616 us | 2.77% | -0.428 us | -1.16% | PASS |
| I32 | I32 | UNIFORM | 10000 | 10000 | 45.115 us | 1.57% | 44.828 us | 1.42% | -0.286 us | -0.63% | PASS |
| I32 | I32 | UNIFORM | 1 | 100000 | 203.928 us | 4.67% | 207.112 us | 2.12% | 3.184 us | 1.56% | PASS |
| I32 | I32 | UNIFORM | 128 | 100000 | 109.486 us | 8.56% | 112.419 us | 0.84% | 2.933 us | 2.68% | FAIL |
| I32 | I32 | UNIFORM | 256 | 100000 | 109.871 us | 7.70% | 112.517 us | 0.71% | 2.646 us | 2.41% | FAIL |
| I32 | I32 | UNIFORM | 512 | 100000 | 109.773 us | 8.81% | 112.311 us | 0.89% | 2.539 us | 2.31% | FAIL |
| I32 | I32 | UNIFORM | 1000 | 100000 | 108.793 us | 9.69% | 112.325 us | 3.46% | 3.532 us | 3.25% | PASS |
| I32 | I32 | UNIFORM | 10000 | 100000 | 109.010 us | 9.18% | 111.920 us | 1.52% | 2.909 us | 2.67% | FAIL |
| I32 | I32 | UNIFORM | 100000 | 100000 | 110.232 us | 7.93% | 112.106 us | 2.71% | 1.873 us | 1.70% | PASS |
| I32 | I32 | UNIFORM | 1 | 1000000 | 244.797 us | 8.62% | 1.034 ms | 32.04% | 789.250 us | 322.41% | FAIL |
| I32 | I32 | UNIFORM | 128 | 1000000 | 244.435 us | 9.34% | 393.275 us | 299.68% | 148.840 us | 60.89% | FAIL |
| I32 | I32 | UNIFORM | 256 | 1000000 | 244.210 us | 7.45% | 318.566 us | 134.25% | 74.356 us | 30.45% | FAIL |
| I32 | I32 | UNIFORM | 512 | 1000000 | 244.703 us | 8.54% | 329.651 us | 88.30% | 84.948 us | 34.71% | FAIL |
| I32 | I32 | UNIFORM | 1000 | 1000000 | 244.287 us | 8.92% | 359.602 us | 139.77% | 115.315 us | 47.20% | FAIL |
| I32 | I32 | UNIFORM | 10000 | 1000000 | 244.203 us | 8.32% | 318.102 us | 97.06% | 73.899 us | 30.26% | FAIL |
| I32 | I32 | UNIFORM | 100000 | 1000000 | 243.934 us | 6.92% | 286.277 us | 58.85% | 42.343 us | 17.36% | FAIL |
| I32 | I32 | UNIFORM | 1000000 | 1000000 | 261.056 us | 5.88% | 362.666 us | 89.73% | 101.610 us | 38.92% | FAIL |
| I32 | I32 | UNIFORM | 1 | 10000000 | 678.790 us | 2.28% | 8.454 ms | 6.74% | 7.775 ms | 1145.42% | FAIL |
| I32 | I32 | UNIFORM | 128 | 10000000 | 673.851 us | 2.01% | 1.576 ms | 348.86% | 902.336 us | 133.91% | FAIL |
| I32 | I32 | UNIFORM | 256 | 10000000 | 680.172 us | 1.93% | 969.807 us | 2.83% | 289.635 us | 42.58% | FAIL |
| I32 | I32 | UNIFORM | 512 | 10000000 | 690.132 us | 1.85% | 922.436 us | 3.03% | 232.304 us | 33.66% | FAIL |
| I32 | I32 | UNIFORM | 1000 | 10000000 | 716.705 us | 2.04% | 912.846 us | 1.31% | 196.140 us | 27.37% | FAIL |
| I32 | I32 | UNIFORM | 10000 | 10000000 | 1.013 ms | 1.19% | 916.636 us | 0.99% | -95.951 us | -9.48% | FAIL |
| I32 | I32 | UNIFORM | 100000 | 10000000 | 1.041 ms | 1.41% | 936.241 us | 0.65% | -104.789 us | -10.07% | FAIL |
| I32 | I32 | UNIFORM | 1000000 | 10000000 | 1.232 ms | 0.77% | 1.153 ms | 0.82% | -78.668 us | -6.39% | FAIL |
| I32 | I32 | UNIFORM | 10000000 | 10000000 | 1.525 ms | 1.34% | 1.361 ms | 0.58% | -164.357 us | -10.77% | FAIL |
| I32 | I32 | UNIFORM | 1 | 100000000 | 4.563 ms | 0.38% | 78.140 ms | 0.24% | 73.578 ms | 1612.62% | FAIL |
| I32 | I32 | UNIFORM | 128 | 100000000 | 4.563 ms | 0.46% | 83.503 ms | 254.79% | 78.939 ms | 1729.84% | FAIL |
| I32 | I32 | UNIFORM | 256 | 100000000 | 4.563 ms | 0.40% | 6.912 ms | 49.11% | 2.349 ms | 51.49% | FAIL |
| I32 | I32 | UNIFORM | 512 | 100000000 | 4.565 ms | 0.52% | 5.884 ms | 0.57% | 1.319 ms | 28.90% | FAIL |
| I32 | I32 | UNIFORM | 1000 | 100000000 | 4.563 ms | 0.49% | 5.898 ms | 0.43% | 1.335 ms | 29.26% | FAIL |
| I32 | I32 | UNIFORM | 10000 | 100000000 | 7.966 ms | 1.12% | 10.623 ms | 299.99% | 2.657 ms | 33.35% | FAIL |
| I32 | I32 | UNIFORM | 100000 | 100000000 | 8.208 ms | 1.46% | 6.876 ms | 0.23% | -1331.875 us | -16.23% | FAIL |
| I32 | I32 | UNIFORM | 1000000 | 100000000 | 11.469 ms | 23.15% | 10.047 ms | 0.98% | -1422.299 us | -12.40% | FAIL |
| I32 | I32 | UNIFORM | 10000000 | 100000000 | 14.049 ms | 0.25% | 12.469 ms | 0.86% | -1579.866 us | -11.25% | FAIL |
| I32 | I32 | UNIFORM | 100000000 | 100000000 | 15.535 ms | 0.53% | 12.898 ms | 3.16% | -2636.527 us | -16.97% | FAIL |
| I64 | I64 | UNIFORM | 1 | 1 | 37.464 us | 2.03% | 36.719 us | 7.31% | -0.745 us | -1.99% | PASS |
| I64 | I64 | UNIFORM | 1 | 128 | 36.984 us | 1.95% | 36.550 us | 1.75% | -0.434 us | -1.17% | PASS |
| I64 | I64 | UNIFORM | 128 | 128 | 45.210 us | 1.50% | 44.572 us | 1.40% | -0.638 us | -1.41% | FAIL |
| I64 | I64 | UNIFORM | 1 | 256 | 36.912 us | 1.97% | 36.503 us | 1.93% | -0.409 us | -1.11% | PASS |
| I64 | I64 | UNIFORM | 128 | 256 | 37.069 us | 2.08% | 36.561 us | 2.71% | -0.508 us | -1.37% | PASS |
| I64 | I64 | UNIFORM | 256 | 256 | 37.391 us | 4.26% | 37.511 us | 6.58% | 0.120 us | 0.32% | PASS |
| I64 | I64 | UNIFORM | 1 | 512 | 37.179 us | 1.88% | 36.468 us | 1.89% | -0.712 us | -1.91% | FAIL |
| I64 | I64 | UNIFORM | 128 | 512 | 36.982 us | 1.95% | 36.524 us | 1.92% | -0.458 us | -1.24% | PASS |
| I64 | I64 | UNIFORM | 256 | 512 | 37.095 us | 1.99% | 36.460 us | 1.81% | -0.635 us | -1.71% | PASS |
| I64 | I64 | UNIFORM | 512 | 512 | 37.822 us | 6.17% | 37.988 us | 7.81% | 0.166 us | 0.44% | PASS |
| I64 | I64 | UNIFORM | 1 | 1000 | 37.550 us | 1.98% | 36.526 us | 1.84% | -1.025 us | -2.73% | FAIL |
| I64 | I64 | UNIFORM | 128 | 1000 | 37.208 us | 2.13% | 36.655 us | 1.87% | -0.552 us | -1.48% | PASS |
| I64 | I64 | UNIFORM | 256 | 1000 | 37.434 us | 3.45% | 36.648 us | 1.78% | -0.785 us | -2.10% | FAIL |
| I64 | I64 | UNIFORM | 512 | 1000 | 37.491 us | 6.23% | 36.868 us | 3.58% | -0.623 us | -1.66% | PASS |
| I64 | I64 | UNIFORM | 1000 | 1000 | 43.626 us | 7.58% | 43.458 us | 6.59% | -0.169 us | -0.39% | PASS |
| I64 | I64 | UNIFORM | 1 | 10000 | 53.436 us | 2.33% | 52.636 us | 2.47% | -0.800 us | -1.50% | PASS |
| I64 | I64 | UNIFORM | 128 | 10000 | 37.415 us | 1.88% | 36.464 us | 22.43% | -0.951 us | -2.54% | FAIL |
| I64 | I64 | UNIFORM | 256 | 10000 | 37.369 us | 1.84% | 36.597 us | 1.78% | -0.771 us | -2.06% | FAIL |
| I64 | I64 | UNIFORM | 512 | 10000 | 37.482 us | 1.88% | 36.518 us | 2.20% | -0.964 us | -2.57% | FAIL |
| I64 | I64 | UNIFORM | 1000 | 10000 | 37.171 us | 2.03% | 41.138 us | 20.32% | 3.967 us | 10.67% | FAIL |
| I64 | I64 | UNIFORM | 10000 | 10000 | 45.737 us | 1.46% | 45.163 us | 3.03% | -0.573 us | -1.25% | PASS |
| I64 | I64 | UNIFORM | 1 | 100000 | 193.064 us | 4.08% | 549.889 us | 720.60% | 356.825 us | 184.82% | FAIL |
| I64 | I64 | UNIFORM | 128 | 100000 | 109.698 us | 5.05% | 112.364 us | 1.20% | 2.666 us | 2.43% | FAIL |
| I64 | I64 | UNIFORM | 256 | 100000 | 109.690 us | 5.47% | 212.610 us | 562.95% | 102.920 us | 93.83% | FAIL |
| I64 | I64 | UNIFORM | 512 | 100000 | 109.808 us | 5.65% | 207.059 us | 1495.14% | 97.251 us | 88.56% | FAIL |
| I64 | I64 | UNIFORM | 1000 | 100000 | 109.780 us | 5.69% | 112.236 us | 0.94% | 2.457 us | 2.24% | FAIL |
| I64 | I64 | UNIFORM | 10000 | 100000 | 109.738 us | 4.20% | 120.851 us | 435.33% | 11.113 us | 10.13% | FAIL |
| I64 | I64 | UNIFORM | 100000 | 100000 | 119.594 us | 92.07% | 120.331 us | 2.61% | 0.737 us | 0.62% | PASS |
| I64 | I64 | UNIFORM | 1 | 1000000 | 264.261 us | 8.38% | 1.085 ms | 3.64% | 820.626 us | 310.54% | FAIL |
| I64 | I64 | UNIFORM | 128 | 1000000 | 262.781 us | 5.94% | 305.585 us | 81.15% | 42.804 us | 16.29% | FAIL |
| I64 | I64 | UNIFORM | 256 | 1000000 | 263.915 us | 6.21% | 379.687 us | 111.75% | 115.772 us | 43.87% | FAIL |
| I64 | I64 | UNIFORM | 512 | 1000000 | 263.520 us | 4.70% | 349.172 us | 133.56% | 85.652 us | 32.50% | FAIL |
| I64 | I64 | UNIFORM | 1000 | 1000000 | 263.825 us | 6.61% | 442.121 us | 118.65% | 178.296 us | 67.58% | FAIL |
| I64 | I64 | UNIFORM | 10000 | 1000000 | 265.812 us | 6.00% | 362.939 us | 87.24% | 97.127 us | 36.54% | FAIL |
| I64 | I64 | UNIFORM | 100000 | 1000000 | 264.857 us | 4.10% | 302.192 us | 76.95% | 37.335 us | 14.10% | FAIL |
| I64 | I64 | UNIFORM | 1000000 | 1000000 | 302.500 us | 2.84% | 431.469 us | 87.92% | 128.968 us | 42.63% | FAIL |
| I64 | I64 | UNIFORM | 1 | 10000000 | 1.240 ms | 1.22% | 11.841 ms | 99.01% | 10.601 ms | 854.57% | FAIL |
| I64 | I64 | UNIFORM | 128 | 10000000 | 1.237 ms | 0.67% | 3.153 ms | 272.79% | 1.916 ms | 154.85% | FAIL |
| I64 | I64 | UNIFORM | 256 | 10000000 | 1.249 ms | 0.31% | 1.728 ms | 0.49% | 478.947 us | 38.33% | FAIL |
| I64 | I64 | UNIFORM | 512 | 10000000 | 1.268 ms | 0.71% | 1.661 ms | 1.02% | 392.987 us | 30.99% | FAIL |
| I64 | I64 | UNIFORM | 1000 | 10000000 | 1.310 ms | 0.35% | 1.596 ms | 0.61% | 286.049 us | 21.84% | FAIL |
| I64 | I64 | UNIFORM | 10000 | 10000000 | 1.576 ms | 0.40% | 1.533 ms | 0.49% | -43.815 us | -2.78% | FAIL |
| I64 | I64 | UNIFORM | 100000 | 10000000 | 1.576 ms | 0.38% | 1.544 ms | 0.49% | -32.773 us | -2.08% | FAIL |
| I64 | I64 | UNIFORM | 1000000 | 10000000 | 2.135 ms | 0.72% | 1.921 ms | 0.85% | -213.980 us | -10.02% | FAIL |
| I64 | I64 | UNIFORM | 10000000 | 10000000 | 2.402 ms | 0.99% | 2.178 ms | 1.04% | -223.449 us | -9.30% | FAIL |
| I64 | I64 | UNIFORM | 1 | 100000000 | 5.645 ms | 4.22% | 86.638 ms | 0.62% | 80.993 ms | 1434.71% | FAIL |
| I64 | I64 | UNIFORM | 128 | 100000000 | 6.365 ms | 32.25% | 12.675 ms | 1.16% | 6.310 ms | 99.14% | FAIL |
| I64 | I64 | UNIFORM | 256 | 100000000 | 6.613 ms | 40.31% | 11.624 ms | 0.62% | 5.010 ms | 75.76% | FAIL |
| I64 | I64 | UNIFORM | 512 | 100000000 | 7.042 ms | 42.60% | 10.042 ms | 2.08% | 3.000 ms | 42.60% | FAIL |
| I64 | I64 | UNIFORM | 1000 | 100000000 | 6.739 ms | 14.99% | 10.212 ms | 0.30% | 3.474 ms | 51.55% | FAIL |
| I64 | I64 | UNIFORM | 10000 | 100000000 | 11.769 ms | 2.87% | 11.108 ms | 0.36% | -660.833 us | -5.62% | FAIL |
| I64 | I64 | UNIFORM | 100000 | 100000000 | 11.778 ms | 0.14% | 11.308 ms | 0.64% | -470.690 us | -4.00% | FAIL |
| I64 | I64 | UNIFORM | 1000000 | 100000000 | 16.359 ms | 0.55% | 13.897 ms | 1.54% | -2461.981 us | -15.05% | FAIL |
| I64 | I64 | UNIFORM | 10000000 | 100000000 | 19.474 ms | 0.42% | 17.215 ms | 0.27% | -2259.446 us | -11.60% | FAIL |
| I64 | I64 | UNIFORM | 100000000 | 100000000 | 20.222 ms | 0.30% | 17.643 ms | 5.83% | -2578.661 us | -12.75% | FAIL |
# Summary
- Total Matches: 110
- Pass (diff <= min_noise): 38
- Unknown (infinite noise): 0
- Failure (diff > min_noise): 72
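For readers checking individual rows, the following C++ sketch reproduces the Diff, %Diff, and Status columns from the four measured columns. It assumes that `min_noise` in the summary means the smaller of Ref Noise and Cmp Noise and that it is compared against the absolute %Diff; this reading matches the rows above, but the comparison script itself may apply additional rules. The names `Comparison` and `compare_times` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <string>

struct Comparison {
  double diff;         // cmp_time - ref_time, in the same unit as the inputs
  double pct_diff;     // diff relative to ref_time, in percent
  std::string status;  // "PASS" or "FAIL"
};

// Both times must use the same unit; both noise values are in percent.
Comparison compare_times(double ref_time, double ref_noise_pct,
                         double cmp_time, double cmp_noise_pct) {
  assert(ref_time > 0.0 && "reference time must be positive");
  const double diff      = cmp_time - ref_time;
  const double pct_diff  = 100.0 * diff / ref_time;
  // Assumed rule: the change counts as noise only if it is within the
  // smaller of the two reported noise levels.
  const double min_noise = std::min(ref_noise_pct, cmp_noise_pct);
  const bool pass        = std::fabs(pct_diff) <= min_noise;
  return {diff, pct_diff, pass ? "PASS" : "FAIL"};
}
```

For the I32 row with Cardinality 128 and NumInputs 100000, `compare_times(109.486, 8.56, 112.419, 0.84)` gives a difference of about 2.68%, which exceeds the 0.84% minimum noise and is therefore reported as FAIL even though the absolute gap is under 3 us. FAIL here marks a difference that exceeds the noise in either direction, not a regression.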
I think there are issues with the rebase. I need to resolve them and push the changes again.
Because the CI tests fail with GCC 12, I have added a workaround (in e804b4c) that avoids the error by using a pre-constructed value as the size of shared_map.