Also compare batch measurements in nvbench_compare.py
Fixes: #247
Cold and batch measurements can sometimes differ substantially, so we want to show both. An example is kernels using PDL (Programmatic Dependent Launch).
Here is a comparison of DeviceTransform with and without PDL (see also https://github.com/NVIDIA/cccl/pull/5249):
# mul
## [0] NVIDIA B200
| T{ct} | OffsetT{ct} | Elements{io} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status | B Ref Time | B Cmp Time | B Diff | B %Diff | B Status |
|---------|---------------|----------------|------------|-------------|------------|-------------|-----------|---------|----------|--------------|--------------|-----------|-----------|------------|
| I8 | I32 | 2^16 | 5.729 us | 11.77% | 5.734 us | 11.65% | 0.005 us | 0.08% | SAME | 4.094 us | 1.622 us | -2.472 us | -60.39% | FAST |
| I8 | I32 | 2^20 | 6.016 us | 5.61% | 6.130 us | 7.31% | 0.114 us | 1.90% | SAME | 4.101 us | 2.831 us | -1.270 us | -30.97% | FAST |
| I8 | I32 | 2^24 | 13.739 us | 5.56% | 13.620 us | 5.84% | -0.119 us | -0.87% | SAME | 10.271 us | 8.388 us | -1.883 us | -18.33% | FAST |
| I8 | I32 | 2^28 | 114.526 us | 0.25% | 114.550 us | 0.28% | 0.024 us | 0.02% | SAME | 112.599 us | 110.039 us | -2.560 us | -2.27% | FAST |
| I8 | I64 | 2^16 | 5.494 us | 14.80% | 5.512 us | 14.22% | 0.018 us | 0.33% | SAME | 4.094 us | 1.594 us | -2.500 us | -61.06% | FAST |
| I8 | I64 | 2^20 | 6.059 us | 5.71% | 6.383 us | 9.59% | 0.324 us | 5.35% | SAME | 4.101 us | 2.826 us | -1.274 us | -31.07% | FAST |
| I8 | I64 | 2^24 | 14.075 us | 2.96% | 14.041 us | 3.52% | -0.035 us | -0.25% | SAME | 10.290 us | 8.436 us | -1.854 us | -18.02% | FAST |
| I8 | I64 | 2^28 | 115.311 us | 0.80% | 115.433 us | 0.82% | 0.122 us | 0.11% | SAME | 112.952 us | 110.872 us | -2.079 us | -1.84% | FAST |
| I16 | I32 | 2^16 | 5.346 us | 16.26% | 5.377 us | 15.65% | 0.031 us | 0.58% | SAME | 4.094 us | 1.599 us | -2.496 us | -60.96% | FAST |
| I16 | I32 | 2^20 | 6.099 us | 6.33% | 6.380 us | 9.76% | 0.281 us | 4.61% | SAME | 4.099 us | 2.815 us | -1.285 us | -31.34% | FAST |
| I16 | I32 | 2^24 | 16.382 us | 3.32% | 16.376 us | 3.23% | -0.006 us | -0.04% | SAME | 12.288 us | 10.074 us | -2.214 us | -18.02% | FAST |
| I16 | I32 | 2^28 | 161.785 us | 0.28% | 161.935 us | 0.44% | 0.150 us | 0.09% | SAME | 159.618 us | 156.623 us | -2.996 us | -1.88% | FAST |
| I16 | I64 | 2^16 | 5.672 us | 12.72% | 5.756 us | 11.04% | 0.084 us | 1.48% | SAME | 4.094 us | 1.577 us | -2.518 us | -61.49% | FAST |
| I16 | I64 | 2^20 | 6.129 us | 6.10% | 6.391 us | 10.08% | 0.262 us | 4.28% | SAME | 4.102 us | 2.816 us | -1.286 us | -31.36% | FAST |
| I16 | I64 | 2^24 | 16.434 us | 3.20% | 16.467 us | 3.44% | 0.034 us | 0.21% | SAME | 12.289 us | 10.120 us | -2.169 us | -17.65% | FAST |
| I16 | I64 | 2^28 | 163.663 us | 0.22% | 163.721 us | 0.16% | 0.058 us | 0.04% | SAME | 160.345 us | 158.210 us | -2.135 us | -1.33% | FAST |
| F32 | I32 | 2^16 | 5.935 us | 6.94% | 5.907 us | 6.24% | -0.028 us | -0.47% | SAME | 4.094 us | 1.595 us | -2.499 us | -61.04% | FAST |
| F32 | I32 | 2^20 | 7.543 us | 10.91% | 7.497 us | 10.90% | -0.046 us | -0.61% | SAME | 4.099 us | 2.856 us | -1.244 us | -30.34% | FAST |
| F32 | I32 | 2^24 | 25.470 us | 3.65% | 25.395 us | 3.59% | -0.075 us | -0.29% | SAME | 20.843 us | 19.593 us | -1.251 us | -6.00% | FAST |
| F32 | I32 | 2^28 | 313.226 us | 0.32% | 313.291 us | 0.30% | 0.065 us | 0.02% | SAME | 311.189 us | 308.332 us | -2.857 us | -0.92% | SAME |
| F32 | I64 | 2^16 | 5.545 us | 14.41% | 5.487 us | 14.25% | -0.058 us | -1.05% | SAME | 4.094 us | 1.600 us | -2.494 us | -60.92% | FAST |
| F32 | I64 | 2^20 | 7.494 us | 11.26% | 7.432 us | 11.17% | -0.062 us | -0.83% | SAME | 4.099 us | 2.859 us | -1.241 us | -30.26% | FAST |
| F32 | I64 | 2^24 | 25.602 us | 3.56% | 25.608 us | 3.53% | 0.006 us | 0.02% | SAME | 20.726 us | 19.591 us | -1.136 us | -5.48% | FAST |
| F32 | I64 | 2^28 | 313.271 us | 0.33% | 313.254 us | 0.31% | -0.017 us | -0.01% | SAME | 311.100 us | 308.282 us | -2.818 us | -0.91% | SAME |
| F64 | I32 | 2^16 | 5.706 us | 11.69% | 5.719 us | 11.38% | 0.014 us | 0.24% | SAME | 4.094 us | 1.596 us | -2.498 us | -61.01% | FAST |
| F64 | I32 | 2^20 | 8.084 us | 4.22% | 8.043 us | 4.93% | -0.041 us | -0.51% | SAME | 4.108 us | 2.882 us | -1.226 us | -29.85% | FAST |
| F64 | I32 | 2^24 | 45.629 us | 2.07% | 45.498 us | 1.93% | -0.131 us | -0.29% | SAME | 43.046 us | 40.308 us | -2.738 us | -6.36% | FAST |
| F64 | I32 | 2^28 | 620.092 us | 0.23% | 620.155 us | 0.19% | 0.063 us | 0.01% | SAME | 617.714 us | 614.871 us | -2.843 us | -0.46% | SAME |
| F64 | I64 | 2^16 | 5.698 us | 11.99% | 5.698 us | 11.30% | -0.001 us | -0.01% | SAME | 4.094 us | 1.616 us | -2.478 us | -60.53% | FAST |
| F64 | I64 | 2^20 | 8.098 us | 4.25% | 8.032 us | 4.07% | -0.067 us | -0.82% | SAME | 4.106 us | 2.896 us | -1.210 us | -29.47% | FAST |
| F64 | I64 | 2^24 | 45.517 us | 2.01% | 45.637 us | 2.05% | 0.119 us | 0.26% | SAME | 43.031 us | 40.252 us | -2.779 us | -6.46% | FAST |
| F64 | I64 | 2^28 | 620.032 us | 0.22% | 620.114 us | 0.22% | 0.082 us | 0.01% | SAME | 617.629 us | 614.842 us | -2.786 us | -0.45% | SAME |
| I128 | I32 | 2^16 | 5.959 us | 4.88% | 5.917 us | 5.20% | -0.042 us | -0.71% | SAME | 4.094 us | 1.596 us | -2.499 us | -61.03% | FAST |
| I128 | I32 | 2^20 | 10.308 us | 4.86% | 10.378 us | 5.50% | 0.070 us | 0.67% | SAME | 6.141 us | 5.100 us | -1.042 us | -16.97% | FAST |
| I128 | I32 | 2^24 | 83.570 us | 0.98% | 83.639 us | 0.89% | 0.068 us | 0.08% | SAME | 81.530 us | 78.615 us | -2.916 us | -3.58% | FAST |
| I128 | I32 | 2^28 | 1.235 ms | 0.15% | 1.235 ms | 0.15% | -0.010 us | -0.00% | SAME | 1.233 ms | 1.231 ms | -2.248 us | -0.18% | SAME |
| I128 | I64 | 2^16 | 5.979 us | 4.55% | 5.995 us | 4.38% | 0.016 us | 0.27% | SAME | 4.094 us | 1.601 us | -2.493 us | -60.90% | FAST |
| I128 | I64 | 2^20 | 10.320 us | 4.89% | 10.433 us | 6.27% | 0.113 us | 1.10% | SAME | 6.141 us | 5.091 us | -1.050 us | -17.10% | FAST |
| I128 | I64 | 2^24 | 83.769 us | 0.74% | 83.803 us | 0.77% | 0.034 us | 0.04% | SAME | 81.544 us | 78.602 us | -2.942 us | -3.61% | FAST |
| I128 | I64 | 2^28 | 1.234 ms | 0.15% | 1.234 ms | 0.14% | -0.059 us | -0.00% | SAME | 1.233 ms | 1.231 ms | -1.907 us | -0.15% | SAME |
# add
## [0] NVIDIA B200
| T{ct} | OffsetT{ct} | Elements{io} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status | B Ref Time | B Cmp Time | B Diff | B %Diff | B Status |
|---------|---------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|--------------|--------------|------------|-----------|------------|
| I8 | I32 | 2^16 | 5.829 us | 9.25% | 5.838 us | 8.67% | 0.009 us | 0.16% | SAME | 4.094 us | 1.594 us | -2.500 us | -61.06% | FAST |
| I8 | I32 | 2^20 | 6.113 us | 6.09% | 6.388 us | 10.02% | 0.275 us | 4.49% | SAME | 4.098 us | 3.051 us | -1.047 us | -25.55% | FAST |
| I8 | I32 | 2^24 | 15.292 us | 6.13% | 15.207 us | 6.09% | -0.085 us | -0.56% | SAME | 11.264 us | 9.365 us | -1.899 us | -16.86% | FAST |
| I8 | I32 | 2^28 | 141.213 us | 0.11% | 141.350 us | 0.41% | 0.137 us | 0.10% | SAME | 137.296 us | 135.994 us | -1.302 us | -0.95% | SAME |
| I8 | I64 | 2^16 | 5.919 us | 7.47% | 5.926 us | 6.03% | 0.007 us | 0.12% | SAME | 4.094 us | 1.604 us | -2.490 us | -60.82% | FAST |
| I8 | I64 | 2^20 | 6.111 us | 5.90% | 6.422 us | 10.09% | 0.310 us | 5.07% | SAME | 4.098 us | 3.097 us | -1.001 us | -24.43% | FAST |
| I8 | I64 | 2^24 | 15.775 us | 5.14% | 15.685 us | 5.19% | -0.091 us | -0.57% | SAME | 11.875 us | 9.387 us | -2.489 us | -20.96% | FAST |
| I8 | I64 | 2^28 | 143.584 us | 0.49% | 143.338 us | 0.34% | -0.246 us | -0.17% | SAME | 141.093 us | 137.760 us | -3.333 us | -2.36% | FAST |
| I16 | I32 | 2^16 | 5.936 us | 6.48% | 5.841 us | 9.35% | -0.095 us | -1.60% | SAME | 4.094 us | 1.627 us | -2.467 us | -60.26% | FAST |
| I16 | I32 | 2^20 | 6.649 us | 11.53% | 6.616 us | 11.37% | -0.033 us | -0.49% | SAME | 4.098 us | 3.023 us | -1.075 us | -26.23% | FAST |
| I16 | I32 | 2^24 | 22.386 us | 1.29% | 22.510 us | 2.15% | 0.123 us | 0.55% | SAME | 16.525 us | 18.175 us | 1.650 us | 9.98% | SLOW |
| I16 | I32 | 2^28 | 237.871 us | 0.33% | 272.302 us | 0.11% | 34.431 us | 14.47% | SLOW | 233.873 us | 268.593 us | 34.719 us | 14.85% | SLOW |
| I16 | I64 | 2^16 | 5.623 us | 13.25% | 5.612 us | 13.14% | -0.011 us | -0.20% | SAME | 4.094 us | 1.604 us | -2.490 us | -60.82% | FAST |
| I16 | I64 | 2^20 | 6.673 us | 11.74% | 6.706 us | 11.73% | 0.033 us | 0.50% | SAME | 4.098 us | 3.101 us | -0.997 us | -24.32% | FAST |
| I16 | I64 | 2^24 | 22.365 us | 1.38% | 22.528 us | 1.83% | 0.163 us | 0.73% | SAME | 16.525 us | 18.208 us | 1.683 us | 10.19% | SLOW |
| I16 | I64 | 2^28 | 239.449 us | 0.17% | 272.324 us | 0.19% | 32.875 us | 13.73% | SLOW | 235.292 us | 268.593 us | 33.301 us | 14.15% | SLOW |
| F32 | I32 | 2^16 | 5.787 us | 10.26% | 5.779 us | 10.38% | -0.008 us | -0.13% | SAME | 4.094 us | 1.575 us | -2.519 us | -61.52% | FAST |
| F32 | I32 | 2^20 | 8.008 us | 3.51% | 7.970 us | 3.98% | -0.038 us | -0.47% | SAME | 4.098 us | 2.875 us | -1.223 us | -29.84% | FAST |
| F32 | I32 | 2^24 | 35.876 us | 2.71% | 38.696 us | 0.88% | 2.820 us | 7.86% | SLOW | 32.666 us | 34.847 us | 2.182 us | 6.68% | SLOW |
| F32 | I32 | 2^28 | 454.211 us | 0.26% | 539.121 us | 0.17% | 84.909 us | 18.69% | SLOW | 452.508 us | 535.671 us | 83.163 us | 18.38% | SLOW |
| F32 | I64 | 2^16 | 5.958 us | 4.79% | 5.942 us | 5.19% | -0.017 us | -0.28% | SAME | 4.094 us | 1.590 us | -2.504 us | -61.17% | FAST |
| F32 | I64 | 2^20 | 8.016 us | 3.59% | 8.033 us | 3.33% | 0.018 us | 0.22% | SAME | 4.098 us | 2.880 us | -1.218 us | -29.73% | FAST |
| F32 | I64 | 2^24 | 35.917 us | 2.54% | 38.719 us | 0.85% | 2.802 us | 7.80% | SLOW | 32.694 us | 34.865 us | 2.171 us | 6.64% | SLOW |
| F32 | I64 | 2^28 | 453.938 us | 0.27% | 539.019 us | 0.16% | 85.081 us | 18.74% | SLOW | 452.062 us | 535.638 us | 83.577 us | 18.49% | SLOW |
| F64 | I32 | 2^16 | 6.006 us | 4.47% | 5.973 us | 4.50% | -0.033 us | -0.55% | SAME | 4.094 us | 1.594 us | -2.501 us | -61.08% | FAST |
| F64 | I32 | 2^20 | 10.086 us | 2.80% | 10.059 us | 3.17% | -0.027 us | -0.26% | SAME | 6.141 us | 5.197 us | -0.944 us | -15.37% | FAST |
| F64 | I32 | 2^24 | 65.164 us | 1.05% | 71.574 us | 0.51% | 6.411 us | 9.84% | SLOW | 59.868 us | 68.208 us | 8.340 us | 13.93% | SLOW |
| F64 | I32 | 2^28 | 909.760 us | 0.20% | 1.073 ms | 0.05% | 163.440 us | 17.97% | SLOW | 905.075 us | 1.070 ms | 164.766 us | 18.20% | SLOW |
| F64 | I64 | 2^16 | 5.962 us | 4.90% | 5.957 us | 5.03% | -0.005 us | -0.09% | SAME | 4.094 us | 1.583 us | -2.511 us | -61.34% | FAST |
| F64 | I64 | 2^20 | 10.092 us | 2.78% | 10.089 us | 2.82% | -0.003 us | -0.03% | SAME | 6.141 us | 5.192 us | -0.950 us | -15.46% | FAST |
| F64 | I64 | 2^24 | 65.317 us | 0.74% | 71.592 us | 0.53% | 6.275 us | 9.61% | SLOW | 59.828 us | 68.206 us | 8.378 us | 14.00% | SLOW |
| F64 | I64 | 2^28 | 909.178 us | 0.19% | 1.073 ms | 0.06% | 164.056 us | 18.04% | SLOW | 904.768 us | 1.070 ms | 165.069 us | 18.24% | SLOW |
| I128 | I32 | 2^16 | 5.997 us | 4.65% | 5.947 us | 5.01% | -0.050 us | -0.83% | SAME | 4.094 us | 1.593 us | -2.501 us | -61.10% | FAST |
| I128 | I32 | 2^20 | 14.195 us | 1.87% | 14.194 us | 1.96% | -0.001 us | -0.00% | SAME | 8.188 us | 9.372 us | 1.184 us | 14.46% | SLOW |
| I128 | I32 | 2^24 | 120.742 us | 0.76% | 138.904 us | 0.47% | 18.162 us | 15.04% | SLOW | 116.723 us | 134.950 us | 18.227 us | 15.62% | SLOW |
| I128 | I32 | 2^28 | 1.811 ms | 0.15% | 2.142 ms | 0.02% | 330.608 us | 18.25% | SLOW | 1.807 ms | 2.140 ms | 333.116 us | 18.44% | SLOW |
| I128 | I64 | 2^16 | 5.979 us | 5.28% | 5.991 us | 4.69% | 0.012 us | 0.21% | SAME | 4.094 us | 1.594 us | -2.500 us | -61.07% | FAST |
| I128 | I64 | 2^20 | 14.176 us | 1.96% | 14.138 us | 2.08% | -0.038 us | -0.27% | SAME | 8.188 us | 9.378 us | 1.190 us | 14.53% | SLOW |
| I128 | I64 | 2^24 | 120.699 us | 0.75% | 138.979 us | 0.35% | 18.280 us | 15.15% | SLOW | 116.648 us | 134.950 us | 18.303 us | 15.69% | SLOW |
| I128 | I64 | 2^28 | 1.811 ms | 0.15% | 2.142 ms | 0.02% | 330.955 us | 18.28% | SLOW | 1.805 ms | 2.140 ms | 334.179 us | 18.51% | SLOW |
# triad
## [0] NVIDIA B200
| T{ct} | OffsetT{ct} | Elements{io} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status | B Ref Time | B Cmp Time | B Diff | B %Diff | B Status |
|---------|---------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|--------------|--------------|------------|-----------|------------|
| I8 | I32 | 2^16 | 5.814 us | 9.73% | 5.905 us | 6.76% | 0.091 us | 1.57% | SAME | 4.094 us | 1.602 us | -2.493 us | -60.88% | FAST |
| I8 | I32 | 2^20 | 6.128 us | 6.23% | 6.365 us | 9.87% | 0.237 us | 3.87% | SAME | 4.098 us | 3.039 us | -1.059 us | -25.83% | FAST |
| I8 | I32 | 2^24 | 16.207 us | 1.78% | 16.208 us | 1.79% | 0.001 us | 0.01% | SAME | 12.294 us | 10.362 us | -1.932 us | -15.72% | FAST |
| I8 | I32 | 2^28 | 149.994 us | 0.59% | 150.658 us | 0.63% | 0.664 us | 0.44% | SAME | 147.459 us | 145.001 us | -2.458 us | -1.67% | FAST |
| I8 | I64 | 2^16 | 6.006 us | 4.56% | 5.974 us | 4.71% | -0.032 us | -0.53% | SAME | 4.094 us | 1.613 us | -2.481 us | -60.60% | FAST |
| I8 | I64 | 2^20 | 6.083 us | 5.23% | 6.343 us | 9.49% | 0.260 us | 4.27% | SAME | 4.098 us | 3.083 us | -1.015 us | -24.76% | FAST |
| I8 | I64 | 2^24 | 16.208 us | 1.73% | 16.172 us | 1.79% | -0.037 us | -0.23% | SAME | 12.294 us | 10.623 us | -1.671 us | -13.59% | FAST |
| I8 | I64 | 2^28 | 153.088 us | 0.50% | 153.434 us | 0.21% | 0.346 us | 0.23% | SLOW | 149.586 us | 147.658 us | -1.928 us | -1.29% | FAST |
| I16 | I32 | 2^16 | 5.718 us | 11.90% | 5.721 us | 11.60% | 0.003 us | 0.05% | SAME | 4.094 us | 1.605 us | -2.489 us | -60.81% | FAST |
| I16 | I32 | 2^20 | 6.924 us | 13.05% | 7.008 us | 12.94% | 0.084 us | 1.21% | SAME | 4.098 us | 2.979 us | -1.119 us | -27.30% | FAST |
| I16 | I32 | 2^24 | 22.350 us | 1.38% | 22.718 us | 2.59% | 0.369 us | 1.65% | SLOW | 16.474 us | 18.079 us | 1.605 us | 9.74% | SLOW |
| I16 | I32 | 2^28 | 239.548 us | 0.12% | 272.308 us | 0.13% | 32.761 us | 13.68% | SLOW | 235.182 us | 268.432 us | 33.249 us | 14.14% | SLOW |
| I16 | I64 | 2^16 | 5.888 us | 8.48% | 5.782 us | 9.90% | -0.106 us | -1.80% | SAME | 4.094 us | 1.592 us | -2.502 us | -61.12% | FAST |
| I16 | I64 | 2^20 | 7.166 us | 12.98% | 7.195 us | 12.81% | 0.030 us | 0.41% | SAME | 4.098 us | 3.083 us | -1.015 us | -24.78% | FAST |
| I16 | I64 | 2^24 | 22.381 us | 1.20% | 22.704 us | 2.70% | 0.322 us | 1.44% | SLOW | 17.928 us | 18.183 us | 0.255 us | 1.42% | SLOW |
| I16 | I64 | 2^28 | 241.571 us | 0.12% | 272.374 us | 0.17% | 30.803 us | 12.75% | SLOW | 237.152 us | 268.438 us | 31.286 us | 13.19% | SLOW |
| F32 | I32 | 2^16 | 5.989 us | 4.50% | 5.950 us | 4.88% | -0.039 us | -0.66% | SAME | 4.094 us | 1.594 us | -2.500 us | -61.06% | FAST |
| F32 | I32 | 2^20 | 8.021 us | 3.47% | 7.989 us | 3.89% | -0.032 us | -0.39% | SAME | 4.098 us | 2.865 us | -1.233 us | -30.09% | FAST |
| F32 | I32 | 2^24 | 36.517 us | 1.70% | 38.782 us | 0.80% | 2.266 us | 6.20% | SLOW | 32.731 us | 34.859 us | 2.128 us | 6.50% | SLOW |
| F32 | I32 | 2^28 | 453.094 us | 0.24% | 539.057 us | 0.16% | 85.962 us | 18.97% | SLOW | 452.672 us | 535.696 us | 83.024 us | 18.34% | SLOW |
| F32 | I64 | 2^16 | 5.973 us | 4.63% | 5.960 us | 5.83% | -0.014 us | -0.23% | SAME | 4.094 us | 1.606 us | -2.488 us | -60.77% | FAST |
| F32 | I64 | 2^20 | 8.038 us | 3.43% | 8.009 us | 3.70% | -0.028 us | -0.35% | SAME | 4.098 us | 2.870 us | -1.228 us | -29.96% | FAST |
| F32 | I64 | 2^24 | 36.686 us | 0.97% | 38.769 us | 0.86% | 2.083 us | 5.68% | SLOW | 32.731 us | 34.859 us | 2.127 us | 6.50% | SLOW |
| F32 | I64 | 2^28 | 452.862 us | 0.19% | 539.153 us | 0.17% | 86.292 us | 19.05% | SLOW | 452.065 us | 535.694 us | 83.629 us | 18.50% | SLOW |
| F64 | I32 | 2^16 | 5.980 us | 4.76% | 5.963 us | 4.62% | -0.016 us | -0.27% | SAME | 4.094 us | 1.609 us | -2.486 us | -60.71% | FAST |
| F64 | I32 | 2^20 | 10.062 us | 2.89% | 10.054 us | 2.78% | -0.009 us | -0.09% | SAME | 6.141 us | 5.194 us | -0.947 us | -15.43% | FAST |
| F64 | I32 | 2^24 | 64.131 us | 1.48% | 71.512 us | 0.48% | 7.381 us | 11.51% | SLOW | 59.679 us | 68.255 us | 8.576 us | 14.37% | SLOW |
| F64 | I32 | 2^28 | 907.488 us | 0.24% | 1.073 ms | 0.05% | 165.704 us | 18.26% | SLOW | 904.912 us | 1.070 ms | 164.891 us | 18.22% | SLOW |
| F64 | I64 | 2^16 | 5.975 us | 4.58% | 5.921 us | 5.35% | -0.054 us | -0.90% | SAME | 4.094 us | 1.616 us | -2.478 us | -60.53% | FAST |
| F64 | I64 | 2^20 | 10.087 us | 2.88% | 10.049 us | 3.08% | -0.039 us | -0.39% | SAME | 6.141 us | 5.190 us | -0.951 us | -15.48% | FAST |
| F64 | I64 | 2^24 | 64.740 us | 1.54% | 71.541 us | 0.52% | 6.801 us | 10.50% | SLOW | 59.532 us | 68.144 us | 8.612 us | 14.47% | SLOW |
| F64 | I64 | 2^28 | 907.463 us | 0.25% | 1.073 ms | 0.04% | 165.692 us | 18.26% | SLOW | 904.402 us | 1.070 ms | 165.408 us | 18.29% | SLOW |
| I128 | I32 | 2^16 | 5.976 us | 4.80% | 5.969 us | 5.03% | -0.007 us | -0.11% | SAME | 4.094 us | 1.596 us | -2.498 us | -61.02% | FAST |
| I128 | I32 | 2^20 | 14.157 us | 1.98% | 14.101 us | 2.22% | -0.056 us | -0.40% | SAME | 8.188 us | 9.375 us | 1.187 us | 14.50% | SLOW |
| I128 | I32 | 2^24 | 119.885 us | 0.79% | 138.877 us | 0.45% | 18.992 us | 15.84% | SLOW | 116.405 us | 134.968 us | 18.563 us | 15.95% | SLOW |
| I128 | I32 | 2^28 | 1.812 ms | 0.11% | 2.142 ms | 0.02% | 329.563 us | 18.18% | SLOW | 1.806 ms | 2.140 ms | 333.777 us | 18.48% | SLOW |
| I128 | I64 | 2^16 | 5.983 us | 4.41% | 6.001 us | 5.10% | 0.017 us | 0.29% | SAME | 4.094 us | 1.578 us | -2.516 us | -61.46% | FAST |
| I128 | I64 | 2^20 | 14.136 us | 2.12% | 14.144 us | 2.09% | 0.008 us | 0.06% | SAME | 8.188 us | 9.380 us | 1.192 us | 14.56% | SLOW |
| I128 | I64 | 2^24 | 120.039 us | 0.76% | 138.907 us | 0.43% | 18.869 us | 15.72% | SLOW | 116.317 us | 134.990 us | 18.673 us | 16.05% | SLOW |
| I128 | I64 | 2^28 | 1.812 ms | 0.11% | 2.142 ms | 0.03% | 329.903 us | 18.21% | SLOW | 1.805 ms | 2.140 ms | 334.969 us | 18.56% | SLOW |
# nstream
## [0] NVIDIA B200
| T{ct} | OffsetT{ct} | Elements{io} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status | B Ref Time | B Cmp Time | B Diff | B %Diff | B Status |
|---------|---------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|--------------|--------------|------------|-----------|------------|
| I8 | I32 | 2^16 | 5.989 us | 4.55% | 5.949 us | 5.01% | -0.039 us | -0.66% | SAME | 4.094 us | 1.634 us | -2.460 us | -60.09% | FAST |
| I8 | I32 | 2^20 | 6.342 us | 9.26% | 6.419 us | 10.11% | 0.077 us | 1.21% | SAME | 4.098 us | 3.117 us | -0.981 us | -23.94% | FAST |
| I8 | I32 | 2^24 | 19.445 us | 4.85% | 19.424 us | 4.68% | -0.021 us | -0.11% | SAME | 14.343 us | 13.192 us | -1.151 us | -8.02% | FAST |
| I8 | I32 | 2^28 | 212.863 us | 0.10% | 211.458 us | 0.43% | -1.405 us | -0.66% | FAST | 209.033 us | 205.241 us | -3.792 us | -1.81% | FAST |
| I8 | I64 | 2^16 | 6.009 us | 4.13% | 5.967 us | 4.81% | -0.042 us | -0.70% | SAME | 4.094 us | 1.645 us | -2.449 us | -59.82% | FAST |
| I8 | I64 | 2^20 | 6.334 us | 8.91% | 6.337 us | 9.49% | 0.003 us | 0.05% | SAME | 4.098 us | 3.126 us | -0.972 us | -23.71% | FAST |
| I8 | I64 | 2^24 | 19.865 us | 4.15% | 19.762 us | 4.17% | -0.102 us | -0.51% | SAME | 14.343 us | 13.214 us | -1.129 us | -7.87% | FAST |
| I8 | I64 | 2^28 | 215.113 us | 0.26% | 214.918 us | 0.08% | -0.195 us | -0.09% | FAST | 212.031 us | 208.688 us | -3.342 us | -1.58% | FAST |
| I16 | I32 | 2^16 | 6.009 us | 4.15% | 5.982 us | 4.60% | -0.027 us | -0.45% | SAME | 4.094 us | 1.614 us | -2.480 us | -60.58% | FAST |
| I16 | I32 | 2^20 | 7.932 us | 6.13% | 7.939 us | 5.75% | 0.006 us | 0.08% | SAME | 4.098 us | 3.114 us | -0.984 us | -24.01% | FAST |
| I16 | I32 | 2^24 | 26.983 us | 2.80% | 28.431 us | 1.60% | 1.448 us | 5.37% | SLOW | 22.510 us | 23.762 us | 1.252 us | 5.56% | SLOW |
| I16 | I32 | 2^28 | 322.852 us | 0.30% | 362.068 us | 0.20% | 39.215 us | 12.15% | SLOW | 319.790 us | 357.594 us | 37.804 us | 11.82% | SLOW |
| I16 | I64 | 2^16 | 5.987 us | 4.39% | 5.932 us | 4.98% | -0.055 us | -0.92% | SAME | 4.094 us | 1.619 us | -2.475 us | -60.44% | FAST |
| I16 | I64 | 2^20 | 7.946 us | 6.01% | 7.860 us | 6.95% | -0.086 us | -1.08% | SAME | 4.098 us | 3.140 us | -0.958 us | -23.38% | FAST |
| I16 | I64 | 2^24 | 27.225 us | 3.23% | 28.478 us | 1.04% | 1.253 us | 4.60% | SLOW | 22.522 us | 23.750 us | 1.228 us | 5.45% | SLOW |
| I16 | I64 | 2^28 | 327.108 us | 0.26% | 361.990 us | 0.22% | 34.882 us | 10.66% | SLOW | 324.185 us | 357.615 us | 33.430 us | 10.31% | SLOW |
| F32 | I32 | 2^16 | 5.977 us | 4.80% | 5.978 us | 4.67% | 0.000 us | 0.01% | SAME | 4.094 us | 1.632 us | -2.462 us | -60.13% | FAST |
| F32 | I32 | 2^20 | 8.471 us | 7.81% | 8.566 us | 8.58% | 0.095 us | 1.12% | SAME | 4.514 us | 3.815 us | -0.699 us | -15.48% | FAST |
| F32 | I32 | 2^24 | 46.447 us | 1.83% | 49.517 us | 1.50% | 3.070 us | 6.61% | SLOW | 41.011 us | 46.008 us | 4.997 us | 12.19% | SLOW |
| F32 | I32 | 2^28 | 598.549 us | 0.15% | 717.324 us | 0.12% | 118.774 us | 19.84% | SLOW | 592.635 us | 713.743 us | 121.108 us | 20.44% | SLOW |
| F32 | I64 | 2^16 | 5.988 us | 4.55% | 5.959 us | 4.85% | -0.029 us | -0.49% | SAME | 4.094 us | 1.614 us | -2.480 us | -60.57% | FAST |
| F32 | I64 | 2^20 | 8.514 us | 8.41% | 8.492 us | 8.07% | -0.022 us | -0.26% | SAME | 4.715 us | 3.826 us | -0.889 us | -18.86% | FAST |
| F32 | I64 | 2^24 | 46.541 us | 1.71% | 49.648 us | 1.55% | 3.107 us | 6.68% | SLOW | 41.069 us | 45.971 us | 4.902 us | 11.94% | SLOW |
| F32 | I64 | 2^28 | 599.451 us | 0.14% | 717.347 us | 0.13% | 117.896 us | 19.67% | SLOW | 593.715 us | 713.745 us | 120.029 us | 20.22% | SLOW |
| F64 | I32 | 2^16 | 5.958 us | 4.65% | 5.965 us | 4.72% | 0.007 us | 0.11% | SAME | 4.094 us | 1.625 us | -2.469 us | -60.31% | FAST |
| F64 | I32 | 2^20 | 11.491 us | 7.71% | 11.504 us | 7.48% | 0.013 us | 0.11% | SAME | 6.141 us | 5.222 us | -0.919 us | -14.96% | FAST |
| F64 | I32 | 2^24 | 84.027 us | 0.80% | 84.034 us | 0.92% | 0.007 us | 0.01% | SAME | 77.789 us | 74.972 us | -2.817 us | -3.62% | FAST |
| F64 | I32 | 2^28 | 1.184 ms | 0.05% | 1.184 ms | 0.05% | 0.029 us | 0.00% | SAME | 1.176 ms | 1.173 ms | -2.920 us | -0.25% | SAME |
| F64 | I64 | 2^16 | 5.966 us | 4.90% | 5.963 us | 4.86% | -0.003 us | -0.06% | SAME | 4.094 us | 1.637 us | -2.457 us | -60.02% | FAST |
| F64 | I64 | 2^20 | 11.513 us | 7.79% | 11.494 us | 7.78% | -0.019 us | -0.17% | SAME | 6.141 us | 5.216 us | -0.925 us | -15.06% | FAST |
| F64 | I64 | 2^24 | 84.030 us | 0.82% | 84.060 us | 0.76% | 0.030 us | 0.04% | SAME | 77.804 us | 74.994 us | -2.810 us | -3.61% | FAST |
| F64 | I64 | 2^28 | 1.184 ms | 0.05% | 1.184 ms | 0.06% | 0.075 us | 0.01% | SAME | 1.176 ms | 1.173 ms | -2.845 us | -0.24% | SAME |
| I128 | I32 | 2^16 | 6.066 us | 5.68% | 6.017 us | 5.84% | -0.049 us | -0.81% | SAME | 4.094 us | 1.592 us | -2.502 us | -61.11% | FAST |
| I128 | I32 | 2^20 | 16.358 us | 2.88% | 16.406 us | 2.93% | 0.048 us | 0.29% | SAME | 8.188 us | 9.399 us | 1.211 us | 14.79% | SLOW |
| I128 | I32 | 2^24 | 156.487 us | 0.62% | 156.500 us | 0.61% | 0.014 us | 0.01% | SAME | 150.507 us | 147.953 us | -2.554 us | -1.70% | FAST |
| I128 | I32 | 2^28 | 2.356 ms | 0.04% | 2.356 ms | 0.04% | -0.046 us | -0.00% | SAME | 2.349 ms | 2.347 ms | -1.491 us | -0.06% | SAME |
| I128 | I64 | 2^16 | 6.084 us | 6.74% | 6.073 us | 6.19% | -0.010 us | -0.17% | SAME | 4.094 us | 1.607 us | -2.487 us | -60.74% | FAST |
| I128 | I64 | 2^20 | 16.395 us | 2.82% | 16.393 us | 2.69% | -0.002 us | -0.01% | SAME | 8.188 us | 9.410 us | 1.222 us | 14.92% | SLOW |
| I128 | I64 | 2^24 | 156.555 us | 0.64% | 156.506 us | 0.62% | -0.049 us | -0.03% | SAME | 150.525 us | 147.965 us | -2.560 us | -1.70% | FAST |
| I128 | I64 | 2^28 | 2.356 ms | 0.04% | 2.356 ms | 0.04% | -0.012 us | -0.00% | SAME | 2.349 ms | 2.347 ms | -1.875 us | -0.08% | SAME |
# Summary
- Total Matches: 160
- Pass (diff <= min_noise): 130
- Unknown (infinite noise): 0
- Failure (diff > min_noise): 190
With 15 columns, the table becomes a bit unwieldy. We could drop the Diff and B Diff columns to narrow it. Alternatively, we could emit two rows per benchmark, one for cold and one for batch.
I like the idea of splitting them onto a new line; I think it'd be cleaner.
Or split them into separate tables? That way you could still quickly scan a column to check for outliers. That would be harder if the rows alternated between cold and batch timings.
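To make the separate-tables option concrete, here is a minimal sketch of how the combined rows could be split into one cold table and one batch table. It is not taken from nvbench_compare.py: the helper names `split_tables` and `to_markdown` and the dict-per-row layout are assumptions for illustration only; the column names are copied from the output above.

```python
# Column groups as they appear in the combined table above.
AXES = ["T{ct}", "OffsetT{ct}", "Elements{io}"]
COLD = ["Ref Time", "Ref Noise", "Cmp Time", "Cmp Noise", "Diff", "%Diff", "Status"]
BATCH = ["B Ref Time", "B Cmp Time", "B Diff", "B %Diff", "B Status"]


def split_tables(rows):
    """Split combined rows (dicts keyed by column name) into cold and batch rows.

    Hypothetical helper, not part of nvbench_compare.py.
    """
    cold = [{k: r[k] for k in AXES + COLD} for r in rows]
    # Strip the "B " prefix; a separate batch table no longer needs it.
    batch = [
        {**{k: r[k] for k in AXES}, **{k[2:]: r[k] for k in BATCH}}
        for r in rows
    ]
    return cold, batch


def to_markdown(rows):
    """Render a non-empty list of dicts as a Markdown pipe table."""
    headers = list(rows[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join("---" for _ in headers) + "|",
    ]
    lines += ["| " + " | ".join(str(r[h]) for h in headers) + " |" for r in rows]
    return "\n".join(lines)


# First row of the "mul" table above, used as sample input.
row = dict(zip(AXES + COLD + BATCH, [
    "I8", "I32", "2^16",
    "5.729 us", "11.77%", "5.734 us", "11.65%", "0.005 us", "0.08%", "SAME",
    "4.094 us", "1.622 us", "-2.472 us", "-60.39%", "FAST",
]))
cold, batch = split_tables([row])
print(to_markdown(cold))
print()
print(to_markdown(batch))
```

Each table keeps the axis columns, so a reader can still scan one Status column at a time, and the batch table shrinks to eight columns.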