Implement `cub::DeviceFind::FindIf`
This draft tracks progress on `cub::DeviceFind::FindIf`, which should ultimately be used to improve `thrust::all_of`.
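The link between the two algorithms can be sketched on the host with the standard library: `all_of(pred)` holds exactly when a search for an element that violates `pred` comes back empty. The helper names `find_if_index` and `all_of_via_find_if` are hypothetical, and `std::find_if` only stands in for the device-side search; this is an illustration of the idea, not the Thrust or CUB implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the first element satisfying pred, or v.size() if
// there is none (the same "not found" convention as the kernel's default).
template <class T, class Pred>
std::size_t find_if_index(const std::vector<T>& v, Pred pred)
{
  return static_cast<std::size_t>(std::find_if(v.begin(), v.end(), pred) - v.begin());
}

// all_of(pred) is true exactly when no element violates pred, so it can be
// answered by a find_if over the negated predicate.
template <class T, class Pred>
bool all_of_via_find_if(const std::vector<T>& v, Pred pred)
{
  return find_if_index(v, [&](const T& x) { return !pred(x); }) == v.size();
}

// Usage example; the values are illustrative only.
inline void usage_example()
{
  const std::vector<int> all_even{2, 4, 6};
  const std::vector<int> one_odd{2, 5, 6};
  const auto is_even = [](int x) { return x % 2 == 0; };
  assert(all_of_via_find_if(all_even, is_even));
  assert(!all_of_via_find_if(one_odd, is_even));
  assert(find_if_index(one_odd, [](int x) { return x % 2 != 0; }) == 1);
}
```

Because the search can stop at the first violating element, an early-exit `FindIf` can avoid the full pass over the input that a reduction-based `all_of` would make.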
# Algorithm Description
The kernel is based on early cancellation through a global atomic, `result`, that works as a flag. Once its default value (`input.size()`) has been overwritten by an atomic min, the new value becomes visible to the other CTAs, which then skip any further iterations.

    if (threadIdx.x == 0)
    {
      // atomic read of the global result (adding 0 leaves it unchanged)
      sresult = atomicAdd(result, 0);
    }
    __syncthreads();

    // early exit
    if (sresult < tile_offset)
    {
      return;
    }

For that to happen, every block collects a local minimum index, `block_result`, from the threads whose predicate returned true:

    if (index < num_items)
    {
      if (pred(input_items[i]))
      {
        found = true;
        atomicMin(&block_result, index);
        break; // Each thread goes over multiple elements per tile. Once a
               // thread has found a match, it does not need to proceed
               // further (inner early exit).
      }
    }

At the end of the loop body, the first thread of every block applies an atomic min to the global minimum index variable:

    if (__syncthreads_or(found))
    {
      if (threadIdx.x == 0)
      {
        if (block_result < num_items)
        {
          atomicMin(result, block_result);
        }
      }
    }

It is an algorithm that heavily favors cases where the "to-be-found" elements are ubiquitous, because an early match lets the remaining blocks exit after very little work.
Notes:
- To achieve proper synchronization, every block reads a local copy of the global result (which other blocks may have updated in previous iterations) into the shared-memory variable `sresult` and then checks whether it should exit.
- If the other blocks did not get to update the global result early enough, then, because the check for the global early exit sits at the beginning of the loop, the worst case is that a block performs one redundant iteration before exiting.
- `block_result` should not be confused with `sresult`. Both are shared-memory variables that hold a minimum index, but `sresult` is used to read from the global atomic, while `block_result` is used to write to it.
- The algorithm uses vectorized loads for small input types (a split implementation for large input types is to be added).
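The control flow described above can be followed in a single-threaded host-side emulation. This is only a sketch under stated assumptions, not the kernel itself: the names `FindIfTrace` and `find_if_emulated` are hypothetical, blocks are modeled as running in lockstep rounds (every block reads the global result before any block writes in that round), `std::min` stands in for `atomicMin`, and vectorized loads are left out. The returned `tiles_scanned` counter makes the effect of the early exit, and of the redundant work in the round where the match is found, visible.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct FindIfTrace
{
  std::size_t result;        // index of the first match, or num_items if none
  std::size_t tiles_scanned; // tiles whose elements were actually inspected
};

// Hypothetical sequential stand-in for the kernel: block b handles tile
// (round * num_blocks + b), mirroring a grid-strided loop over tiles.
template <class T, class Pred>
FindIfTrace find_if_emulated(const std::vector<T>& input, Pred pred,
                             std::size_t num_blocks, std::size_t tile_size)
{
  assert(num_blocks > 0 && tile_size > 0);
  const std::size_t num_items    = input.size();
  const std::size_t round_stride = num_blocks * tile_size;
  std::size_t result             = num_items; // default value: "not found"
  std::size_t tiles_scanned      = 0;

  for (std::size_t round_offset = 0; round_offset < num_items; round_offset += round_stride)
  {
    // Read phase: every block takes its snapshot (sresult) of the global
    // result before any block of this round writes to it.
    const std::size_t sresult = result;

    for (std::size_t block = 0; block < num_blocks; ++block)
    {
      const std::size_t tile_offset = round_offset + block * tile_size;
      if (tile_offset >= num_items)
      {
        break; // no tile left for this block or any later one
      }
      if (sresult < tile_offset)
      {
        continue; // early exit: a match before this tile is already known
      }

      // Scan phase: smallest matching index in this tile (block_result).
      ++tiles_scanned;
      std::size_t block_result  = num_items;
      const std::size_t tile_end = std::min(tile_offset + tile_size, num_items);
      for (std::size_t index = tile_offset; index < tile_end; ++index)
      {
        if (pred(input[index]))
        {
          block_result = index;
          break; // inner early exit: later indices cannot be smaller
        }
      }

      // Write phase: publish the local minimum (atomicMin in the kernel).
      if (block_result < num_items)
      {
        result = std::min(result, block_result);
      }
    }
  }
  return FindIfTrace{result, tiles_scanned};
}
```

With 4096 elements, 4 blocks, and a tile size of 256, an input whose first match sits at index 1000 is resolved after the first round (4 of 16 tiles scanned), whereas an input without any match requires all 16 tiles. A match at index 0 still costs 4 tiles in this model, which corresponds to the redundant iteration mentioned in the notes.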
# Benchmark Results
## cub::FindIf
### [0] NVIDIA H200
| T | Elements | RelativeMismatchPosition | Samples | CPU Time | Noise | GPU Time | Noise |
|-----|------------------|--------------------------|---------|------------|--------|------------|-------|
| I32 | 2^16 = 65536 | 1 | 22928x | 26.790 us | 22.92% | 21.814 us | 2.06% |
| I32 | 2^20 = 1048576 | 1 | 19632x | 31.313 us | 23.28% | 25.470 us | 3.40% |
| I32 | 2^24 = 16777216 | 1 | 10800x | 51.443 us | 11.07% | 46.351 us | 1.28% |
| I32 | 2^28 = 268435456 | 1 | 1074x | 471.441 us | 1.32% | 465.567 us | 0.36% |
| I32 | 2^16 = 65536 | 0.5 | 23024x | 26.862 us | 23.81% | 21.717 us | 2.03% |
| I32 | 2^20 = 1048576 | 0.5 | 19792x | 31.141 us | 23.58% | 25.273 us | 3.86% |
| I32 | 2^24 = 16777216 | 0.5 | 17152x | 34.301 us | 17.73% | 29.160 us | 1.59% |
| I32 | 2^28 = 268435456 | 0.5 | 2064x | 247.413 us | 2.15% | 242.283 us | 0.35% |
| I32 | 2^16 = 65536 | 0 | 54416x | 14.286 us | 55.68% | 9.189 us | 3.01% |
| I32 | 2^20 = 1048576 | 0 | 52288x | 14.649 us | 53.34% | 9.565 us | 3.21% |
| I32 | 2^24 = 16777216 | 0 | 33568x | 19.990 us | 34.29% | 14.901 us | 1.96% |
| I32 | 2^28 = 268435456 | 0 | 33632x | 20.006 us | 34.67% | 14.874 us | 2.54% |
## thrust::count_if
### [0] NVIDIA H200
| T | Elements | RelativeMismatchPosition | Samples | CPU Time | Noise | GPU Time | Noise |
|-----|------------------|--------------------------|---------|------------|--------|------------|-------|
| I32 | 2^16 = 65536 | 1 | 24528x | 24.829 us | 22.00% | 20.390 us | 2.94% |
| I32 | 2^20 = 1048576 | 1 | 22480x | 26.721 us | 20.33% | 22.244 us | 2.62% |
| I32 | 2^24 = 16777216 | 1 | 11856x | 46.720 us | 10.84% | 42.222 us | 1.89% |
| I32 | 2^28 = 268435456 | 1 | 1855x | 274.137 us | 1.77% | 269.582 us | 0.49% |
| I32 | 2^16 = 65536 | 0.5 | 24176x | 25.121 us | 21.73% | 20.685 us | 3.23% |
| I32 | 2^20 = 1048576 | 0.5 | 22048x | 27.172 us | 19.97% | 22.689 us | 2.65% |
| I32 | 2^24 = 16777216 | 0.5 | 11744x | 47.062 us | 10.71% | 42.587 us | 1.97% |
| I32 | 2^28 = 268435456 | 0.5 | 1859x | 273.367 us | 1.69% | 268.966 us | 0.41% |
| I32 | 2^16 = 65536 | 0 | 24160x | 25.209 us | 22.02% | 20.707 us | 3.19% |
| I32 | 2^20 = 1048576 | 0 | 22144x | 27.092 us | 20.12% | 22.594 us | 2.75% |
| I32 | 2^24 = 16777216 | 0 | 11744x | 47.111 us | 10.85% | 42.584 us | 2.02% |
| I32 | 2^28 = 268435456 | 0 | 1862x | 273.046 us | 1.70% | 268.609 us | 0.38% |
until we make sure that it performs better than reduce.
I would love to see a benchmark comparison of thrust::all_of before and after your improvements at some point!
Initial bench results are encouraging.
Do I understand correctly, that cub::FindIf should outperform thrust::count_if? Because I am seeing the opposite on some of the numbers you posted.
The name CommonPrefixRatio comes from the old thrust::equals benchmark, where it denoted the fraction of equal elements at the beginning of both ranges. A CommonPrefixRatio of 1 meant all elements were the same in both ranges. A value of 0.5 meant that the first 50% of the range was equal, after which the mismatch occurred.
For your benchmark you could name this value RelativeMismatchPosition, or MismatchAt, or something like that. It's about where you expect the mismatch to happen.
@bernhardmgruber Because of the early exit, cub::FindIf is expected to perform better than thrust::count_if, at least in the middle cases with RelativeMismatchPosition = 0.5 (thanks for the name suggestion). Along the input-size axis, this is true for small and medium-sized inputs, but not for large inputs.
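To make the comparison concrete, the GPU times from the two benchmark tables can be turned into per-configuration speedups. The sketch below only copies the posted H200 numbers; the names `Row`, `speedup`, and `find_if_wins` are hypothetical helpers, not part of the benchmark.

```cpp
#include <array>
#include <cassert>

// One benchmark configuration; GPU times in microseconds, copied from the
// cub::FindIf and thrust::count_if tables.
struct Row
{
  double position;    // RelativeMismatchPosition
  int log2_elements;  // input size is 2^log2_elements
  double find_if_us;  // cub::FindIf GPU time
  double count_if_us; // thrust::count_if GPU time
};

const std::array<Row, 12> rows = {{
  {1.0, 16, 21.814, 20.390},
  {1.0, 20, 25.470, 22.244},
  {1.0, 24, 46.351, 42.222},
  {1.0, 28, 465.567, 269.582},
  {0.5, 16, 21.717, 20.685},
  {0.5, 20, 25.273, 22.689},
  {0.5, 24, 29.160, 42.587},
  {0.5, 28, 242.283, 268.966},
  {0.0, 16, 9.189, 20.707},
  {0.0, 20, 9.565, 22.594},
  {0.0, 24, 14.901, 42.584},
  {0.0, 28, 14.874, 268.609},
}};

// Values above 1 mean cub::FindIf was faster than thrust::count_if.
inline double speedup(const Row& r)
{
  return r.count_if_us / r.find_if_us;
}

// Number of input sizes at the given position for which cub::FindIf was faster.
inline int find_if_wins(double position)
{
  int wins = 0;
  for (const Row& r : rows)
  {
    if (r.position == position && speedup(r) > 1.0)
    {
      ++wins;
    }
  }
  return wins;
}

// Usage example: tally the wins per mismatch position.
inline void usage_example()
{
  assert(find_if_wins(0.0) == 4); // faster at every size
  assert(find_if_wins(0.5) == 2); // faster at 2^24 and 2^28 only
  assert(find_if_wins(1.0) == 0); // slower at every size
}
```

On the posted numbers, cub::FindIf is ahead for every size at position 0, ahead for the two larger sizes at position 0.5, and behind for every size at position 1, where no early exit is possible.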
🟨 CI finished in 1h 52m: Pass: 97%/259 | Total: 1d 08h | Avg: 7m 28s | Max: 35m 24s | Hits: 99%/20079
🟨 cub: Pass: 95%/136 | Total: 19h 26m | Avg: 8m 34s | Max: 35m 24s
🟩 thrust: Pass: 100%/122 | Total: 12h 35m | Avg: 6m 11s | Max: 21m 04s | Hits: 99%/20079
🟩 pycuda: Pass: 100%/1 | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s
👃 Inspect Changes
Modifications in project?
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CUDA C Core Library |
Modifications in project or dependencies?
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | pycuda |
| +/- | CUDA C Core Library |
🏃 Runner counts (total jobs: 259)
| # | Runner |
|---|---|
| 186 | linux-amd64-cpu16 |
| 42 | linux-amd64-gpu-v100-latest-1 |
| 16 | linux-arm64-cpu16 |
| 15 | windows-amd64-cpu16 |
🟨 CI finished in 4h 42m: Pass: 94%/259 | Total: 5d 01h | Avg: 28m 13s | Max: 1h 39m | Hits: 99%/20079
🟨 cub: Pass: 88%/136 | Total: 4d 12h | Avg: 47m 41s | Max: 1h 39m
🟩 thrust: Pass: 100%/122 | Total: 13h 27m | Avg: 6m 37s | Max: 22m 34s | Hits: 99%/20079
🟩 pycuda: Pass: 100%/1 | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s
🟨 CI finished in 8h 06m: Pass: 94%/259 | Total: 5d 02h | Avg: 28m 19s | Max: 1h 34m | Hits: 99%/20079
🟨 cub: Pass: 88%/136 | Total: 4d 12h | Avg: 47m 44s | Max: 1h 34m
🟩 thrust: Pass: 100%/122 | Total: 13h 45m | Avg: 6m 46s | Max: 25m 49s | Hits: 99%/20079
🟩 pycuda: Pass: 100%/1 | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
🟨 CI finished in 2h 35m: Pass: 97%/259 | Total: 5d 01h | Avg: 28m 11s | Max: 1h 41m | Hits: 99%/20079
🟨 cub: Pass: 95%/136 | Total: 4d 12h | Avg: 47m 51s | Max: 1h 41m
- 🟩 thrust: Pass: 100%/122 | Total: 12h 58m | Avg: 6m 22s | Max: 25m 02s | Hits: 99%/20079
  - 🟩 cpu
    - 🟩 amd64 Pass: 100%/114 | Total: 12h 23m | Avg: 6m 31s | Max: 25m 02s | Hits: 99%/20079
    - 🟩 arm64 Pass: 100%/8 | Total: 34m 44s | Avg: 4m 20s | Max: 4m 53s
  - 🟩 ctk
    - 🟩 11.1 Pass: 100%/15 | Total: 1h 10m | Avg: 4m 40s | Max: 18m 21s | Hits: 99%/2231
    - 🟩 11.8 Pass: 100%/3 | Total: 14m 34s | Avg: 4m 51s | Max: 5m 31s
    - 🟩 12.6 Pass: 100%/104 | Total: 11h 33m | Avg: 6m 40s | Max: 25m 02s | Hits: 99%/17848
  - 🟩 cudacxx
    - 🟩 ClangCUDA18 Pass: 100%/2 | Total: 9m 15s | Avg: 4m 37s | Max: 4m 47s
    - 🟩 nvcc11.1 Pass: 100%/15 | Total: 1h 10m | Avg: 4m 40s | Max: 18m 21s | Hits: 99%/2231
    - 🟩 nvcc11.8 Pass: 100%/3 | Total: 14m 34s | Avg: 4m 51s | Max: 5m 31s
    - 🟩 nvcc12.6 Pass: 100%/102 | Total: 11h 24m | Avg: 6m 42s | Max: 25m 02s | Hits: 99%/17848
  - 🟩 cudacxx_family
    - 🟩 ClangCUDA Pass: 100%/2 | Total: 9m 15s | Avg: 4m 37s | Max: 4m 47s
    - 🟩 nvcc Pass: 100%/120 | Total: 12h 49m | Avg: 6m 24s | Max: 25m 02s | Hits: 99%/20079
  - 🟩 cxx
    - 🟩 Clang9 Pass: 100%/6 | Total: 27m 33s | Avg: 4m 35s | Max: 5m 30s
    - 🟩 Clang10 Pass: 100%/3 | Total: 17m 02s | Avg: 5m 40s | Max: 6m 05s
    - 🟩 Clang11 Pass: 100%/4 | Total: 18m 49s | Avg: 4m 42s | Max: 5m 01s
    - 🟩 Clang12 Pass: 100%/4 | Total: 18m 24s | Avg: 4m 36s | Max: 4m 50s
    - 🟩 Clang13 Pass: 100%/4 | Total: 18m 13s | Avg: 4m 33s | Max: 5m 09s
    - 🟩 Clang14 Pass: 100%/4 | Total: 18m 14s | Avg: 4m 33s | Max: 5m 03s
    - 🟩 Clang15 Pass: 100%/4 | Total: 19m 02s | Avg: 4m 45s | Max: 5m 01s
    - 🟩 Clang16 Pass: 100%/4 | Total: 18m 31s | Avg: 4m 37s | Max: 4m 44s
    - 🟩 Clang17 Pass: 100%/4 | Total: 18m 15s | Avg: 4m 33s | Max: 4m 51s
    - 🟩 Clang18 Pass: 100%/18 | Total: 2h 06m | Avg: 7m 01s | Max: 15m 32s
    - 🟩 GCC6 Pass: 100%/2 | Total: 6m 51s | Avg: 3m 25s | Max: 3m 33s
    - 🟩 GCC7 Pass: 100%/6 | Total: 23m 06s | Avg: 3m 51s | Max: 4m 23s
    - 🟩 GCC8 Pass: 100%/6 | Total: 24m 29s | Avg: 4m 04s | Max: 4m 46s
    - 🟩 GCC9 Pass: 100%/6 | Total: 24m 27s | Avg: 4m 04s | Max: 5m 02s
    - 🟩 GCC10 Pass: 100%/4 | Total: 17m 24s | Avg: 4m 21s | Max: 4m 39s
    - 🟩 GCC11 Pass: 100%/7 | Total: 33m 24s | Avg: 4m 46s | Max: 5m 31s
    - 🟩 GCC12 Pass: 100%/4 | Total: 19m 11s | Avg: 4m 47s | Max: 5m 08s
    - 🟩 GCC13 Pass: 100%/20 | Total: 2h 18m | Avg: 6m 54s | Max: 15m 11s
    - 🟩 Intel2023.2.0 Pass: 100%/3 | Total: 17m 10s | Avg: 5m 43s | Max: 5m 57s
    - 🟩 MSVC14.16 Pass: 100%/1 | Total: 18m 21s | Avg: 18m 21s | Max: 18m 21s | Hits: 99%/2231
    - 🟩 MSVC14.29 Pass: 100%/2 | Total: 33m 52s | Avg: 16m 56s | Max: 18m 05s | Hits: 99%/4462
    - 🟩 MSVC14.39 Pass: 100%/6 | Total: 2h 01m | Avg: 20m 17s | Max: 25m 02s | Hits: 99%/13386
  - 🟩 cxx_family
    - 🟩 Clang Pass: 100%/55 | Total: 5h 00m | Avg: 5m 27s | Max: 15m 32s
    - 🟩 GCC Pass: 100%/55 | Total: 4h 46m | Avg: 5m 13s | Max: 15m 11s
    - 🟩 Intel Pass: 100%/3 | Total: 17m 10s | Avg: 5m 43s | Max: 5m 57s
    - 🟩 MSVC Pass: 100%/9 | Total: 2h 53m | Avg: 19m 19s | Max: 25m 02s | Hits: 99%/20079
  - 🟩 gpu
    - 🟩 v100 Pass: 100%/122 | Total: 12h 58m | Avg: 6m 22s | Max: 25m 02s | Hits: 99%/20079
  - 🟩 jobs
    - 🟩 Build Pass: 100%/103 | Total: 9h 02m | Avg: 5m 15s | Max: 18m 21s | Hits: 99%/13386
    - 🟩 TestCPU Pass: 100%/11 | Total: 2h 04m | Avg: 11m 18s | Max: 25m 02s | Hits: 99%/6693
    - 🟩 TestGPU Pass: 100%/8 | Total: 1h 51m | Avg: 13m 58s | Max: 15m 32s
  - 🟩 sm
    - 🟩 60;70;80;90 Pass: 100%/3 | Total: 14m 34s | Avg: 4m 51s | Max: 5m 31s
    - 🟩 90a Pass: 100%/4 | Total: 15m 40s | Avg: 3m 55s | Max: 4m 15s
  - 🟩 std
    - 🟩 11 Pass: 100%/31 | Total: 2h 26m | Avg: 4m 43s | Max: 10m 45s
    - 🟩 14 Pass: 100%/35 | Total: 3h 58m | Avg: 6m 49s | Max: 21m 38s | Hits: 99%/8924
    - 🟩 17 Pass: 100%/34 | Total: 3h 52m | Avg: 6m 50s | Max: 22m 44s | Hits: 99%/6693
    - 🟩 20 Pass: 100%/22 | Total: 2h 40m | Avg: 7m 18s | Max: 25m 02s | Hits: 99%/4462
- 🟩 pycuda: Pass: 100%/1 | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
  - 🟩 cpu
    - 🟩 amd64 Pass: 100%/1 | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
  - 🟩 ctk
    - 🟩 12.5 Pass: 100%/1 | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
  - 🟩 cudacxx
    - 🟩 nvcc12.5 Pass: 100%/1 | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
  - 🟩 cudacxx_family
    - 🟩 nvcc Pass: 100%/1 | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
  - 🟩 cxx
    - 🟩 GCC13 Pass: 100%/1 | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
  - 🟩 cxx_family
    - 🟩 GCC Pass: 100%/1 | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
  - 🟩 gpu
    - 🟩 v100 Pass: 100%/1 | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
  - 🟩 jobs
    - 🟩 Test Pass: 100%/1 | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
👃 Inspect Changes
Modifications in project?
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CUDA C Core Library |
Modifications in project or dependencies?
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | pycuda |
| +/- | CUDA C Core Library |
🏃 Runner counts (total jobs: 259)
| # | Runner |
|---|---|
| 186 | linux-amd64-cpu16 |
| 42 | linux-amd64-gpu-v100-latest-1 |
| 16 | linux-arm64-cpu16 |
| 15 | windows-amd64-cpu16 |
Performance results of thrust::count_if (Ref) vs cub::DeviceFind::FindIf (Cmp), run with the I8 input type to test the performance of the newly introduced vectorized loads:
## [0] NVIDIA H200
| T | Elements | MismatchAt | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff |
| --- | -------- | ---------- | --------- | --------- | --------- | --------- | --------- | ------- |
| I8 | 2^16 | 1 | 19.937 us | 9.99% | 10.860 us | 3.43% | -9.077 us | -45.53% |
| I8 | 2^20 | 1 | 20.740 us | 2.16% | 13.999 us | 3.34% | -6.741 us | -32.50% |
| I8 | 2^24 | 1 | 31.782 us | 2.21% | 22.034 us | 1.84% | -9.718 us | -30.60% |
| I8 | 2^28 | 1 | 137.999 us| 6.49% | 177.938 us| 2.55% | 39.938 us | 28.94% |
| I8 | 2^16 | 0.5 | 19.962 us | 2.26% | 10.860 us | 2.51% | -9.002 us | -45.10% |
| I8 | 2^20 | 0.5 | 21.056 us | 2.79% | 13.979 us | 5.12% | -7.078 us | -33.61% |
| I8 | 2^24 | 0.5 | 31.876 us | 1.51% | 16.793 us | 2.42% | -15.083 us | -47.32% |
| I8 | 2^28 | 0.5 | 137.932 us| 0.54% | 96.730 us | 0.64% | -41.202 us | -29.87% |
| I8 | 2^16 | 0 | 20.016 us | 2.77% | 9.661 us | 2.99% | -10.375 us | -51.83% |
| I8 | 2^20 | 0 | 20.981 us | 2.70% | 10.298 us | 3.69% | -10.654 us | -50.85% |
| I8 | 2^24 | 0 | 31.743 us | 1.86% | 14.190 us | 2.16% | -17.553 us | -55.30% |
| I8 | 2^28 | 0 | 137.975 us| 0.49% | 14.186 us | 2.96% | -123.789 us| -89.72% |
When MismatchAt == 1, the predicate is satisfied only by the last element of the range. When MismatchAt == 0, every element in the range satisfies the predicate. The latter case benefits the most from the newly introduced algorithm, since the search can exit early.
thrust::count_if is only better when the whole range needs to be iterated (MismatchAt == 1) and the input is large (2^28 elements).
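To make the MismatchAt parameter concrete, here is a minimal host-side C++ sketch of how such an input could be built. The helper names `make_input` and `first_match`, the choice of `x != 0` as the predicate, and the rounding of the first matching index are assumptions for illustration, not taken from the benchmark source. `std::find_if` stands in for the device algorithm.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: build n int8 items where the predicate (x != 0)
// first holds at index floor(mismatch_at * n), clamped to the last element,
// and keeps holding until the end of the range.
// Assumes n > 0 and 0.0 <= mismatch_at <= 1.0.
std::vector<std::int8_t> make_input(std::size_t n, double mismatch_at)
{
  const std::size_t first =
    std::min(n - 1, static_cast<std::size_t>(mismatch_at * static_cast<double>(n)));
  std::vector<std::int8_t> items(n, 0);
  std::fill(items.begin() + static_cast<std::ptrdiff_t>(first), items.end(), std::int8_t{1});
  return items;
}

// Index of the first item satisfying the predicate, or items.size() if none.
std::size_t first_match(const std::vector<std::int8_t>& items)
{
  const auto it = std::find_if(items.begin(), items.end(),
                               [](std::int8_t x) { return x != 0; });
  return static_cast<std::size_t>(it - items.begin());
}
```

Under these assumptions, `first_match(make_input(8, 1.0))` returns 7 (the whole range is scanned), while `first_match(make_input(8, 0.0))` returns 0 (the search can stop at the first element).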
Some long-awaited performance results on A6000 and H200 (extending @gevtushenko's work in #1870):
Search Operation
cub::DeviceFind::FindIf, thrust::find_if and thrust::count_if are each used as the backend to search an int32 input for a key placed at a given percentage of the input, from 0% to 100%.
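The difference between a find-style and a count-style backend for this search can be sketched on the host with the standard algorithms. This is only an illustration: `SearchResult`, `search_with_find_if` and `search_with_count_if` are hypothetical names, and the sequential `std::find_if` / `std::count_if` stand in for the device implementations, where the saving is measured in skipped tiles rather than skipped elements.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct SearchResult
{
  bool found;                  // whether the key is present
  std::size_t predicate_calls; // how many elements were inspected
};

// Find-style search: stops at the first element equal to key.
SearchResult search_with_find_if(const std::vector<std::int32_t>& data, std::int32_t key)
{
  std::size_t calls = 0;
  const auto it = std::find_if(data.begin(), data.end(), [&](std::int32_t x) {
    ++calls;
    return x == key;
  });
  return {it != data.end(), calls};
}

// Count-style search: always inspects every element, then checks the count.
SearchResult search_with_count_if(const std::vector<std::int32_t>& data, std::int32_t key)
{
  std::size_t calls = 0;
  const auto hits = std::count_if(data.begin(), data.end(), [&](std::int32_t x) {
    ++calls;
    return x == key;
  });
  return {hits > 0, calls};
}
```

With the key placed 25% into a 100-element input, the find-style search inspects 26 elements while the count-style search inspects all 100, which is the kind of gap the early-exit algorithm is meant to exploit.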
Equal Operation
cub::DeviceFind::FindIf, thrust::find_if and thrust::count_if are each used as the backend to compare two int32 vectors for equality, with a mismatch placed at a given percentage of the input, from 0% to 100%.
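One way to express equality on top of a find-style primitive is to search the index range for the first position where the two inputs differ. The host-side sketch below assumes that formulation; `first_mismatch` and `equal_via_find_if` are hypothetical names, and the materialized index vector stands in for what would more likely be a counting iterator on the device.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Index of the first position where a and b differ, or a.size() if none.
// Assumes a.size() == b.size().
std::size_t first_mismatch(const std::vector<std::int32_t>& a,
                           const std::vector<std::int32_t>& b)
{
  std::vector<std::size_t> indices(a.size());
  std::iota(indices.begin(), indices.end(), std::size_t{0});
  const auto it = std::find_if(indices.begin(), indices.end(),
                               [&](std::size_t i) { return a[i] != b[i]; });
  return static_cast<std::size_t>(it - indices.begin());
}

// Two vectors are equal when they have the same size and no mismatch is found.
bool equal_via_find_if(const std::vector<std::int32_t>& a,
                       const std::vector<std::int32_t>& b)
{
  return a.size() == b.size() && first_mismatch(a, b) == a.size();
}
```

Returning the input size for "not found" mirrors the default value of the global result described in the algorithm description, so an unchanged result means the vectors are equal.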
Many thanks to @elstehle for helping figure out this index!!!!
https://github.com/NVIDIA/cccl/blob/0364cf344c757b19366ba9f5a09448c8f0905867/cub/cub/agent/agent_find.cuh#L213-L217
Update: After refactoring the code by introducing Dispatch and Agent layers, the benchmark results look the same on my local A6000 machine.
docs to be added over the weekend
😬 CI Workflow Results
🟥 Finished in 1h 11m: Pass: 95%/182 | Total: 1d 16h | Max: 1h 05m | Hits: 99%/189210
😬 CI Workflow Results
🟥 Finished in 3h 24m: Pass: 94%/118 | Total: 1d 22h | Max: 3h 21m | Hits: 94%/160503
😬 CI Workflow Results
🟥 Finished in 4h 18m: Pass: 97%/124 | Total: 4d 02h | Max: 4h 18m | Hits: 86%/164264
😬 CI Workflow Results
🟥 Finished in 4h 24m: Pass: 94%/124 | Total: 2d 07h | Max: 4h 23m | Hits: 96%/164264
Fresh out of the oven results
😬 CI Workflow Results
🟥 Finished in 1h 32m: Pass: 14%/124 | Total: 2d 05h | Max: 1h 31m | Hits: 88%/20152
😬 CI Workflow Results
🟥 Finished in 50m 04s: Pass: 7%/136 | Total: 10h 27m | Max: 27m 05s | Hits: 91%/3852
😬 CI Workflow Results
🟥 Finished in 2h 54m: Pass: 54%/136 | Total: 3d 04h | Max: 2h 35m | Hits: 77%/77214