Fix BlockScan accumulator type handling
## Summary
- keep `ThreadReduce` accumulator types pinned to the block value type across `BlockScan` and `BlockReduce`
- apply the same accumulator fix to the raking specialization so all paths use the intended type
- add a regression test that exercises `BlockScan` with a functor returning a wider type
## Motivation
#5668 shows that `BlockScan` widens the accumulator when the scan functor returns a wider type than the block value. That implicit widening breaks user code that relies on the original type and can even hit deleted overloads.
## Explanation
`ThreadReduce` was deducing its accumulator type from the functor instead of the block value type `T`. The patch explicitly instantiates `ThreadReduce` with `AccumT = T` everywhere `BlockScan` and `BlockReduce` dispatch through it, including the raking specialization. The new unit test exercises an operator that returns `long long` for `int` inputs and verifies the accumulator remains `int`.
## Rationale
- Minimal surface area: the change touches only the `ThreadReduce` call sites; public APIs and template parameters stay the same.
- Consistent behavior: every `BlockScan` reduction path now uses the same accumulator type, avoiding divergent code paths.
- Regression coverage: the new Catch2 test guards against future regressions triggered by wider-returning ops.
## Testing
- `pre-commit run --files cub/cub/block/block_scan.cuh cub/cub/block/block_reduce.cuh cub/cub/block/specializations/block_reduce_raking_commutative_only.cuh cub/test/catch2_test_block_scan.cu`
/ok to test 0bcd084
😬 CI Workflow Results
🟥 Finished in 1h 14m: Pass: 25%/81 | Total: 12h 55m | Max: 28m 46s | Hits: 89%/4136
@fbusato I fixed `thread_reduce_apply` so the reduction functor is forwarded and invoked via `cuda::std::invoke`, which keeps the `PreferredT` casting while still honoring const functors (this unblocks Thrust's `key_flag_scan_op`). I re-ran the full `cub-cpp20` preset inside `rapidsai/devcontainers:25.12-cpp-llvm20-cuda13.0`; all 1,427 build/test/bench targets completed successfully. Could you please rerun CI when convenient?
/ok to test bbf0792
😬 CI Workflow Results
🟥 Finished in 4h 45m: Pass: 82%/81 | Total: 4d 03h | Max: 4h 44m | Hits: 51%/65936