Fix BlockScan accumulator type handling
## Summary
- keep `ThreadReduce` accumulator types pinned to the block value type across `BlockScan` and `BlockReduce`
- apply the same accumulator fix to the raking specialization so all paths use the intended type
- add a regression test that exercises `BlockScan` with a functor returning a wider type
## Motivation
#5668 shows that `BlockScan` widens the accumulator when the scan functor returns a wider type than the block value. That implicit widening breaks user code that relies on the original type and can even hit deleted overloads.
## Explanation
`ThreadReduce` was deducing its accumulator type from the functor instead of the block value type `T`. The patch explicitly instantiates `ThreadReduce` with `AccumT = T` everywhere `BlockScan` and `BlockReduce` dispatch through it, including the raking specialization. The new unit test exercises an operator that returns `long long` for `int` inputs and verifies the accumulator remains `int`.
## Rationale
- Minimal surface area: the change touches only the `ThreadReduce` call sites; public APIs and template parameters stay the same.
- Consistent behavior: every `BlockScan` reduction path now uses the same accumulator type, avoiding divergent code paths.
- Regression coverage: the new Catch2 test guards against future regressions triggered by wider-returning ops.
## Testing
- `pre-commit run --files cub/cub/block/block_scan.cuh cub/cub/block/block_reduce.cuh cub/cub/block/specializations/block_reduce_raking_commutative_only.cuh cub/test/catch2_test_block_scan.cu`
/ok to test 0bcd084
😬 CI Workflow Results
🟥 Finished in 1h 14m: Pass: 25%/81 | Total: 12h 55m | Max: 28m 46s | Hits: 89%/4136
@fbusato I fixed `thread_reduce_apply` so the reduction functor is forwarded and invoked via `cuda::std::invoke`, which keeps the `PreferredT` casting while still honoring const functors (this unblocks Thrust's `key_flag_scan_op`). I re-ran the full `cub-cpp20` preset inside `rapidsai/devcontainers:25.12-cpp-llvm20-cuda13.0`; all 1,427 build/test/bench targets completed successfully. Could you please rerun CI when convenient?
/ok to test bbf0792
😬 CI Workflow Results
🟥 Finished in 4h 45m: Pass: 82%/81 | Total: 4d 03h | Max: 4h 44m | Hits: 51%/65936