
Fix BlockScan accumulator type handling

Open · Aminsed opened this pull request 2 months ago • 4 comments

Summary

  • keep ThreadReduce accumulator types pinned to the block value type across BlockScan and BlockReduce
  • apply the same accumulator fix to the raking specialization so all paths use the intended type
  • add a regression test that exercises BlockScan with a functor returning a wider type

Motivation

#5668 shows that BlockScan widens the accumulator when the scan functor returns a wider type than the block value type. That implicit widening breaks user code that relies on the original type and can even select deleted overloads.

Explanation

ThreadReduce was deducing its accumulator type from the functor's return type instead of from the block value type T. The patch explicitly instantiates ThreadReduce with AccumT = T at every call site where BlockScan and BlockReduce dispatch through it, including the raking specialization. The new unit test exercises an operator that returns long long for int inputs and verifies that the accumulator remains int.

Rationale

  • Minimal surface area: the change touches only the ThreadReduce call sites; public APIs and template parameters stay the same.
  • Consistent behavior: every BlockScan reduction path now uses the same accumulator type, avoiding divergent code paths.
  • Regression coverage: the new Catch2 test guards against future regressions triggered by wider-returning operators.

Testing

  • pre-commit run --files cub/cub/block/block_scan.cuh cub/cub/block/block_reduce.cuh cub/cub/block/specializations/block_reduce_raking_commutative_only.cuh cub/test/catch2_test_block_scan.cu

Aminsed avatar Nov 02 '25 17:11 Aminsed

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Nov 02 '25 17:11 copy-pr-bot[bot]

pre-commit.ci autofix

fbusato avatar Nov 10 '25 20:11 fbusato

/ok to test 0bcd084

fbusato avatar Nov 10 '25 20:11 fbusato

😬 CI Workflow Results

🟥 Finished in 1h 14m: Pass: 25%/81 | Total: 12h 55m | Max: 28m 46s | Hits: 89%/4136

See results here.

github-actions[bot] avatar Nov 10 '25 22:11 github-actions[bot]

@fbusato I fixed thread_reduce_apply so the reduction functor is forwarded and invoked via cuda::std::invoke, which keeps the PreferredT casting while still honoring const functors (this unblocks Thrust’s key_flag_scan_op). I re-ran the full cub-cpp20 preset inside rapidsai/devcontainers:25.12-cpp-llvm20-cuda13.0; all 1,427 build/test/bench targets completed successfully. Could you please rerun CI when convenient?

Aminsed avatar Nov 16 '25 18:11 Aminsed

/ok to test bbf0792

fbusato avatar Nov 17 '25 21:11 fbusato

😬 CI Workflow Results

🟥 Finished in 4h 45m: Pass: 82%/81 | Total: 4d 03h | Max: 4h 44m | Hits: 51%/65936

See results here.

github-actions[bot] avatar Nov 18 '25 02:11 github-actions[bot]