Give inclusive_scan an overload with init
It's difficult to implement scan and other sequences of scans or sums otherwise. The init parameter can accept the "carry" from the previous sum. inclusive_scan without init can then easily be implemented in terms of the more general overload by passing the first input element as init, scanning the remaining elements, and having the first thread copy that element to the first element of the output.
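As a rough illustration of both points, here is a minimal serial sketch. It uses the C++17 `std::inclusive_scan` overload with an initial value as a stand-in for the overload requested here. The helper names `chunked_inclusive_scan` and `inclusive_scan_via_init` are hypothetical and exist only for this example; nothing below is Thrust or CUB code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper: scans `in` one chunk at a time and feeds the last
// output of each chunk into the next chunk as `init` (the "carry").
std::vector<int> chunked_inclusive_scan(const std::vector<int>& in, std::size_t chunk)
{
  assert(chunk > 0);
  std::vector<int> out(in.size());
  int carry = 0; // identity of plus, so the first chunk is unaffected
  for (std::size_t begin = 0; begin < in.size(); begin += chunk)
  {
    const std::size_t end = std::min(begin + chunk, in.size());
    const auto first = in.begin() + static_cast<std::ptrdiff_t>(begin);
    const auto last  = in.begin() + static_cast<std::ptrdiff_t>(end);
    const auto dest  = out.begin() + static_cast<std::ptrdiff_t>(begin);
    std::inclusive_scan(first, last, dest, std::plus<>{}, carry);
    carry = out[end - 1]; // running total handed to the next chunk
  }
  return out;
}

// Hypothetical helper: the no-init form expressed through the init overload.
// The first input element is copied to the output and used as `init` for
// the scan over the remaining elements.
template <class InputIt, class OutputIt, class BinaryOp>
OutputIt inclusive_scan_via_init(InputIt first, InputIt last, OutputIt result, BinaryOp op)
{
  if (first == last)
    return result;
  auto init = *first;
  *result = init;
  return std::inclusive_scan(++first, last, ++result, op, init);
}
```

With the input `{1, 2, 3, 4, 5}` and a chunk size of 2, the chunked version produces `{1, 3, 6, 10, 15}`, the same result as a single scan over the whole range.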
### Tasks
- [x] Add cub::BlockScan::InclusiveScan array-based APIs
- [x] Add DeviceScan::InclusiveScan APIs
- [ ] https://github.com/NVIDIA/cccl/issues/2146
- [ ] Add cub::BlockScan::InclusiveScan value-based APIs
This will subtly change the semantics of inclusive_scan without init, but we should do it anyway to match N3724.
Currently in Thrust, if the intermediate_type differs from the input & output types, then the first element of the result will go through a conversion:

    intermediate_type sum = *first;
    *result = sum;
    for(++first, ++result; first != last; ++first, ++result)
      *result = sum = binary_op(sum,*first);
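To make the conversion visible, here is a small sketch that runs the same loop with an explicitly chosen intermediate type. The function `scan_with_intermediate` and its `Intermediate` template parameter are hypothetical and only mirror the snippet above; this is not Thrust's actual implementation, which selects the intermediate type itself.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical reference loop: same shape as the snippet above, with the
// intermediate type passed explicitly so its effect can be observed.
template <class Intermediate, class InputIt, class OutputIt, class BinaryOp>
OutputIt scan_with_intermediate(InputIt first, InputIt last, OutputIt result, BinaryOp binary_op)
{
  if (first == last)
    return result;
  // The first element is converted to Intermediate before it is written out.
  Intermediate sum = static_cast<Intermediate>(*first);
  *result = sum;
  for (++first, ++result; first != last; ++first, ++result)
    *result = sum = static_cast<Intermediate>(binary_op(sum, *first));
  return result;
}

// double input and output, int intermediate: every output is truncated,
// including the first one, which a plain copy would have left as 1.5.
std::vector<double> demo_intermediate_conversion()
{
  const std::vector<double> in{1.5, 2.5};
  std::vector<double> out(in.size());
  scan_with_intermediate<int>(in.begin(), in.end(), out.begin(), std::plus<>{});
  return out; // {1.0, 3.0}
}
```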
After giving inclusive_scan an init parameter, the semantics of inclusive_scan would be

    inclusive_scan(..., T init, ...)
    {
      // init seeds the running sum, so the first output is binary_op(init, *first)
      for(; first != last; ++first, ++result)
        *result = init = binary_op(init, *first);
    }
In other words, the intermediate_type is simply the type of init, as in exclusive_scan.
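For comparison, the overload that was later standardized in C++17 can be used to see this effect. The sketch below calls `std::inclusive_scan` with an int input and a double init; the function name `demo_init_type` is made up for the example, and the snippet says nothing about how Thrust would implement the overload.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// int input, double init: the running sum is carried in the type of init,
// so the fractional part survives in every output element.
std::vector<double> demo_init_type()
{
  const std::vector<int> in{1, 2, 3};
  std::vector<double> out(in.size());
  std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<>{}, 0.5);
  return out; // {1.5, 3.5, 6.5}
}
```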
Better to wait and see what syntax shakes out of the Parallelism TS.
To fully implement stdpar, nvc++ needs the overloads of inclusive_scan and transform_inclusive_scan that take an initial value.
We received our first bug report from an end user about this. The lack of this function in Thrust means one of the overloads of std::inclusive_scan cannot be parallelized by nvc++ -stdpar.
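For reference, this is the shape of the standard call in question, shown here for `std::transform_inclusive_scan` with an initial value. The sketch leaves out the execution policy so that it runs serially and does not depend on any parallel backend; the stdpar form would pass `std::execution::par` as the first argument. The function name `demo_transform_scan_with_init` is made up for the example.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Squares each element, then computes a running sum that starts from init = 10.
std::vector<int> demo_transform_scan_with_init()
{
  const std::vector<int> in{1, 2, 3};
  std::vector<int> out(in.size());
  std::transform_inclusive_scan(in.begin(), in.end(), out.begin(),
                                std::plus<>{},               // binary_op
                                [](int x) { return x * x; }, // unary_op
                                10);                         // init
  return out; // {11, 15, 24}
}
```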
Any updates or ETA on this?
@brycelelbach incoming