Refactor DeviceScan implementation to allow InclusiveScan/Sum to take an initial value
There was reported an issue regarding the internal accumulator type in cub::DeviceScan::InclusiveSum. The issue consists in using input data type as accumulator type. Here's the reproducer:
#include <thrust/device_vector.h>
#include <cub/device/device_scan.cuh>
int main()
{
thrust::device_vector<std::uint8_t> d_in(260, 1);
thrust::device_vector<std::uint64_t> d_out(260);
std::size_t temp_bytes {};
cub::DeviceScan::InclusiveSum(
nullptr,
temp_bytes,
thrust::raw_pointer_cast(d_in.data()),
thrust::raw_pointer_cast(d_out.data()),
d_in.size());
thrust::device_vector<std::uint8_t> temp_storage(temp_bytes);
cub::DeviceScan::InclusiveSum(
thrust::raw_pointer_cast(temp_storage.data()),
temp_bytes,
thrust::raw_pointer_cast(d_in.data()),
thrust::raw_pointer_cast(d_out.data()),
d_in.size());
thrust::device_vector<std::uint64_t> d_ref(260);
thrust::sequence(d_ref.begin(), d_ref.end(), 1);
if (d_ref != d_out)
{
std::cerr << "Wrong result!" << std::endl;
}
return 0;
}
The naive solution would be to fix the output value type deduction. But this solution doesn't seem to conform STL:
std::vector<std::uint8_t> in(260, 1);
std::vector<std::uint64_t> out(260, 1);
// Uses the decltype(*in) as accumulator type
// std::inclusive_scan(in.begin(), in.end(), out.begin());
std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<>(), std::uint64_t{0});
std::vector<std::uint64_t> reference(260);
std::iota(reference.begin(), reference.end(), 1);
for (std::size_t i = 0; i < reference.size(); i++)
{
if (out[i] != reference[i])
{
std::cerr << out[i] << " != " << reference[i] << std::endl;
}
}
Unlike STL we don't have a way to provide an InitValue into the algorithm. So, providing an InitValue might be a better way of addressing this issue.
I think that captures it. Interesting enough, that when thrust::inclusive_scan was used, the output was correct, however, the switch was made to cub::DeviceScan because of its ability to take cuda stream argument. What would be a good way to wrkk around it? (using cuda 11.3). Lauching a kernel with a stream argument, just to run thrust::inclusive_scan(thrust::device, ...) would not be effective, would it? I have seen various examples of thrust taking a stream, but they do not seem to be documented and I could not figure out how to use them with my thrust headers.
I also found this in the forums, implying that in 11.4 the problem may have been addressed.
The accumulator type deduction is intentionally implemented to follow the recommendations in this proposal, which essentially boil down to:
- If an initial value is supplied, use its type for the accumulator
- Otherwise use the
value_typeof the input iterator.
Unfortunately, the current implementation does not easily support initial values in inclusive scans. The ideal fix here is to update the implementation to add a dedicated switch between inclusive / exclusive scan behaviors in the scan backend, rather than overloading the meaning of InitValueT.
In the meantime, you can work-around this by wrapping the input in a thrust::transform_iterator that simply casts to the desired accumulator type.
@allisonvacanti This seems to work. Thank you! Any suggestion on how to tie this a specific cuda stream?
Never mind, thank you!