cub icon indicating copy to clipboard operation
cub copied to clipboard

Refactor DeviceScan implementation to allow InclusiveScan/Sum to take an initial value

Open gevtushenko opened this issue 4 years ago • 5 comments

There was reported an issue regarding the internal accumulator type in cub::DeviceScan::InclusiveSum. The issue consists in using input data type as accumulator type. Here's the reproducer:

#include <thrust/device_vector.h>
#include <cub/device/device_scan.cuh>


int main()
{
    thrust::device_vector<std::uint8_t> d_in(260, 1);
    thrust::device_vector<std::uint64_t> d_out(260);

    std::size_t temp_bytes {};
    cub::DeviceScan::InclusiveSum(
            nullptr,
            temp_bytes,
            thrust::raw_pointer_cast(d_in.data()),
            thrust::raw_pointer_cast(d_out.data()),
            d_in.size());

    thrust::device_vector<std::uint8_t> temp_storage(temp_bytes);
    cub::DeviceScan::InclusiveSum(
            thrust::raw_pointer_cast(temp_storage.data()),
            temp_bytes,
            thrust::raw_pointer_cast(d_in.data()),
            thrust::raw_pointer_cast(d_out.data()),
            d_in.size());

    thrust::device_vector<std::uint64_t> d_ref(260);
    thrust::sequence(d_ref.begin(), d_ref.end(), 1);

    if (d_ref != d_out)
    {
        std::cerr << "Wrong result!" << std::endl;
    }

    return 0;
}

The naive solution would be to fix the output value type deduction. But this solution doesn't seem to conform STL:

std::vector<std::uint8_t> in(260, 1);
std::vector<std::uint64_t> out(260, 1);

// Uses the decltype(*in) as accumulator type
// std::inclusive_scan(in.begin(), in.end(), out.begin());
std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<>(), std::uint64_t{0});

std::vector<std::uint64_t> reference(260);
std::iota(reference.begin(), reference.end(), 1);

for (std::size_t i = 0; i < reference.size(); i++)
{
    if (out[i] != reference[i])
    {
        std::cerr << out[i] << " != " << reference[i] << std::endl;
    }
}

Unlike STL we don't have a way to provide an InitValue into the algorithm. So, providing an InitValue might be a better way of addressing this issue.

gevtushenko avatar Oct 06 '21 16:10 gevtushenko

I think that captures it. Interesting enough, that when thrust::inclusive_scan was used, the output was correct, however, the switch was made to cub::DeviceScan because of its ability to take cuda stream argument. What would be a good way to wrkk around it? (using cuda 11.3). Lauching a kernel with a stream argument, just to run thrust::inclusive_scan(thrust::device, ...) would not be effective, would it? I have seen various examples of thrust taking a stream, but they do not seem to be documented and I could not figure out how to use them with my thrust headers.

yuslepukhin avatar Oct 06 '21 18:10 yuslepukhin

I also found this in the forums, implying that in 11.4 the problem may have been addressed.

yuslepukhin avatar Oct 06 '21 18:10 yuslepukhin

The accumulator type deduction is intentionally implemented to follow the recommendations in this proposal, which essentially boil down to:

  • If an initial value is supplied, use its type for the accumulator
  • Otherwise use the value_type of the input iterator.

Unfortunately, the current implementation does not easily support initial values in inclusive scans. The ideal fix here is to update the implementation to add a dedicated switch between inclusive / exclusive scan behaviors in the scan backend, rather than overloading the meaning of InitValueT.

In the meantime, you can work-around this by wrapping the input in a thrust::transform_iterator that simply casts to the desired accumulator type.

alliepiper avatar Oct 06 '21 21:10 alliepiper

@allisonvacanti This seems to work. Thank you! Any suggestion on how to tie this a specific cuda stream?

yuslepukhin avatar Oct 06 '21 22:10 yuslepukhin

Never mind, thank you!

yuslepukhin avatar Oct 06 '21 22:10 yuslepukhin