thrust Device-side launch of thrust::lower_bound is creating wrong results

#include <thrust/binary_search.h>
#include <thrust/execution_policy.h>
#include <stdio.h>

__global__ void lowerbound(float inp_val) {
    constexpr int size = 6;
    float a[size] = {0.1, 0.2, 0.4, 0.6, 0.8, 1.};
    auto result = thrust::lower_bound(
        thrust::device, a, a + size, inp_val);
    printf("%ld\n", result - a);
}

int main() {
    lowerbound<<<1,1>>>(0.0);
    lowerbound<<<1,1>>>(0.1);
    lowerbound<<<1,1>>>(0.2);
    lowerbound<<<1,1>>>(0.3);
    lowerbound<<<1,1>>>(0.4);
    lowerbound<<<1,1>>>(0.5);
    cudaDeviceSynchronize();
}

I get

on CUDA 11.7 with the latest thrust

Jun 30 '22 19:06 zasdfgbnm

Related customer issue: https://github.com/pytorch/pytorch/pull/80714

Jun 30 '22 19:06 zasdfgbnm

This is likely the same issues as #1415.

Jun 30 '22 20:06 alliepiper

Wait, is the thrust::lower_bound here a CDP launch? From nsys nvprof, I only see one kernel, which is the one I wrote:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)        Name       
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  -----------------
    100.0          347,426          6  57,904.3  56,032.0    54,560    69,057      5,541.6  lowerbound(float)

Jun 30 '22 22:06 zasdfgbnm

@zasdfgbnm CDP kernels don't show up in profiles

Jul 01 '22 09:07 gevtushenko

Is this thrust regression, compiler regression or CDP regression? If it's compiler regression it might show up in other circumstances, even if users are not calling thrust::lower_bound.

Jul 06 '22 17:07 ngimel

This is definitely not a cdp launch

when compiling, rdc=true -lcudadevrt isn't required
nsys shows all 6 kernels launched, whereas when cdp is used, not only does it not show the kernels that contain cdp launches, it doesn't show all subsequent regular kernels. So either thrust source changed leading to wrong results, or compiler regressed (not related to cdp).

Also, since it's not cdp, thrust shouldn't be deprecating it (?) and thus it should be fixed, as silent wrong results are the worst.

Jul 10 '22 22:07 ngimel

Updated the title (removed CDP launch info), prioritizing for the 2.1 milestone.

Jul 25 '22 22:07 alliepiper

In an offline discussion with @allisonvacanti I was surprised to learn that thrust::device is expected to work in device code when relocatable device code (rdc) is not enabled.

In this scenario, it is expected that the algorithm will behave equivalently to thrust::seq.

I verified that explicitly using thrust::seq in @zasdfgbnm's reproducer produces the correct results:

#include <thrust/binary_search.h>
#include <thrust/execution_policy.h>
#include <stdio.h>
#include <type_traits>

__global__ void lowerbound(float inp_val) {
    constexpr int size = 6;
    float a[size] = {0.1, 0.2, 0.4, 0.6, 0.8, 1.};
    auto result = thrust::lower_bound(thrust::seq, a, a + size, inp_val);
    printf("%ld\n", result - a);
}

int main() {
    lowerbound<<<1,1>>>(0.0);
    lowerbound<<<1,1>>>(0.1);
    lowerbound<<<1,1>>>(0.2);
    lowerbound<<<1,1>>>(0.3);
    lowerbound<<<1,1>>>(0.4);
    lowerbound<<<1,1>>>(0.5);
    cudaDeviceSynchronize();
}
...
0
0
1
2
2
3

So clearly something bad is happening in the translation of thrust::device to the equivalent of thrust::seq/thrust::generic.

Jul 25 '22 22:07 jrhemstad

I'm still baffled by the expectation that thrust::device inside a kernel does not imply CDP.

Looking at the generated PTX, invoking thrust::lower_bound(thrust::device,...) is definitely attempting to use CDP: https://godbolt.org/z/K7fMsPPKK

Jul 25 '22 23:07 jrhemstad

I think the issue is in the usage of cross-system copy_n: https://github.com/NVIDIA/thrust/blob/main/thrust/system/detail/generic/binary_search.inl#L153

For example, for lowerbound<<<1,1>>>(0.5); using thrust::device works if copy_n is replaced by ordinary assignments

  thrust::detail::temporary_array<T,DerivedPolicy>          d_value(exec,1);
  thrust::detail::temporary_array<OutputType,DerivedPolicy> d_output(exec,1);

  (*d_value.begin()) = value;

  // perform the query
  thrust::system::detail::generic::detail::binary_search(exec, begin, end, d_value.begin(), d_value.end(), d_output.begin(), comp, func);

  OutputType output;
  output = *d_output.begin();

  return output;

I also tried to add assert((*d_value.begin()) == value); after the first copy_n. This assertion fails in above test case when thrust::device is used.

Stepping through the code using cuda-gdb seems impossible because of an Illegal Instruction error during execution when compiled with -G

Aug 09 '22 10:08 fkallen

thrust thrust copied to clipboard

Device-side launch of thrust::lower_bound is creating wrong results

thrust
thrust copied to clipboard