thrust icon indicating copy to clipboard operation
thrust copied to clipboard

Device-side launch of thrust::lower_bound is creating wrong results

Open zasdfgbnm opened this issue 2 years ago • 10 comments

#include <thrust/binary_search.h>
#include <thrust/execution_policy.h>
#include <stdio.h>

__global__ void lowerbound(float inp_val) {
    constexpr int size = 6;
    float a[size] = {0.1, 0.2, 0.4, 0.6, 0.8, 1.};
    auto result = thrust::lower_bound(
        thrust::device, a, a + size, inp_val);
    printf("%ld\n", result - a);
}

int main() {
    lowerbound<<<1,1>>>(0.0);
    lowerbound<<<1,1>>>(0.1);
    lowerbound<<<1,1>>>(0.2);
    lowerbound<<<1,1>>>(0.3);
    lowerbound<<<1,1>>>(0.4);
    lowerbound<<<1,1>>>(0.5);
    cudaDeviceSynchronize();
}

I get

0
0
0
0
0
0

on CUDA 11.7 with the latest thrust

zasdfgbnm avatar Jun 30 '22 19:06 zasdfgbnm

Related customer issue: https://github.com/pytorch/pytorch/pull/80714

zasdfgbnm avatar Jun 30 '22 19:06 zasdfgbnm

This is likely the same issues as #1415.

alliepiper avatar Jun 30 '22 20:06 alliepiper

Wait, is the thrust::lower_bound here a CDP launch? From nsys nvprof, I only see one kernel, which is the one I wrote:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)        Name       
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  -----------------
    100.0          347,426          6  57,904.3  56,032.0    54,560    69,057      5,541.6  lowerbound(float)

zasdfgbnm avatar Jun 30 '22 22:06 zasdfgbnm

@zasdfgbnm CDP kernels don't show up in profiles

gevtushenko avatar Jul 01 '22 09:07 gevtushenko

Is this thrust regression, compiler regression or CDP regression? If it's compiler regression it might show up in other circumstances, even if users are not calling thrust::lower_bound.

ngimel avatar Jul 06 '22 17:07 ngimel

This is definitely not a cdp launch

  1. when compiling, rdc=true -lcudadevrt isn't required
  2. nsys shows all 6 kernels launched, whereas when cdp is used, not only does it not show the kernels that contain cdp launches, it doesn't show all subsequent regular kernels. So either thrust source changed leading to wrong results, or compiler regressed (not related to cdp).

Also, since it's not cdp, thrust shouldn't be deprecating it (?) and thus it should be fixed, as silent wrong results are the worst.

ngimel avatar Jul 10 '22 22:07 ngimel

Updated the title (removed CDP launch info), prioritizing for the 2.1 milestone.

alliepiper avatar Jul 25 '22 22:07 alliepiper

In an offline discussion with @allisonvacanti I was surprised to learn that thrust::device is expected to work in device code when relocatable device code (rdc) is not enabled.

In this scenario, it is expected that the algorithm will behave equivalently to thrust::seq.

I verified that explicitly using thrust::seq in @zasdfgbnm's reproducer produces the correct results:

#include <thrust/binary_search.h>
#include <thrust/execution_policy.h>
#include <stdio.h>
#include <type_traits>

__global__ void lowerbound(float inp_val) {
    constexpr int size = 6;
    float a[size] = {0.1, 0.2, 0.4, 0.6, 0.8, 1.};
    auto result = thrust::lower_bound(thrust::seq, a, a + size, inp_val);
    printf("%ld\n", result - a);
}

int main() {
    lowerbound<<<1,1>>>(0.0);
    lowerbound<<<1,1>>>(0.1);
    lowerbound<<<1,1>>>(0.2);
    lowerbound<<<1,1>>>(0.3);
    lowerbound<<<1,1>>>(0.4);
    lowerbound<<<1,1>>>(0.5);
    cudaDeviceSynchronize();
}
...
0
0
1
2
2
3

So clearly something bad is happening in the translation of thrust::device to the equivalent of thrust::seq/thrust::generic.

jrhemstad avatar Jul 25 '22 22:07 jrhemstad

I'm still baffled by the expectation that thrust::device inside a kernel does not imply CDP.

Looking at the generated PTX, invoking thrust::lower_bound(thrust::device,...) is definitely attempting to use CDP: https://godbolt.org/z/K7fMsPPKK

jrhemstad avatar Jul 25 '22 23:07 jrhemstad

I think the issue is in the usage of cross-system copy_n: https://github.com/NVIDIA/thrust/blob/main/thrust/system/detail/generic/binary_search.inl#L153

For example, for lowerbound<<<1,1>>>(0.5); using thrust::device works if copy_n is replaced by ordinary assignments

  thrust::detail::temporary_array<T,DerivedPolicy>          d_value(exec,1);
  thrust::detail::temporary_array<OutputType,DerivedPolicy> d_output(exec,1);

  (*d_value.begin()) = value;

  // perform the query
  thrust::system::detail::generic::detail::binary_search(exec, begin, end, d_value.begin(), d_value.end(), d_output.begin(), comp, func);

  OutputType output;
  output = *d_output.begin();

  return output;

I also tried to add assert((*d_value.begin()) == value); after the first copy_n. This assertion fails in above test case when thrust::device is used.

Stepping through the code using cuda-gdb seems impossible because of an Illegal Instruction error during execution when compiled with -G

fkallen avatar Aug 09 '22 10:08 fkallen