Device-side launch of thrust::lower_bound is creating wrong results
#include <thrust/binary_search.h>
#include <thrust/execution_policy.h>
#include <stdio.h>

__global__ void lowerbound(float inp_val) {
  constexpr int size = 6;
  float a[size] = {0.1, 0.2, 0.4, 0.6, 0.8, 1.};
  auto result = thrust::lower_bound(thrust::device, a, a + size, inp_val);
  printf("%ld\n", result - a);
}

int main() {
  lowerbound<<<1,1>>>(0.0);
  lowerbound<<<1,1>>>(0.1);
  lowerbound<<<1,1>>>(0.2);
  lowerbound<<<1,1>>>(0.3);
  lowerbound<<<1,1>>>(0.4);
  lowerbound<<<1,1>>>(0.5);
  cudaDeviceSynchronize();
}
I get
0
0
0
0
0
0
on CUDA 11.7 with the latest Thrust.
Related customer issue: https://github.com/pytorch/pytorch/pull/80714
This is likely the same issue as #1415.
Wait, is the thrust::lower_bound here a CDP launch? From nsys nvprof, I only see one kernel, which is the one I wrote:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- -----------------
100.0 347,426 6 57,904.3 56,032.0 54,560 69,057 5,541.6 lowerbound(float)
@zasdfgbnm CDP kernels don't show up in profiles.
Is this a thrust regression, a compiler regression, or a CDP regression? If it's a compiler regression, it might show up in other circumstances, even when users are not calling thrust::lower_bound.
This is definitely not a CDP launch:
- When compiling, -rdc=true -lcudadevrt aren't required.
- nsys shows all 6 kernels launched, whereas when CDP is used, not only does it not show the kernels that contain CDP launches, it also doesn't show any subsequent regular kernels.
So either the thrust source changed, leading to wrong results, or the compiler regressed (not related to CDP).
Also, since this is not CDP, thrust shouldn't be deprecating it (?), and thus it should be fixed, as silent wrong results are the worst.
Updated the title (removed CDP launch info), prioritizing for the 2.1 milestone.
In an offline discussion with @allisonvacanti, I was surprised to learn that thrust::device is expected to work in device code when relocatable device code (RDC) is not enabled. In this scenario, the algorithm is expected to behave equivalently to thrust::seq.
I verified that explicitly using thrust::seq in @zasdfgbnm's reproducer produces the correct results:
#include <thrust/binary_search.h>
#include <thrust/execution_policy.h>
#include <stdio.h>
#include <type_traits>

__global__ void lowerbound(float inp_val) {
  constexpr int size = 6;
  float a[size] = {0.1, 0.2, 0.4, 0.6, 0.8, 1.};
  auto result = thrust::lower_bound(thrust::seq, a, a + size, inp_val);
  printf("%ld\n", result - a);
}

int main() {
  lowerbound<<<1,1>>>(0.0);
  lowerbound<<<1,1>>>(0.1);
  lowerbound<<<1,1>>>(0.2);
  lowerbound<<<1,1>>>(0.3);
  lowerbound<<<1,1>>>(0.4);
  lowerbound<<<1,1>>>(0.5);
  cudaDeviceSynchronize();
}
...
0
0
1
2
2
3
So clearly something bad is happening in the translation of thrust::device to the equivalent of thrust::seq/thrust::generic.
I'm still baffled by the expectation that thrust::device inside a kernel does not imply CDP.
Looking at the generated PTX, invoking thrust::lower_bound(thrust::device, ...) is definitely attempting to use CDP: https://godbolt.org/z/K7fMsPPKK
I think the issue is in the usage of cross-system copy_n: https://github.com/NVIDIA/thrust/blob/main/thrust/system/detail/generic/binary_search.inl#L153
For example, for lowerbound<<<1,1>>>(0.5);, using thrust::device works if copy_n is replaced by ordinary assignments:
thrust::detail::temporary_array<T, DerivedPolicy> d_value(exec, 1);
thrust::detail::temporary_array<OutputType, DerivedPolicy> d_output(exec, 1);

// stage the query value without the cross-system copy_n
(*d_value.begin()) = value;

// perform the query
thrust::system::detail::generic::detail::binary_search(exec, begin, end, d_value.begin(), d_value.end(), d_output.begin(), comp, func);

// read the result back, again without copy_n
OutputType output;
output = *d_output.begin();
return output;
I also tried adding assert((*d_value.begin()) == value); after the first copy_n. This assertion fails in the above test case when thrust::device is used.
Stepping through the code using cuda-gdb seems impossible because of an Illegal Instruction error during execution when compiled with -G.