thrust
thrust copied to clipboard
CUDA thrust::lower_bound fails when give a custom output iterator and compiled with -G option [NVBug 3322776]
I believe this is a compiler issue since the problem only appears when using the -G option on nvcc. Unfortunately I'm not able to follow the thrust source code here well enough to see where the problem occurs. I've attached a smallish testcase that can reproduce the error consistently.
lb_output_itr.cu source file to reproduce the error
#include <thrust/binary_search.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <iostream>
#include <type_traits>
#include <vector>
// build failing one using:
// nvcc -std=c++17 -g -G lb_output_itr.cu -o lb_output_itr
//
// build non-failing one by just removing the -G option
struct output_indexalator {
using difference_type = ptrdiff_t;
using value_type = int;
using pointer = int *;
using iterator_category = std::random_access_iterator_tag;
using reference = output_indexalator const &;
output_indexalator() = default;
output_indexalator(output_indexalator const &) = default;
output_indexalator(output_indexalator &&) = default;
output_indexalator &operator=(output_indexalator const &) = default;
output_indexalator &operator=(output_indexalator &&) = default;
__host__ __device__ output_indexalator &operator+=(difference_type offset)
{
p_ += offset * width_;
return *this;
}
__host__ __device__ output_indexalator operator+(difference_type offset) const
{
auto tmp = output_indexalator(*this);
tmp.p_ += (offset * width_);
return tmp;
}
__device__ output_indexalator const operator[](int idx) const
{
output_indexalator tmp{*this};
tmp.p_ += (idx * width_);
return tmp;
}
__device__ output_indexalator const &operator*() const
{
printf("%p: op*() %p\n", this, p_);
return *this;
}
__device__ output_indexalator const &operator=(int const value) const
{
void *tp = p_;
printf("%p: set(%p,%d)\n", this, tp, value); // class data is garbage
// crashes here:
(*static_cast<int *>(tp)) = static_cast<int>(value);
return *this;
}
output_indexalator(void *data, int width) : width_(width), p_{static_cast<char *>(data)} {}
int width_; /// integer type width = 1,2,4, or 8
char *p_; /// pointer to the integer data in device memory
};
template <typename T>
thrust::device_vector<T> make_device_vector(std::vector<T> data)
{
thrust::host_vector<T> h_data{data.begin(), data.end()};
thrust::device_vector<T> d_data{h_data};
return d_data;
}
int main(void)
{
std::vector<int> input{0, 1, 2, 3, 4};
auto d_input = make_device_vector(input);
std::vector<int> values{9, 1, 0, 2};
auto d_values = make_device_vector(values);
thrust::device_vector<int> output(d_values.size());
auto itr = output_indexalator(output.data().get(), sizeof(int));
thrust::lower_bound(thrust::device,
d_input.begin(),
d_input.end(),
d_values.begin(),
d_values.end(),
itr, // output.begin(),
thrust::less<long>());
thrust::host_vector<int> h_output(output);
for (auto v : h_output) std::cout << " " << v;
std::cout << std::endl;
return 0;
}
Compile the source file using the following command:
nvcc -std=c++17 -g -G lb_output_itr.cu -o lb_output_itr
Running the resulting lb_output_itr
executable gives the following result:
$ ./lb_output_itr
0x7f65eefffbd0: op*() 0x7f65b7e00400
0x7f65eefffbd0: op*() 0x7f65b7e00404
0x7f65eefffbd0: op*() 0x7f65b7e00408
0x7f65eefffbd0: op*() 0x7f65b7e0040c
0x7f65eefffbd0: set(0x400000000,5)
0x7f65eefffbd0: set(0x400000000,1)
0x7f65eefffbd0: set(0x400000000,0)
0x7f65eefffbd0: set(0x400000000,2)
terminate called after throwing an instance of 'thrust::system::system_error'
what(): for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted
The 0x400000000
should be the device pointer but the iterator object is getting trashed somewhere.
Building without the -G
option will produce the correct result:
$ nvcc -std=c++17 -g -G lb_output_itr.cu -o lb_output_itr
$ ./lb_output_itr
0x7f745cfffcc8: op*() 0x7f7427e00400
0x7f745cfffcc8: op*() 0x7f7427e00404
0x7f745cfffcc8: op*() 0x7f7427e00408
0x7f745cfffcc8: op*() 0x7f7427e0040c
0x7f745cfffcc8: set(0x7f7427e00400,5)
0x7f745cfffcc8: set(0x7f7427e00404,1)
0x7f745cfffcc8: set(0x7f7427e00408,0)
0x7f745cfffcc8: set(0x7f7427e0040c,2)
5 1 0 2
The output_indexalator
iterator being used here is a simplified version from a much larger set of code and has been pared down to provide a minimal reproducer for this issue.
I've verified the error occurs with the same results on my Linux 18.04 system with the following nvcc compiler versions (and associated thrust versions): V11.0.221, V11.1.105, V11.2.142, and V11.3.109
Took a quick look at this in the debugger it appears that memory is getting corrupt -- the parallel_for
algorithm's tile_base
is consistently set to 0, but is a very large negative value by the time the crash happens. I haven't found a reason for this, and it's odd that it's only in debugging mode.
@davidwendt Has the RAPIDS team filed an nvcc bug for this?
No, not yet. I wanted to get some help from you on creating the details for nvcc bug since I don't know what is happening. Do you think the information here is enough to open an nvcc bug?
Ok, just wanted to make sure. I'm planning to spend a bit more time looking at this in case it is in our libraries before we escalate.
I spent a couple more hours looking into this, and things seem to go off the rails around this line. The lhs of the assignment (thrust::get<1>(t)
) is producing a null reference to an output_indexalator
, but only in debug mode -- it is a valid object otherwise. This is why the class data is invalid during the output_indexalator::operator=
method.
This tuple is produced from a zip_iterator
of values_begin
and output
here.
I can't see anything going wrong in the source code, so this does seem like a compiler bug. I've filed NVBug 3322776 to have the compiler folks check it out.
This is a tricky one. The root issue is that for output_indexalator
, which is a random access iterator, operator*
returns by reference but operator[]
returns by value. This is allowed by the C++17 iterator concepts, but the Thrust for_each[_n]
algorithms are not expecting this.
template <class Input, class UnaryOp>
struct for_each_f
{
Input input;
UnaryOp op;
THRUST_FUNCTION
for_each_f(Input input, UnaryOp op)
: input(input), op(op) {}
template <class Size>
THRUST_DEVICE_FUNCTION void operator()(Size idx)
{
op(raw_reference_cast(input[idx])); // HERE
}
};
This has a bad interaction with zip_iterator
's dereference_iterator
callable leading to a function returning a reference to a local.
The fix is quite simple: Thrust should avoid using operator[]
on random access iterators. This eliminates the crash:
template <class Size>
THRUST_DEVICE_FUNCTION void operator()(Size idx)
{
op(raw_reference_cast(*(input + idx)));
}