thrust icon indicating copy to clipboard operation
thrust copied to clipboard

CUDA thrust::lower_bound fails when give a custom output iterator and compiled with -G option [NVBug 3322776]

Open davidwendt opened this issue 3 years ago • 4 comments

I believe this is a compiler issue since the problem only appears when using the -G option on nvcc. Unfortunately I'm not able to follow the thrust source code here well enough to see where the problem occurs. I've attached a smallish testcase that can reproduce the error consistently.

lb_output_itr.cu source file to reproduce the error
#include <thrust/binary_search.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <iostream>
#include <type_traits>
#include <vector>

// build failing one using:
//   nvcc -std=c++17 -g -G lb_output_itr.cu -o lb_output_itr
//
// build non-failing one by just removing the -G option

struct output_indexalator {
  using difference_type   = ptrdiff_t;
  using value_type        = int;
  using pointer           = int *;
  using iterator_category = std::random_access_iterator_tag;
  using reference         = output_indexalator const &;

  output_indexalator()                           = default;
  output_indexalator(output_indexalator const &) = default;
  output_indexalator(output_indexalator &&)      = default;
  output_indexalator &operator=(output_indexalator const &) = default;
  output_indexalator &operator=(output_indexalator &&) = default;

  __host__ __device__ output_indexalator &operator+=(difference_type offset)
  {
    p_ += offset * width_;
    return *this;
  }

  __host__ __device__ output_indexalator operator+(difference_type offset) const
  {
    auto tmp = output_indexalator(*this);
    tmp.p_ += (offset * width_);
    return tmp;
  }

  __device__ output_indexalator const operator[](int idx) const
  {
    output_indexalator tmp{*this};
    tmp.p_ += (idx * width_);
    return tmp;
  }

  __device__ output_indexalator const &operator*() const
  {
    printf("%p: op*() %p\n", this, p_);
    return *this;
  }

  __device__ output_indexalator const &operator=(int const value) const
  {
    void *tp = p_;
    printf("%p: set(%p,%d)\n", this, tp, value);  // class data is garbage
    // crashes here:
    (*static_cast<int *>(tp)) = static_cast<int>(value);
    return *this;
  }

  output_indexalator(void *data, int width) : width_(width), p_{static_cast<char *>(data)} {}

  int width_;  /// integer type width = 1,2,4, or 8
  char *p_;    /// pointer to the integer data in device memory
};

template <typename T>
thrust::device_vector<T> make_device_vector(std::vector<T> data)
{
  thrust::host_vector<T> h_data{data.begin(), data.end()};
  thrust::device_vector<T> d_data{h_data};
  return d_data;
}

int main(void)
{
  std::vector<int> input{0, 1, 2, 3, 4};
  auto d_input = make_device_vector(input);
  std::vector<int> values{9, 1, 0, 2};
  auto d_values = make_device_vector(values);

  thrust::device_vector<int> output(d_values.size());
  auto itr = output_indexalator(output.data().get(), sizeof(int));

  thrust::lower_bound(thrust::device,
                      d_input.begin(),
                      d_input.end(),
                      d_values.begin(),
                      d_values.end(),
                      itr,  // output.begin(),
                      thrust::less<long>());

  thrust::host_vector<int> h_output(output);
  for (auto v : h_output) std::cout << " " << v;
  std::cout << std::endl;
  return 0;
}

Compile the source file using the following command:

nvcc -std=c++17 -g -G lb_output_itr.cu -o lb_output_itr

Running the resulting lb_output_itr executable gives the following result:

$ ./lb_output_itr 
0x7f65eefffbd0: op*() 0x7f65b7e00400
0x7f65eefffbd0: op*() 0x7f65b7e00404
0x7f65eefffbd0: op*() 0x7f65b7e00408
0x7f65eefffbd0: op*() 0x7f65b7e0040c
0x7f65eefffbd0: set(0x400000000,5)
0x7f65eefffbd0: set(0x400000000,1)
0x7f65eefffbd0: set(0x400000000,0)
0x7f65eefffbd0: set(0x400000000,2)
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted

The 0x400000000 should be the device pointer but the iterator object is getting trashed somewhere.

Building without the -G option will produce the correct result:

$ nvcc -std=c++17 -g -G lb_output_itr.cu -o lb_output_itr
$ ./lb_output_itr 
0x7f745cfffcc8: op*() 0x7f7427e00400
0x7f745cfffcc8: op*() 0x7f7427e00404
0x7f745cfffcc8: op*() 0x7f7427e00408
0x7f745cfffcc8: op*() 0x7f7427e0040c
0x7f745cfffcc8: set(0x7f7427e00400,5)
0x7f745cfffcc8: set(0x7f7427e00404,1)
0x7f745cfffcc8: set(0x7f7427e00408,0)
0x7f745cfffcc8: set(0x7f7427e0040c,2)
 5 1 0 2

The output_indexalator iterator being used here is a simplified version from a much larger set of code and has been pared down to provide a minimal reproducer for this issue.

I've verified the error occurs with the same results on my Linux 18.04 system with the following nvcc compiler versions (and associated thrust versions): V11.0.221, V11.1.105, V11.2.142, and V11.3.109

davidwendt avatar Jun 04 '21 21:06 davidwendt

Took a quick look at this in the debugger it appears that memory is getting corrupt -- the parallel_for algorithm's tile_base is consistently set to 0, but is a very large negative value by the time the crash happens. I haven't found a reason for this, and it's odd that it's only in debugging mode.

@davidwendt Has the RAPIDS team filed an nvcc bug for this?

alliepiper avatar Jun 07 '21 14:06 alliepiper

No, not yet. I wanted to get some help from you on creating the details for nvcc bug since I don't know what is happening. Do you think the information here is enough to open an nvcc bug?

davidwendt avatar Jun 07 '21 15:06 davidwendt

Ok, just wanted to make sure. I'm planning to spend a bit more time looking at this in case it is in our libraries before we escalate.

alliepiper avatar Jun 07 '21 16:06 alliepiper

I spent a couple more hours looking into this, and things seem to go off the rails around this line. The lhs of the assignment (thrust::get<1>(t)) is producing a null reference to an output_indexalator, but only in debug mode -- it is a valid object otherwise. This is why the class data is invalid during the output_indexalator::operator= method.

This tuple is produced from a zip_iterator of values_begin and output here.

I can't see anything going wrong in the source code, so this does seem like a compiler bug. I've filed NVBug 3322776 to have the compiler folks check it out.

alliepiper avatar Jun 08 '21 14:06 alliepiper

This is a tricky one. The root issue is that for output_indexalator, which is a random access iterator, operator* returns by reference but operator[] returns by value. This is allowed by the C++17 iterator concepts, but the Thrust for_each[_n] algorithms are not expecting this.

  template <class Input, class UnaryOp>
  struct for_each_f
  {
    Input input;
    UnaryOp op;

    THRUST_FUNCTION
    for_each_f(Input input, UnaryOp op)
        : input(input), op(op) {}

    template <class Size>
    THRUST_DEVICE_FUNCTION void operator()(Size idx)
    {
      op(raw_reference_cast(input[idx])); // HERE
    }
  };

This has a bad interaction with zip_iterator's dereference_iterator callable leading to a function returning a reference to a local.

The fix is quite simple: Thrust should avoid using operator[] on random access iterators. This eliminates the crash:

    template <class Size>
    THRUST_DEVICE_FUNCTION void operator()(Size idx)
    {
      op(raw_reference_cast(*(input + idx)));
    }

ericniebler avatar Mar 08 '23 19:03 ericniebler