
Problem with cub::DeviceReduce::Sum and integer addition

Open · classner opened this issue on Sep 27, 2019 · 5 comments

Hi,

I am observing a problem with the cub library for sum reduction with standard datatypes.

Summary

A sum reduction using cub::DeviceReduce::Sum for integers causes PyTorch code to crash at a later point in time, but none of the compile-time or runtime checks shows a problem beforehand. When using cub::DeviceReduce::Reduce with a standard templated sum functor, everything works as expected.
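For reference, these are the two entry points being compared; the call forms below match cub 1.8.0 with the stream and debug_synchronous arguments left at their defaults (a sketch, not the full reproducer that follows):

// Convenience wrapper: sums the input using cub's built-in addition.
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

// Generic form: the caller supplies the binary operator and the initial value.
cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, add_op, 0);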

Code to reproduce

This is the code I use to reproduce the bug:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

// Error check taken from
// https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
// for a quick example.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

  // Pretty much the example from here:
  // https://nvlabs.github.io/cub/structcub_1_1_device_reduce.html#ab7f21e8255eb842aaf74305975ae607f
  int num_items = 7;  // e.g., 7
  int *d_in = NULL;   // e.g., [8, 6, 7, 5, 3, 0, 9]
  int *d_out = NULL;  // e.g., [-]
  int in[7] = {8, 6, 7, 5, 3, 0, 9};
  cudaMalloc((void **)&d_in, sizeof(int) * 7);
  cudaMalloc((void **)&d_out, sizeof(int));
  cudaMemcpy(d_in, in, sizeof(int) * 7, cudaMemcpyHostToDevice);
  // Just to see whether the kernel actually writes to this position.
  int out_init = 123;
  cudaMemcpy(d_out, &out_init, sizeof(int), cudaMemcpyHostToDevice);
  // Make sure everything went well so far.
  gpuErrchk(cudaGetLastError());
  void *d_temp_storage = NULL;
  size_t temp_storage_bytes = 0;
  // First call only computes the required temporary storage size
  // (with `debug_synchronous` enabled).
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, 7, 0,
                         true);
  // Allocate temporary storage
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  // Run sum-reduction
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, 7, 0,
                         true);
  cudaDeviceSynchronize();
  gpuErrchk(cudaGetLastError());
  int sum;
  cudaMemcpy(&sum, d_out, sizeof(int), cudaMemcpyDeviceToHost);
  printf("sum: %d.\n", sum);

This produces the result "123" for me, i.e., d_out is never written. What's more, none of the error checks reports a problem (even though debug_synchronous is set to true)! The output is:

Invoking DeviceReduceSingleTileKernel<<<1, 256, 0, 0>>>(), 16 items per thread
sum: 123.

The program continues to run and then crashes later, after the next PyTorch error check (presumably after the next cub call, though I'm not sure), like so:

RuntimeError: cuda runtime error (9) : invalid configuration argument at /pytorch/aten/src/THC/generic/THCTensorMath.cu:35

Using a custom addition

This does not happen with a custom addition operator. If I instead use the generic reduction with a self-defined sum functor:

struct CustomAdd{
    template <typename T>
    __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return a + b;
    }
};

// ...

  CustomAdd add_op;
  cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, add_op, 0, 0, true);
  // Allocate temporary storage
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  // Run sum-reduction
  cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, add_op, 0, 0, true);

I'm getting the expected output:

Invoking DeviceReduceSingleTileKernel<<<1, 256, 0, 0>>>(), 16 items per thread
sum: 38.
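For completeness, a further isolation step would be to pass cub's own addition functor to the generic Reduce entry point; if that also yields 38, the problem would be specific to the Sum convenience wrapper rather than to the operator itself. A sketch, assuming cub::Sum (pulled in via cub/cub.cuh) and the same buffers and two-phase call pattern as above:

  // Same two-phase pattern as before, but with cub's built-in addition functor.
  cub::Sum cub_add;
  cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes, d_in, d_out,
                            num_items, cub_add, 0, 0, true);
  // Allocate temporary storage, then run the reduction.
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceReduce::Reduce(d_temp_storage, temp_storage_bytes, d_in, d_out,
                            num_items, cub_add, 0, 0, true);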

System setup

I'm using CUDA 10.1 with cub 1.8.0 and clang 8 as the host compiler. I only observe this problem when compiling without optimizations.

What next?

This was super hard to track down because none of the error-checking routines turned up anything. I'm very curious about your feedback, insights, and reproducibility with other configurations, and would be happy to provide more information as required.

classner · Sep 27, 2019

So, the example you have above works properly, right?

Do you have a self-contained reproducer that I can use to manifest your error? (source + compile string + GPU model used)

FWIW, in 99.9% of cases like this (viz. “CUB reports an error or gets the wrong answer, but it happens only in the context of my larger application”), it’s actually the symptom of some other latent bug occurring earlier in the application.

BTW, cudaGetLastError() isn't sufficient to flush all error codes from the CUDA Runtime. Rather, you will want to do the following where you check "if everything went well":

// Check for failure to launch
error = cudaPeekAtLastError();

// Flush asynchronous errors
error = cudaDeviceSynchronize();

Try that and see if you suss out any issues from other kernels or memory allocations occurring prior to the CUB reduction call.

Best,

Duane

dumerrill · Sep 27, 2019

Hi Duane,

thanks for your quick reply!

So, the example you have above works properly, right?

Do you have a self-contained reproducer that I can use to manifest your error? (source + compile string + GPU model used)

The very example I pasted here is what breaks for me! I isolated the issue specifically so I could provide you with a minimal failing example, namely the one above.

FWIW, in 99.9% of cases like this (viz. “CUB reports an error or gets the wrong answer, but it happens only in the context of my larger application”), it’s actually the symptom of some other latent bug occurring earlier in the application.

I'm very aware of that, and CUB is the last place I would have suspected as the cause of my problem; this is simply what I found after spending hours debugging. I tried my best to isolate the issue. In the end, I had exactly the code I pasted above in one function. It allocates its own memory, completely isolated from the rest of my application. Let me know if there is anything else I can do to narrow the problem down further.

Try that and see if you suss any issues from other kernels or memory allocations occurring prior to the CUB reduction call.

Done that; there's nothing hidden from before. I deliberately inserted the

cudaDeviceSynchronize();
gpuErrchk(cudaGetLastError());

in the code to make sure I'm retrieving all asynchronous errors from the GPU. That should be sufficient, right?
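For reference, an equivalent variant reuses the macro around both calls; since cudaDeviceSynchronize() itself returns any pending asynchronous error, checking its return value covers the same ground (just a sketch):

gpuErrchk(cudaPeekAtLastError());    // launch-configuration errors
gpuErrchk(cudaDeviceSynchronize());  // asynchronous execution errors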

classner · Sep 27, 2019

Sorry, I didn’t see “main()” and you made reference to Pytorch. If this is the entire program, can you share the nvcc commandline you are using to compile, and the specific GPU you are using? I’d need that to reproduce.

dumerrill · Sep 28, 2019

Err, clang command line.

dumerrill · Sep 28, 2019

Also ran into this just now. I was building against the system CUDA 10.2 and installed PyTorch from PyPI, which is built against 10.1 and comes with the required libraries vendored in the wheel (i.e., it ships its own libcudart.so.10.1).

I run DeviceReduce::Sum on a 256x3x224x224-element tensor with different dtypes (uint8, int32, float) and compare the result against PyTorch's sum.
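The cub side of that comparison is just the standard two-phase pattern; a sketch with illustrative names, assuming a contiguous uint8 device buffer and a 64-bit output to hold the sum:

const int n = 256 * 3 * 224 * 224;  // number of elements in the test tensor
uint8_t *d_in;                      // device pointer to the tensor data (assumed allocated and filled)
int64_t *d_out;                     // device pointer for the result (assumed allocated)

void *d_temp = NULL;
size_t temp_bytes = 0;
cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);  // query temp storage size
cudaMalloc(&d_temp, temp_bytes);
cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);  // run the reduction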

Compiled with the release config, everything works as expected.

Compiled with the debug config, PyTorch encounters CUDA error: invalid configuration argument, followed by a segfault.

While using multiple versions of CUDA in the same process doesn't sound like a supported configuration, it is still odd that whether it works depends on how my module is built. Maybe that's just luck, though. I installed CUDA 10.1 instead and it's working now.

jfolz · Feb 5, 2020

Hey @classner @jfolz is this still an issue you're running into?

jrhemstad · Feb 23, 2023

I'm not working on this right now, so I'm not sure what the status is with current CUDA and CUB versions.

classner · Feb 23, 2023

Same here. The issue can be closed from my point of view, as all the software involved has changed dramatically since then.

jfolz · Feb 24, 2023