
CUDA: Getting zero gradients

Open · roastduck opened this issue 2 years ago · 7 comments

I am trying Enzyme for CUDA, but I am getting zero gradients. Specifically, for y = f(x), I get ∇x = 0 while ∇y is left at its original value. It seems the gradient is not propagated at all. This only happens on CUDA; it works fine on the CPU.

To reproduce, you can use the following code, which is a modified version of the code in the guide.

#include <stdio.h>

#define n (5856)
#define m (65536)

void __device__ foo_impl(double *faces_ptr, double *soft_colors_ptr) {
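// Indexing helpers: faces_ptr holds an (n, 3, 3) tensor and soft_colors_ptr a
// (3, m) tensor, both row-major.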
#define face(fn, a, b) (faces_ptr[(fn)*3 * 3 + (a)*3 + (b)])
#define soft_colors(k, pn) (soft_colors_ptr[(k)*m + (pn)])
    for (int pn = 0; pn < m; pn++) {
        double soft_color[3] = {0, 0, 0};
        for (int fn = 0; fn < n; fn++) {
            double w_clip[3] = {1, 1, 1};

            // CORRECT
            /*double zp = w_clip[0] * face(fn, 0, 2) +
                        w_clip[1] * face(fn, 1, 2) + w_clip[2] * face(fn, 2,
               2);*/

            // WRONG
            double zp =
                1. / (w_clip[0] / face(fn, 0, 2) + w_clip[1] / face(fn, 1, 2) +
                      w_clip[2] / face(fn, 2, 2));

            for (int k = 0; k < 3; k++) {
                soft_color[k] += zp;
            }
        }
        for (int k = 0; k < 3; k++) {
            soft_colors(k, pn) = soft_color[k];
        }
    }
#undef face
#undef soft_colors
}

typedef void (*f_ptr)(double *, double *);

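// Enzyme's autodiff entry point: the Enzyme plugin replaces calls to this
// declaration with generated gradient code.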
extern void __device__ __enzyme_autodiff(f_ptr, int, double *, double *, int,
                                         double *, double *);

void __global__ foo(double *x_in, double *x_out) { foo_impl(x_in, x_out); }

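// Activity annotations recognized by Enzyme (enzyme_dup marks a primal/shadow
// pointer pair, enzyme_const an argument not differentiated).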
int __device__ enzyme_dup;
int __device__ enzyme_out;
int __device__ enzyme_const;

void __global__ foo_grad(double *x, double *d_x, double *y, double *d_y) {
    __enzyme_autodiff(foo_impl, enzyme_dup, x, d_x, enzyme_dup, y, d_y);
}

int main() {
    double *x, *d_x, *y, *d_y;  // device pointers

    cudaMalloc(&x, n * 3 * 3 * sizeof(*x));
    cudaMalloc(&d_x, n * 3 * 3 * sizeof(*d_x));
    cudaMalloc(&y, 3 * m * sizeof(*y));
    cudaMalloc(&d_y, 3 * m * sizeof(*d_y));

    double *host_x = new double[n * 3 * 3];
    double *host_d_x = new double[n * 3 * 3];
    double *host_y = new double[3 * m];
    double *host_d_y = new double[3 * m];
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < 3; j++) {
            for (int k = 0; k < 3; k++) {
                host_x[i * 3 * 3 + j * 3 + k] = 1;
                host_d_x[i * 3 * 3 + j * 3 + k] = 0;
            }
        }
    }
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < m; j++) {
            host_y[i * m + j] = 0;
            host_d_y[i * m + j] = 1;
        }
    }

    cudaMemcpy(x, host_x, n * 3 * 3 * sizeof(*x), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, host_d_x, n * 3 * 3 * sizeof(*d_x), cudaMemcpyHostToDevice);
    cudaMemcpy(y, host_y, 3 * m * sizeof(*y), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, host_d_y, 3 * m * sizeof(*d_y), cudaMemcpyHostToDevice);

    // foo<<<1, 1>>>(x, y);  // forward pass only
    foo_grad<<<1, 1>>>(x, d_x, y, d_y);  // forward and backward pass

    cudaDeviceSynchronize();  // wait for the kernel to finish

    cudaMemcpy(host_x, x, n * 3 * 3 * sizeof(*x), cudaMemcpyDeviceToHost);
    cudaMemcpy(host_d_x, d_x, n * 3 * 3 * sizeof(*d_x), cudaMemcpyDeviceToHost);
    cudaMemcpy(host_y, y, 3 * m * sizeof(*y), cudaMemcpyDeviceToHost);
    cudaMemcpy(host_d_y, d_y, 3 * m * sizeof(*d_y), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < 3; j++) {
            for (int k = 0; k < 3; k++) {
                printf("x(%d, %d, %d) = %f\n", i, j, k,
                       host_x[i * 3 * 3 + j * 3 + k]);
            }
        }
    }
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < m; j++) {
            printf("y(%d, %d) = %f\n", i, j, host_y[i * m + j]);
        }
    }
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < 3; j++) {
            for (int k = 0; k < 3; k++) {
                printf("d_x(%d, %d, %d) = %f\n", i, j, k,
                       host_d_x[i * 3 * 3 + j * 3 + k]);
            }
        }
    }
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < m; j++) {
            printf("d_y(%d, %d) = %f\n", i, j, host_d_y[i * m + j]);
        }
    }
}

Compile and run it with the following commands:

/path/to/clang test.cu -Xclang -load -Xclang /path/to/Enzyme/enzyme/build/Enzyme/ClangEnzyme-12.so -O2 -fno-vectorize -fno-unroll-loops -fPIC --cuda-gpu-arch=sm_70 -lcudart -L/path/to/cuda-10.1.243/lib64 -std=c++17 -lstdc++
./a.out

You will see that host_d_x is all zeros, while host_d_y is left as 1.

I have checked that there are no CUDA errors by running it under cuda-memcheck.

You may also try commenting out the WRONG line and uncommenting the CORRECT line. This makes the gradient flow again, and host_d_x is no longer 0. However, this doesn't mean the WRONG line is the root cause: many other changes to the code also make the gradients flow again, including reducing n or m. I suspect expression folding or other LLVM optimizations are hiding the real cause. Another guess is an overflow of 32-bit integers somewhere in the differentiated program.

Tested with Enzyme 0.0.51, Clang 12.0.1, CUDA 10.1.243.

roastduck · Apr 08 '23 07:04

Performance note: you can (and should) remove -fno-vectorize -fno-unroll-loops when using ClangEnzyme.
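E.g., the command from the report with just those two flags dropped (everything else unchanged):

/path/to/clang test.cu -Xclang -load -Xclang /path/to/Enzyme/enzyme/build/Enzyme/ClangEnzyme-12.so -O2 -fPIC --cuda-gpu-arch=sm_70 -lcudart -L/path/to/cuda-10.1.243/lib64 -std=c++17 -lstdc++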

Regardless, investigating.

wsmoses · Apr 12 '23 02:04

> Performance note: you can (and should) remove -fno-vectorize -fno-unroll-loops when using ClangEnzyme.
>
> Regardless, investigating.

Please update the guide https://enzyme.mit.edu/getting_started/CUDAGuide/.

roastduck · Apr 12 '23 02:04

Indeed and in progress, see https://github.com/EnzymeAD/www/pull/14

wsmoses · Apr 12 '23 03:04

This does appear to be a memory error. Specifically, because Enzyme thinks your input and output could overlap, it is caching each variable -- taking gigabytes per thread -- and thus clearly OOMing.
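(Rough back-of-the-envelope, assuming one cached double per (pn, fn) inner-loop iteration: m * n * sizeof(double) = 65536 * 5856 * 8 bytes ≈ 3.07 GB per cached value per thread -- a size that, incidentally, also overflows a 32-bit signed integer.)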

Marking the two function inputs as restrict removes the caching, but something weird is still going on.
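E.g., a minimal sketch of that change (signature only, assuming __restrict__ on both pointer parameters; the body stays as in the repro):

void __device__ foo_impl(double *__restrict__ faces_ptr,
                         double *__restrict__ soft_colors_ptr) {
    // ... body unchanged from the original reproduction ...
}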

wsmoses · Apr 12 '23 15:04

> This does appear to be a memory error. Specifically, because Enzyme thinks your input and output could overlap, it is caching each variable -- taking gigabytes per thread -- and thus clearly OOMing.

I also suspected an OOM, so I checked the execution with cuda-memcheck and found nothing. (And just to confirm: Enzyme would also check for a failed CUDA memory allocation, right?) So I think other errors could be happening even before the OOM. For example, with 32-bit integers, an address or size computation can overflow for a gigabyte-sized buffer.

roastduck · Apr 12 '23 15:04

Enzyme assumes allocation functions will either succeed or assert (which apparently is not the case here). I'm guessing the allocation itself crashes the kernel before any loads/stores, and perhaps cuda-memcheck doesn't check for that?

But perhaps checking the error code after the kernel launch would?

wsmoses · Apr 12 '23 15:04

Of course, it is best practice to check the error code after each kernel launch. You can catch invalid kernel launch arguments by calling cudaGetLastError right after the launch, and catch runtime errors inside the kernel by checking the error code returned from a following synchronization call. In my experience, both checks are needed.
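For example, applied to the launch in the repro (a minimal sketch; the printf-based handling is just illustrative):

foo_grad<<<1, 1>>>(x, d_x, y, d_y);

// Catches invalid launch configurations or arguments.
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(launchErr));

// Catches errors raised while the kernel was running (e.g. a device-side trap).
cudaError_t syncErr = cudaDeviceSynchronize();
if (syncErr != cudaSuccess)
    printf("runtime error: %s\n", cudaGetErrorString(syncErr));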

But AFAIK, cuda-memcheck not only checks each load/store, but also checks the error code from each kernel launch or CUDA API call.

roastduck · Apr 12 '23 15:04