Incorrect GPU codegen with multiple different types stored in memory

Open jrprice opened this issue 2 years ago • 0 comments

The GPU code generated for the test below produces incorrect results on the OpenCL and Metal backends (and the WIP WebGPU backend), and possibly on others as well. The test is derived from correctness/gpu_mixed_shared_mem_types, but uses the same MemoryType for all three buffers. Changing the memory type of every buffer from Heap to GPUShared also produces incorrect results, whereas alternating between Heap and GPUShared produces correct results (which is what gpu_mixed_shared_mem_types does). Using Stack instead also avoids the issue, as does using the same data type for all three buffers.

Running the test with the OpenCL backend under Oclgrind reveals both out-of-bounds accesses and data races.

This may be related to #4967, but I'm not familiar enough with this stuff to say for sure.

#include "Halide.h"
#include <stdio.h>

using namespace Halide;

int main(int argc, char **argv) {
    Var x("x"), xi("xi");

    Func out("out");

    Func a, b, c;
    a(x) = cast(UInt(8), x);
    b(x) = cast(UInt(16), x);
    c(x) = cast(UInt(8), x);

    a.compute_at(out, x).store_in(MemoryType::Heap).gpu_threads(x);
    b.compute_at(out, x).store_in(MemoryType::Heap).gpu_threads(x);
    c.compute_at(out, x).store_in(MemoryType::Heap).gpu_threads(x);

    out(x) = cast(UInt(32), a(x) + b(x) + c(x));
    out.gpu_tile(x, xi, 1);

    Buffer<uint32_t> output = out.realize({10});
    for (int x = 0; x < output.width(); x++) {
        uint32_t ref = 3 * x;
        if (output(x) != ref) {
            printf("FAILED: output(%d) = %u != %u\n", x, output(x), ref);
            return 1;
        }
    }
    printf("Success!\n");
    return 0;
}
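
For anyone checking the reference value `3 * x` used in the test above, here is a minimal CPU-only sketch that mirrors the casts in the pipeline without depending on Halide. The helper names `reference_out` and `reference_output` are hypothetical and exist only for illustration. The sketch assumes the mixed 8-bit/16-bit sum is evaluated at 16 bits before widening to 32 bits; for the ten values the test realizes nothing overflows, so the expected output is the same either way.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Halide): computes on the CPU the value
// the test expects at coordinate x, following the same casts as the pipeline.
uint32_t reference_out(int x) {
    uint8_t a = static_cast<uint8_t>(x);    // a(x) = cast(UInt(8), x)
    uint16_t b = static_cast<uint16_t>(x);  // b(x) = cast(UInt(16), x)
    uint8_t c = static_cast<uint8_t>(x);    // c(x) = cast(UInt(8), x)
    // Assumption: the sum is held in 16 bits before the final widening cast.
    // For x in [0, 10) the sum is at most 27, so the width does not matter.
    uint16_t sum = static_cast<uint16_t>(a + b + c);
    return static_cast<uint32_t>(sum);      // out(x) = cast(UInt(32), ...)
}

// Hypothetical helper: builds the whole expected buffer for a given width.
std::vector<uint32_t> reference_output(int width) {
    std::vector<uint32_t> result;
    result.reserve(width);
    for (int x = 0; x < width; x++) {
        result.push_back(reference_out(x));
    }
    return result;
}
```

Calling `reference_output(10)` gives the values that the realized `output` buffer should hold, which can help when comparing against the incorrect results from a GPU backend.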

jrprice, Jun 06 '22 17:06