Invalid CUDA code: host computation on device buffer after GPU transformations

Open edopao opened this issue 1 year ago • 0 comments

The SDFG below produces invalid CUDA code after applying apply_gpu_transformations:

The problem is that the tlet_1_scalar_expr nodes in the second nested level result in a symbolic expression that is computed in host code, while its result is written to a device buffer in GPU global memory. This causes a segmentation fault when the SDFG is called.

Generated host code:

int * __tmp3;
DACE_GPU_CHECK(cudaMalloc((void**)&__tmp3, __out_size_0 * sizeof(int)));

int __tmp0;
__tmp0 = (__tmp1 * __tmp1);

This SDFG can be reproduced from the GT4Py test case: tests/next_tests/integration_tests/feature_tests/ffront_tests/test_execution.py::test_double_use_scalar

The current workaround is to run the simplify pass before calling apply_gpu_transformations, so that InlineSDFGs will bring the SDFG to a canonical form (see https://github.com/GridTools/gt4py/pull/1741).
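For reference, the workaround ordering can be sketched roughly as follows. This is only a minimal illustration, assuming `sdfg` is a `dace.SDFG` object; the helper name `lower_to_gpu` is hypothetical and is not taken from the GT4Py PR.

```python
def lower_to_gpu(sdfg):
    """Simplify the SDFG first, then apply the GPU transformations.

    Assumption: `sdfg` is a dace.SDFG (or any object exposing the same two
    methods). The helper name is hypothetical and used for illustration only.
    """
    # Run the simplify pass first; per the workaround above, InlineSDFGs
    # then brings the nested SDFGs to a canonical form.
    sdfg.simplify()
    # Only afterwards move the SDFG to the GPU.
    sdfg.apply_gpu_transformations()
    return sdfg
```

The only point of the sketch is the call order: simplification has to happen before the GPU transformations, not after.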

Nov 18 '24 15:11 edopao