Invalid CUDA code: host computation on device buffer after GPU transformations
The SDFG below produces invalid CUDA code after apply_gpu_transformations is applied:
The problem is that the tlet_1_scalar_expr nodes at the second nesting level result in a symbolic expression that is computed in host code, while the result is written to a device buffer in GPU global memory. This causes a segmentation fault when the SDFG is called.
Generated host code:
int * __tmp3;
DACE_GPU_CHECK(cudaMalloc((void**)&__tmp3, __out_size_0 * sizeof(int)));
int __tmp0;
__tmp0 = (__tmp1 * __tmp1);
This SDFG can be reproduced from the GT4Py test case:
tests/next_tests/integration_tests/feature_tests/ffront_tests/test_execution.py::test_double_use_scalar
The current workaround is to run the simplify pass before calling apply_gpu_transformations, so that InlineSDFGs will bring the SDFG to a canonical form (see https://github.com/GridTools/gt4py/pull/1741).
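The ordering used by the workaround can be sketched roughly as follows. This is a minimal illustration, not the code from the linked PR: the helper name prepare_for_gpu is hypothetical, and it assumes the argument is a dace.SDFG (or any object exposing the same simplify and apply_gpu_transformations methods).

```python
def prepare_for_gpu(sdfg):
    """Hypothetical helper: simplify the SDFG before offloading it to the GPU.

    Assumes `sdfg` behaves like a dace.SDFG, i.e. it provides the
    `simplify()` and `apply_gpu_transformations()` methods.
    """
    # Run the simplify pass first, so that nested SDFGs are inlined
    # before any GPU storage or schedule is assigned.
    sdfg.simplify()
    # Only then apply the GPU transformations.
    sdfg.apply_gpu_transformations()
    return sdfg
```

With a real SDFG this would be called as prepare_for_gpu(sdfg) before compiling. Whether simplify removes the problematic nested level in every case is not something this sketch guarantees.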