Broadcasts to Shared Memory on GPU Runs in Serial
Describe the bug When running code on a GPU, if you have a block of shared memory and broadcast a value to it, the generated CUDA performs the assignment serially. In my reproducer the performance and behavior are identical either way, but I have encountered a case where this bug led to slower code that produced incorrect results.
To Reproduce Steps to reproduce the behavior:
import dace

# Domain and tile-size symbols; the generated code below corresponds to I1 = J1 = 32
I, J = dace.symbol('I'), dace.symbol('J')
I1, J1 = dace.symbol('I1'), dace.symbol('J1')

@dace.program
def reproduce():
    value: dace.float32 = 0
    for i, j in dace.map[0:I:I1, 0:J:J1]:
        smem = dace.ndarray((I1, J1), dtype=dace.float32, storage=dace.StorageType.GPU_Shared)
        smem[:] = 0.0
        for i1, j1 in dace.map[0:I1, 0:J1]:
            value += smem[i1, j1]
    return value

sdfg = reproduce.to_sdfg()
sdfg.apply_gpu_transformations()
blockentry = find_map_by_param(sdfg, 'j1')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock
print(sdfg())
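Note: find_map_by_param is not defined in the snippet above. A minimal sketch of such a helper, assuming the pattern commonly used in the DaCe tutorials (not necessarily the exact function used here), would be:

def find_map_by_param(sdfg: dace.SDFG, pname: str) -> dace.nodes.MapEntry:
    # Return the first MapEntry whose map uses the given parameter name
    return next(n for n, _ in sdfg.all_nodes_recursive()
                if isinstance(n, dace.nodes.MapEntry) and pname in n.map.params)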
This code outputs the following CUDA for smem[:] = 0:
__shared__ float smem[1024];
{
    for (auto __i0 = 0; __i0 < 32; __i0 += 1) {
        for (auto __i1 = 0; __i1 < 32; __i1 += 1) {
            {
                float __out;
                ///////////////////
                // Tasklet code (assign_13_8)
                __out = 0.0;
                ///////////////////
                smem[((32 * __i0) + __i1)] = __out;
            }
        }
    }
}
Expected behavior
Since the outermost loop is a GPU_Device loop, the broadcast should be a GPU_ThreadBlock loop. Using the following Python code:
@dace.program
def correct():
    value: dace.float32 = 0
    for i, j in dace.map[0:I:I1, 0:J:J1]:
        smem = dace.ndarray((I1, J1), dtype=dace.float32, storage=dace.StorageType.GPU_Shared)
        for k, l in dace.map[0:I1, 0:J1]:
            smem[k, l] = 0.0
        for i1, j1 in dace.map[0:I1, 0:J1]:
            value += smem[i1, j1]
    return value

sdfg = correct.to_sdfg()
sdfg.apply_gpu_transformations()
blockentry = find_map_by_param(sdfg, 'j1')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock
blockentry = find_map_by_param(sdfg, 'l')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock
print(sdfg())
The correct CUDA is generated:
__shared__ float smem[1024];
{
    {
        {
            int l = threadIdx.x;
            int k = threadIdx.y;
            {
                {
                    {
                        float __out;
                        ///////////////////
                        // Tasklet code (assign_28_12)
                        __out = 0.0;
                        ///////////////////
                        smem[((32 * k) + l)] = __out;
                    }
                }
            }
        }
    }
}
Desktop (please complete the following information):
- OS: RHEL 7.9
- DaCe Version: 0.14.2
Hello, and thank you for reporting this bug. Could you check if the issue has been fixed in the latest master?
As of the current master, the issue appears to persist.
It is possible to get the desired behavior by calling GPUTransform in the following manner:
sdfg.apply_gpu_transformations(sequential_innermaps=False)
The generated code should be:
__global__ void __launch_bounds__(1024) reproduce_19_0_1_0(float * __restrict__ value) {
    {
        {
            int j = (32 * blockIdx.x);
            int i = (32 * blockIdx.y);
            __shared__ float smem[1024];
            {
                {
                    {
                        int __i1 = threadIdx.x;
                        int __i0 = threadIdx.y;
                        {
                            {
                                {
                                    float __out;
                                    ///////////////////
                                    // Tasklet code (assign_21_8)
                                    __out = 0.0;
                                    ///////////////////
                                    smem[((32 * __i0) + __i1)] = __out;
                                }
                            }
                        }
                    }
                }
            }
            __syncthreads();
            {
                ...
That makes sense. I suppose I'd argue that this should be the default behavior, though, and you shouldn't have to "opt in" with sequential_innermaps=False.
It is a bit more complicated because the primary target for the GPU transformation is generic/CPU-based SDFGs. Such SDFGs will have arbitrary Map nests, where the inner Maps may not match any thread-block sizes implied by the outer Maps. This is why the default behavior is to render inner Maps sequentially, which unfortunately doesn't always work well in cases where the SDFG is already partially GPU-transformed.
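For example, in a generic SDFG such as the following sketch (hypothetical, for illustration only, using assumed symbols N and M), the inner Map extent is not tied to any particular thread-block size at transformation time:

import dace

N, M = dace.symbol('N'), dace.symbol('M')

@dace.program
def generic_nest(A: dace.float64[N, M]):
    for i in dace.map[0:N]:
        # The inner extent M need not correspond to a fixed thread-block size,
        # so the GPU transformation conservatively schedules this Map
        # sequentially by default (sequential_innermaps=True).
        for j in dace.map[0:M]:
            A[i, j] = 2 * A[i, j]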
The bug that you describe above is clear, and we will look into fixing it. Thank you for reporting it!