Broadcasts to Shared Memory on GPU Runs in Serial
Describe the bug When running code on a GPU, if you have a block of shared memory and broadcast a value to it, the generated CUDA performs the assignment serially. In my reproducer the performance and behavior are identical either way, but I have encountered a case where this bug led to slower code that produced incorrect results.
To Reproduce Steps to reproduce the behavior:
import dace

# Domain and tile-size symbols; the generated code below corresponds to I1 = J1 = 32
I, J = dace.symbol('I'), dace.symbol('J')
I1, J1 = dace.symbol('I1'), dace.symbol('J1')

@dace.program
def reproduce():
    value: dace.float32 = 0
    for i, j in dace.map[0:I:I1, 0:J:J1]:
        smem = dace.ndarray((I1, J1), dtype=dace.float32, storage=dace.StorageType.GPU_Shared)
        smem[:] = 0.0
        for i1, j1 in dace.map[0:I1, 0:J1]:
            value += smem[i1, j1]
    return value

sdfg = reproduce.to_sdfg()
sdfg.apply_gpu_transformations()
blockentry = find_map_by_param(sdfg, 'j1')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock
print(sdfg())
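Note: find_map_by_param is not defined in the snippet above. A minimal sketch of such a helper, assuming the pattern commonly used in the DaCe tutorials (not necessarily the exact function used here), would be:

def find_map_by_param(sdfg: dace.SDFG, pname: str) -> dace.nodes.MapEntry:
    # Return the first MapEntry whose map uses the given parameter name
    return next(n for n, _ in sdfg.all_nodes_recursive()
                if isinstance(n, dace.nodes.MapEntry) and pname in n.map.params)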
This code outputs the following CUDA for smem[:] = 0:
__shared__ float smem[1024];
{
    for (auto __i0 = 0; __i0 < 32; __i0 += 1) {
        for (auto __i1 = 0; __i1 < 32; __i1 += 1) {
            {
                float __out;
                ///////////////////
                // Tasklet code (assign_13_8)
                __out = 0.0;
                ///////////////////
                smem[((32 * __i0) + __i1)] = __out;
            }
        }
    }
}
Expected behavior
Since the outermost loop is a GPU_Device loop, the broadcast should be a GPU_ThreadBlock loop. Using the following Python code:
@dace.program
def correct():
    value: dace.float32 = 0
    for i, j in dace.map[0:I:I1, 0:J:J1]:
        smem = dace.ndarray((I1, J1), dtype=dace.float32, storage=dace.StorageType.GPU_Shared)
        for k, l in dace.map[0:I1, 0:J1]:
            smem[k, l] = 0.0
        for i1, j1 in dace.map[0:I1, 0:J1]:
            value += smem[i1, j1]
    return value

sdfg = correct.to_sdfg()
sdfg.apply_gpu_transformations()
blockentry = find_map_by_param(sdfg, 'j1')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock
blockentry = find_map_by_param(sdfg, 'l')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock
print(sdfg())
The correct CUDA is generated:
__shared__ float smem[1024];
{
    {
        {
            int l = threadIdx.x;
            int k = threadIdx.y;
            {
                {
                    {
                        float __out;
                        ///////////////////
                        // Tasklet code (assign_28_12)
                        __out = 0.0;
                        ///////////////////
                        smem[((32 * k) + l)] = __out;
                    }
                }
            }
        }
    }
}
Desktop (please complete the following information):
- OS: RHEL 7.9
- DaCe Version: 0.14.2
Hello, and thank you for reporting this bug. Could you check if the issue has been fixed in the latest master?
As of the current master, the issue appears to persist.
It is possible to get the desired behavior by calling GPUTransform in the following manner:
sdfg.apply_gpu_transformations(sequential_innermaps=False)
The generated code should be:
__global__ void __launch_bounds__(1024) reproduce_19_0_1_0(float * __restrict__ value) {
    {
        {
            int j = (32 * blockIdx.x);
            int i = (32 * blockIdx.y);
            __shared__ float smem[1024];
            {
                {
                    {
                        int __i1 = threadIdx.x;
                        int __i0 = threadIdx.y;
                        {
                            {
                                {
                                    float __out;
                                    ///////////////////
                                    // Tasklet code (assign_21_8)
                                    __out = 0.0;
                                    ///////////////////
                                    smem[((32 * __i0) + __i1)] = __out;
                                }
                            }
                        }
                    }
                }
            }
            __syncthreads();
            {
                ...
That makes sense. I suppose I'd argue that this should be the default behavior, though, and you shouldn't have to "opt in" with sequential_innermaps=False.
It is a bit more complicated because the primary target for the GPU transformation is generic/CPU-based SDFGs. Such SDFGs will have arbitrary Map nests, where the inner Maps may not match any thread-block sizes implied by the outer Maps. This is why the default behavior is to render inner Maps sequentially, which unfortunately doesn't always work well in cases where the SDFG is already partially GPU-transformed.
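For example, in a generic SDFG such as the following sketch (hypothetical, for illustration only, using assumed symbols N and M), the inner Map extent is not tied to any particular thread-block size at transformation time:

import dace

N, M = dace.symbol('N'), dace.symbol('M')

@dace.program
def generic_nest(A: dace.float64[N, M]):
    for i in dace.map[0:N]:
        # The inner extent M need not correspond to a fixed thread-block size,
        # so the GPU transformation conservatively schedules this Map
        # sequentially by default (sequential_innermaps=True).
        for j in dace.map[0:M]:
            A[i, j] = 2 * A[i, j]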
The bug that you describe above is clear, and we will look into fixing it. Thank you for reporting it!