DirectXShaderCompiler icon indicating copy to clipboard operation
DirectXShaderCompiler copied to clipboard

DXC emits racey zero-init code when numthreads(1,1,1)

Open amaiorano opened this issue 1 year ago • 3 comments

Description With a compute shader with numthreads(1,1,1) where a groupshared variable is initialized to zero, DXC emits different code depending on whether the write is conditional on the group index. One of these results in race conditions on some GPUs.

Steps to Reproduce

See this compiler explorer example: https://godbolt.org/z/GPjP58To9

The interesting bit is the function that initializes the groupshared var to 0:

void tint_zero_workgroup_memory(uint local_idx) {
#ifdef CHECK_LOCAL_INDEX
    if (local_idx < 1)
#endif
    {
        testVar = (0.0f).xxx;
    }
    GroupMemoryBarrierWithGroupSync();
}

There are 2 compilations, one without CHECK_LOCAL_INDEX defined, and one where it is defined. The diff is also shown at the bottom. The main things to note between the two:

  • When not checking the index, DXC emits the variable with the zeroinitializer tag, and there is no explicit code to zero-init the variable before the call to @dx.op.barrier.

  • Whereas with the index check, it emits the same variable with the undef tag, and there is now explicit code to zero-init the variable:

  store float 0.000000e+00, float addrspace(3)* getelementptr inbounds ([3 x float], [3 x float] addrspace(3)* @"\01?testVar@@3V?$vector@M$02@@A.v", i32 0, i32 0), align 4, !dbg !92
  store float 0.000000e+00, float addrspace(3)* getelementptr inbounds ([3 x float], [3 x float] addrspace(3)* @"\01?testVar@@3V?$vector@M$02@@A.v", i32 0, i32 1), align 4, !dbg !92
  store float 0.000000e+00, float addrspace(3)* getelementptr inbounds ([3 x float], [3 x float] addrspace(3)* @"\01?testVar@@3V?$vector@M$02@@A.v", i32 0, i32 2), align 4, !dbg !92
  br label %"\01?tint_zero_workgroup_memory@@[email protected]", !dbg !94

"\01?tint_zero_workgroup_memory@@[email protected]":
  call void @dx.op.barrier(i32 80, i32 9), !dbg !95

Our compiler (Tint) emits HLSL without a group index check when numthreads(1,1,1), since we reason that at most one thread is executing the code. In practice, we have seen this fail on some GPUs intermittently - in other words, the zero-init is racey. We can fix this, of course, by simply adding the conditional check for the index; but we'd like to know:

  1. Is this is actually a bug in DXC?
  2. Or is the problem that when writing HLSL, we cannot assume that the compiler uses numthreads to reason about data races, and must always guard access to shared variables?
  3. What does the zeroinitializer tag actually mean? Are GPUs supposed to ensure the memory is zero-initialized before any workgroup threads begin execution? Are certain GPU implementations wrong if they're not handling this properly?

amaiorano avatar Feb 22 '24 20:02 amaiorano

It looks to me like the case where the condition is removed DXC is promoting the unconditional assignment to just being the initializer on the global declaration. I believe that is totally valid in DXIL, so it is likely you're encountering an IHV bug.

@tex3d, @pow2clk, does that sound right to you two? If so, we should probably track this for our conformance testing too.

llvm-beanz avatar Feb 22 '24 21:02 llvm-beanz

#2122 introduced this pass. It seemed targeted for static globals rather than groupshared. That may have been in errror.

pow2clk avatar Feb 29 '24 16:02 pow2clk

#2122 introduced this pass. It seemed targeted for static globals rather than groupshared. That may have been in errror.

Is there documentation on how zeroinitializer should be interpreted for groupshared variables for IHVs? I think this would make clear whether the mistake is in DXC or in AMD's (and potentially other IHVs') driver.

amaiorano avatar Mar 05 '24 15:03 amaiorano