DirectXShaderCompiler
DirectXShaderCompiler copied to clipboard
Incorrect control flow modifications in ScopeNestedCFG pass cause WARP to have wrong shader behavior
Description A customer reported incorrect results when running a shader on WARP. Arbitrary tweaks to the source HLSL cause DXC to produce drastically different control flow (due to fewer static analysis opportunities), which can work around the problem. The root cause of WARP's incorrect behavior is due to a bug introduced by the ScopeNestedCFG pass that WARP uses to structurize the CFG before translating into its own IR.
Steps to Reproduce
I was given the attached PIX capture. That PIX capture contains HLSL sources (if you have issues viewing source from the capture, I can extract it and upload it separately). Running that PIX capture on a debug version of WARP, I was able to extract the IR before and after the ScopeNestedCFG passes. See before and after. Note that the before has already had the CFG simplification, loop simplification, and createDemoteRegisterToMemoryHlslPass
(including breaking critical edges) passes run on it. The original is also available: original.
The key parts of these dumps to focus on are the end of the second nested loop:
Original:
; <label>:70 ; preds = %52, %51
%71 = phi i32 [ %48, %51 ], [ %46, %52 ]
%72 = getelementptr inbounds [32 x i32], [32 x i32]* @"\01?Scratch@ShaderTestPrivate@@3UPerThreadScratchData@[email protected]", i32 0, i32 %48
%73 = load i32, i32* %72, align 4, !tbaa !10
%74 = icmp eq i32 %73, 2
br i1 %74, label %76, label %75
; <label>:75 ; preds = %70
store i32 4, i32* %72, align 4, !tbaa !10
br label %63
; <label>:76 ; preds = %70
store i32 3, i32* %72, align 4, !tbaa !10
store i32 2, i32* getelementptr inbounds ([32 x i32], [32 x i32]* @"\01?Scratch@ShaderTestPrivate@@3UPerThreadScratchData@[email protected]", i32 0, i32 2), align 4, !tbaa !10
%77 = add nsw i32 %49, 1
br label %45
Note that %72
is produced in one block, and then used in both the if/else blocks that succeed it. The IR prior to scope nested CFG looks similar:
; <label>:71 ; preds = %._crit_edge.9, %51
%72 = load i32, i32* %.reg2mem87
store i32 %72, i32* %.reg2mem21
%.reload37 = load i32, i32* %.reg2mem34
%73 = getelementptr inbounds [32 x i32], [32 x i32]* @"\01?Scratch@ShaderTestPrivate@@3UPerThreadScratchData@[email protected]", i32 0, i32 %.reload37
%74 = load i32, i32* %73, align 4, !tbaa !10
%75 = icmp eq i32 %74, 2
br i1 %75, label %77, label %76
; <label>:76 ; preds = %71
store i32 4, i32* %73, align 4, !tbaa !10
%.reload23 = load i32, i32* %.reg2mem21
%.reload35 = load i32, i32* %.reg2mem34
store i32 %.reload35, i32* %.reg2mem88
store i32 %.reload23, i32* %.reg2mem89
br label %64
; <label>:77 ; preds = %71
store i32 3, i32* %73, align 4, !tbaa !10
store i32 2, i32* getelementptr inbounds ([32 x i32], [32 x i32]* @"\01?Scratch@ShaderTestPrivate@@3UPerThreadScratchData@[email protected]", i32 0, i32 2), align 4, !tbaa !10
%.reload32 = load i32, i32* %.reg2mem30
%78 = add nsw i32 %.reload32, 1
store i32 %78, i32* %.reg2mem19
%.reload20 = load i32, i32* %.reg2mem19
%.reload22 = load i32, i32* %.reg2mem21
store i32 %.reload20, i32* %.reg2mem91
store i32 2, i32* %.reg2mem92
store i32 2, i32* %.reg2mem93
store i32 %.reload22, i32* %.reg2mem94
br label %45
The pointer value is now %73
but the structure is the same.
The problem arises from the scope nested CFG pass, which starts to duplicate some of these basic blocks so that they can be appropriately nested inside each other for control flow reasons. It produces the following blocks:
; <label>:121 ; preds = %._crit_edge.9
%122 = load i32, i32* %.reg2mem87
store i32 %122, i32* %.reg2mem21
%.reload37 = load i32, i32* %.reg2mem34
%123 = getelementptr inbounds [32 x i32], [32 x i32]* @"\01?Scratch@ShaderTestPrivate@@3UPerThreadScratchData@[email protected]", i32 0, i32 %.reload37
%124 = load i32, i32* %123, align 4, !tbaa !12
%125 = icmp eq i32 %124, 2
br i1 %125, label %132, label %dx.LoopExitHelper.123, !dx.BranchKind !18
; <label>:126 ; preds = %dx.EndIfScope.142
%127 = load i32, i32* %.reg2mem87
store i32 %127, i32* %.reg2mem21
%.reload37.143 = load i32, i32* %.reg2mem34
%128 = getelementptr inbounds [32 x i32], [32 x i32]* @"\01?Scratch@ShaderTestPrivate@@3UPerThreadScratchData@[email protected]", i32 0, i32 %.reload37.143
%129 = load i32, i32* %128, align 4, !tbaa !12
%130 = icmp eq i32 %129, 2
br i1 %130, label %134, label %dx.LoopExitHelper.123.148, !dx.BranchKind !18
; <label>:131 ; preds = %dx.LoopExitHelperIf.125
store i32 4, i32* %123, align 4, !tbaa !12
%.reload23 = load i32, i32* %.reg2mem21
%.reload35 = load i32, i32* %.reg2mem34
store i32 %.reload35, i32* %.reg2mem88
store i32 %.reload23, i32* %.reg2mem89
br label %dx.EndIfScope.179
; <label>:132 ; preds = %121
store i32 3, i32* %123, align 4, !tbaa !12
store i32 2, i32* getelementptr inbounds ([32 x i32], [32 x i32]* @"\01?Scratch@ShaderTestPrivate@@3UPerThreadScratchData@[email protected]", i32 0, i32 2), align 4, !tbaa !12
%.reload32 = load i32, i32* %.reg2mem30
%133 = add nsw i32 %.reload32, 1
store i32 %133, i32* %.reg2mem19
%.reload20 = load i32, i32* %.reg2mem19
%.reload22 = load i32, i32* %.reg2mem21
store i32 %.reload20, i32* %.reg2mem91
store i32 2, i32* %.reg2mem92
store i32 2, i32* %.reg2mem93
store i32 %.reload22, i32* %.reg2mem94
br label %dx.LoopContinue.132
; <label>:134 ; preds = %126
store i32 3, i32* %128, align 4, !tbaa !12
store i32 2, i32* getelementptr inbounds ([32 x i32], [32 x i32]* @"\01?Scratch@ShaderTestPrivate@@3UPerThreadScratchData@[email protected]", i32 0, i32 2), align 4, !tbaa !12
%.reload32.144 = load i32, i32* %.reg2mem30
%135 = add nsw i32 %.reload32.144, 1
store i32 %135, i32* %.reg2mem19
%.reload20.145 = load i32, i32* %.reg2mem19
%.reload22.146 = load i32, i32* %.reg2mem21
store i32 %.reload20.145, i32* %.reg2mem91
store i32 2, i32* %.reg2mem92
store i32 2, i32* %.reg2mem93
store i32 %.reload22.146, i32* %.reg2mem94
br label %dx.LoopContinue.132.147
The whole CFG is too complicated to paste here, but you'll note that now there are two pointers, %123
produced in block 121, and %128
produced in block 126. There are also now 3 consumers, in blocks 131, 132, and 134, two of which are consuming %123
and one consuming %128
. The problem is that there exists a path through the control flow which enters block 126 producing %128
, and then block 131 consuming %123
, but %123
wasn't produced.
The result is that WARP ends up writing the value 4 to the wrong location in memory, and when the logic in the outer loop attempts to read it, it reads the wrong value, resulting in the outer loop terminating too early.
And just for my own record before I lose context on this, the path through the final CFG that's broken is:
- 126
- dx.LoopExitHelper.123.148
- dx.LoopExit.118
- dx.LoopExitHelperIf.131
- dx.LoopExitHelperIf.129
- dx.LoopExitHelperIf.127
- dx.LoopExitHelperIf.125
- 131
@jenatali - we're not going to schedule time to investigate/fix this issue, but we would happily review and accept a PR addressing it (which is why it is on the dormant milestone).