Failure on the unit test "CPU/BackendCorrectnessTest.dataParallelStackingTest"
My glow build encounters a failure on the "CPU/BackendCorrectnessTest.dataParallelStackingTest" unit test. The test expects [3, 4] as the outcome, but my build generates [3, 3]. I built glow on Ubuntu 20.04 with LLVM 10 and also tested with LLVM 8, but saw the same error message. I think [3, 3] is also a valid result, depending on what assumption is made about how the operations read and write memory; my analysis is below.
The unit test performs several elementwise adds on overlapping tensor views. act is a 3-entry 1D tensor; tv1 and tv2 are 2-entry 1D views of act that start at its first and second entry, respectively. Thus tv1 and tv2 partially overlap, as depicted in the following text diagram.
act | 0 | 0 | 0 |
tv1 | 0 | 0 |
tv2     | 0 | 0 |
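For reference, the same overlap can be modeled with plain pointers (a sketch for illustration only, not the actual test code):
// act and its two overlapping two-element views.
float act[3] = {0.0f, 0.0f, 0.0f};
float *tv1 = &act[0]; // views act[0..1]
float *tv2 = &act[1]; // views act[1..2]
// tv1[1] and tv2[0] refer to the same element, act[1].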
The test executes the following operations on tv1 and tv2.
bb.createElementAddInst("elem_add1", tv1, tv1, one);
bb.createElementAddInst("elem_add2", tv2, tv2, tv1);
bb.createElementAddInst("elem_add3", output, tv2, tv1);
The problem happens at "elem_add2". After executing "elem_add1", the tensors look like this:
act | 1 | 1 | 0 |
tv1 | 1 | 1 |
tv2     | 1 | 0 |
"elem_add2" can produce two different results depending on when the tensors are read and updated.
Case 1) If memory is updated right after each element is computed, the updated value is read when the next element is computed. In this example, we first perform "tv2[0] <= tv1[0] + tv2[0]". Since tv2[0] and tv1[1] point to the same memory location, we see the updated value for tv1[1] when computing the next element, "tv2[1] <= tv1[1] + tv2[1]", which creates a dependency between the first and second element operations (see the sketch below the diagram). In this case, the result would be:
act | 1 | 2 | 2 |
tv1 | 1 | 2 |
tv2     | 2 | 2 |
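To make Case 1 concrete, here is a plain C++ sketch of that semantics (an assumed model for illustration, not Glow's actual kernel):
// Case 1 sketch: each result is stored immediately, so the write to
// tv2[0] (== act[1]) is visible as tv1[1] in the next iteration.
float act[3] = {1.0f, 1.0f, 0.0f}; // state after elem_add1
float *tv1 = &act[0], *tv2 = &act[1];
for (int i = 0; i < 2; ++i)
  tv2[i] = tv2[i] + tv1[i];
// act is now {1, 2, 2}; elem_add3 then produces output = {3, 4},
// the outcome the unit test expects.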
Case 2) However, the kernel might read all elements of tv1 and tv2 into registers before computing; then tv1 and tv2 would not observe values updated in the middle of the computation (see the sketch below the diagram). If that is allowed, the result would be:
act | 1 | 2 | 1 |
tv1 | 1 | 2 |
tv2     | 2 | 1 |
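And a sketch of Case 2 semantics (again, only an assumed model for illustration):
// Case 2 sketch: all inputs are read into temporaries ("registers")
// before any result is stored, so the write to tv2[0] is not observed.
float act[3] = {1.0f, 1.0f, 0.0f}; // state after elem_add1
float *tv1 = &act[0], *tv2 = &act[1];
float in1[2] = {tv1[0], tv1[1]};
float in2[2] = {tv2[0], tv2[1]};
for (int i = 0; i < 2; ++i)
  tv2[i] = in2[i] + in1[i];
// act is now {1, 2, 1}; elem_add3 then produces output = {3, 3},
// the outcome my build generates.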
For further analysis, I checked the generated IR and assembly code. The following is the generated IR corresponding to "elem_add2":
; Function Attrs: nofree noinline norecurse nounwind
define internal fastcc void @libjit_stacked_kernel.3_3_specialized(float* noalias nocapture %0, float* noalias nocapture readonly %1) unnamed_addr #2 {
entry:
%2 = load float, float* %0, align 4
%3 = load float, float* %1, align 4
%4 = fadd reassoc nsz arcp contract float %2, %3
store float %4, float* %0, align 4
%5 = getelementptr float, float* %0, i64 1
%6 = load float, float* %5, align 4
%7 = getelementptr inbounds float, float* %1, i64 1
%8 = load float, float* %7, align 4
%9 = fadd reassoc nsz arcp contract float %6, %8
store float %9, float* %5, align 4
ret void
}
Since ElementAdd is by definition a data-parallel operator, the "noalias" flag is set on its operands. The backend is therefore free to generate an optimized binary that ignores the dependency between tv1 and tv2 (in particular, between tv1[1] and tv2[0]). In my build, the backend ignores the dependency between tv1[1] and tv2[0], matching Case 2. Here is the generated assembly of the specialized stacked kernel:
.p2align 4, 0x90
.type libjit_stacked_kernel.4_4_specialized,@function
libjit_stacked_kernel.4_4_specialized:
vmovss (%rsi), %xmm0
vaddss (%rdx), %xmm0, %xmm0
vmovss %xmm0, (%rdi)
vmovss 4(%rsi), %xmm0
vaddss 4(%rdx), %xmm0, %xmm0
vmovss %xmm0, 4(%rdi)
retq
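The same effect can be reproduced outside Glow with restrict-qualified pointers (a hypothetical sketch; __restrict is a common compiler extension, and this is not Glow code):
// With __restrict the compiler may assume dst and src do not overlap,
// so it may load src[1] before storing dst[0] (Case 2) or keep the
// original order (Case 1); with overlapping views both outcomes are
// permitted by the no-overlap assumption.
void add2(float *__restrict dst, const float *__restrict src) {
  dst[0] = dst[0] + src[0];
  dst[1] = dst[1] + src[1];
}
// Calling add2(tv2, tv1) with the overlapping views breaks the
// no-overlap contract, just like the noalias annotation in the IR.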
So the current implementation can generate two different results, depending on the backend implementation and which assumption is made about the operations. Which assumption does glow rely on, Case 1 or Case 2?
I am facing the same issue when using the Docker image provided by the Dockerfile in glow/utils/docker/. @hanhwi Did you find a solution to it in the meantime?
@yannickl96 Unfortunately, I have no solution for that so far.