LoopVectorization.jl
LoopVectorization.jl copied to clipboard
Handle loop carried dependencies that are apparent from constant offsets
using LoopVectorization
function grad!(π₯rec, π₯β, rec)
for i in 2:size(π₯rec, 1)-1
for j in 2:size(π₯rec, 2)-1
β°π1 = conj(π₯β[1])
β°π2 = rec[i, j + 1] + rec[i, j - 1]
β°π3 = rec[i + 1, j] + β°π2
β°π4 = rec[i - 1, j] + β°π3
β°π5 = 4 * rec[i, j]
β°π6 = β°π4 - β°π5
β°π7 = 2β°π6
β°π8 = β°π1 * β°π7
β°π9 = conj(β°π8)
β°π10 = rec[i - 1, j] + β°π3
β°π11 = β°π10 - β°π5
β°π12 = 2β°π11
β°π13 = β°π1 * β°π12
β°π14 = conj(β°π13)
β°π15 = rec[i - 1, j] + β°π3
β°π16 = β°π15 - β°π5
β°π17 = 2β°π16
β°π18 = β°π1 * β°π17
β°π19 = conj(β°π18)
β°π20 = rec[i - 1, j] + β°π3
β°π21 = β°π20 - β°π5
β°π22 = 2β°π21
β°π23 = β°π1 * β°π22
β°π24 = conj(β°π23)
β°π25 = rec[i - 1, j] + β°π3
β°π26 = β°π25 - β°π5
β°π27 = 2β°π26
β°π28 = -4β°π27
β°π29 = β°π1 * β°π28
β°π30 = conj(β°π29)
π₯rec[i - 1, j] = π₯rec[i - 1, j] + β°π9
π₯rec[i + 1, j] = π₯rec[i + 1, j] + β°π14
π₯rec[i, j + 1] = π₯rec[i, j + 1] + β°π19
π₯rec[i, j - 1] = π₯rec[i, j - 1] + β°π24
π₯rec[i, j] = π₯rec[i, j] + β°π30
end
end
end
function gradavx!(π₯rec, π₯β, rec)
@avx for i in 2:size(π₯rec, 1)-1
for j in 2:size(π₯rec, 2)-1
β°π1 = conj(π₯β[1])
β°π2 = rec[i, j + 1] + rec[i, j - 1]
β°π3 = rec[i + 1, j] + β°π2
β°π4 = rec[i - 1, j] + β°π3
β°π5 = 4 * rec[i, j]
β°π6 = β°π4 - β°π5
β°π7 = 2β°π6
β°π8 = β°π1 * β°π7
β°π9 = conj(β°π8)
β°π10 = rec[i - 1, j] + β°π3
β°π11 = β°π10 - β°π5
β°π12 = 2β°π11
β°π13 = β°π1 * β°π12
β°π14 = conj(β°π13)
β°π15 = rec[i - 1, j] + β°π3
β°π16 = β°π15 - β°π5
β°π17 = 2β°π16
β°π18 = β°π1 * β°π17
β°π19 = conj(β°π18)
β°π20 = rec[i - 1, j] + β°π3
β°π21 = β°π20 - β°π5
β°π22 = 2β°π21
β°π23 = β°π1 * β°π22
β°π24 = conj(β°π23)
β°π25 = rec[i - 1, j] + β°π3
β°π26 = β°π25 - β°π5
β°π27 = 2β°π26
β°π28 = -4β°π27
β°π29 = β°π1 * β°π28
β°π30 = conj(β°π29)
π₯rec[i - 1, j] = π₯rec[i - 1, j] + β°π9
π₯rec[i + 1, j] = π₯rec[i + 1, j] + β°π14
π₯rec[i, j + 1] = π₯rec[i, j + 1] + β°π19
π₯rec[i, j - 1] = π₯rec[i, j - 1] + β°π24
π₯rec[i, j] = π₯rec[i, j] + β°π30
end
end
end
x = rand(10,10);
dx1 = fill!(similar(x), 0); dx2 = fill!(similar(x), 0);
del = ones(1);
grad!(dx1, del, x); gradavx!(dx2, del, x)
dx1 .- dx2
Do you get a larger disagreement here than this?
julia> dx1 .- dx2 |> extrema
(-3.552713678800501e-15, 3.552713678800501e-15)
julia> dx1 .- dx2 |> extrema
(-4.304807428601304, 7.271464924157296)
You should be able to replicate results like that with @avx unroll=(1,4) for...
Ah indeed. unroll=4 did not, but (1,4) gives large numbers.
Awesome, would be curious to see how you did it. But no rush, obviously.
By making it a lot more conservative with respect to non-aliasing (e.g., stop it from reordering stores, and remove all the flags communicating non-aliasing to LLVM).
But the cost modeling had trouble, because it still wanted to unroll the j loop, even though this was very slow with the above adjustments:
Data would have have to repeatedly move through stores and reloads in quick succession, which is something LoopVectorization hadn't been set up to account for (assuming non-aliasing, that doesn't happen).
Unrolling the i loop by 10 led to those impressive speeds. I think the extreme factors helped, because it lets us get around the aliasing problems and actually have independent operations with which to take advantage of out of order execution.
Otherwise we're crippled by the slow store/load dependency chain through π₯rec.
The better solution than just preventing it from unrolling j to force it to unroll i is to actually (a) model this dependency chain, and (b) convert the dependency chain from a sequence of loads and stores into a sequence of reassignments (free) and shuffles (shuffles are needed for i - 1 -> i + 1 -> i):
π₯rec[i - 1, j] = π₯rec[i - 1, j] + β°π9
π₯rec[i + 1, j] = π₯rec[i + 1, j] + β°π14 # β°π14 == β°π9
π₯rec[i, j + 1] = π₯rec[i, j + 1] + β°π19 # β°π19 == β°π9
π₯rec[i, j - 1] = π₯rec[i, j - 1] + β°π24 # β°π24 == β°π9
π₯rec[i, j] = π₯rec[i, j] + β°π30
With it actually understanding this dependency chain, it could decide for itself whether to break it up by unrolling i a lot, or if it's better to unroll j to get more savings on stores/reloads (because it doesn't do the shuffles and reassignments correctly here in the face of aliasing yet, the j option was super slow, but it'll probably be pretty fast once it learns how to do it correctly).