The AllocOptPass is potentially slow (and runs too often) in some cases
Looking a bit into why https://github.com/JuliaLang/julia/pull/54520 caused such a big latency improvement, it turns out that the issue (or at least one of the issues) is that with the broadcast code a large majority of the time is spent in our own alloc optimization pass:
As can be seen, this pass runs four times, each time taking quite a long time. The time spent in this pass almost completely disappears when the broadcasting code is replaced with a loop (as was done in https://github.com/JuliaLang/julia/pull/54520).
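For anyone wanting to reproduce per-pass timings, one possible approach (assuming LLVM's --time-passes option is honored when passed through the JULIA_LLVM_ARGS environment variable on the build in question) is to run the workload in a fresh process and read the timing report LLVM prints at exit:

# Hypothetical reproduction: LLVM prints a per-pass timing report when the process exits;
# the AllocOpt entries should show up there if the pipeline's instrumentation picks them up.
JULIA_LLVM_ARGS="--time-passes" julia -e 'using LinearAlgebra; LinearAlgebra.exp!(rand(2,2))'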
There might be some latency gains to be had here if one can:
- Figure out why this pass seems to have such a problem with the broadcasting code (a minimal comparison is sketched after this list)
- Understand whether it really is required to run the pass four times.
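For the first point, a minimal A/B comparison in the spirit of https://github.com/JuliaLang/julia/pull/54520 (broadcast replaced by an explicit loop) might help narrow things down. The kernels below are made-up stand-ins to illustrate the kind of comparison, not the actual code from that PR:

# Toy kernels, only meant to contrast broadcast vs. an explicit loop on freshly compiled methods:
f_broadcast!(out, A, B) = (out .= 2 .* A .+ B .* A .- B)

function f_loop!(out, A, B)
    @inbounds for i in eachindex(out, A, B)
        out[i] = 2 * A[i] + B[i] * A[i] - B[i]
    end
    return out
end

A = rand(64, 64); B = rand(64, 64); out = similar(A);

# first calls in a fresh session, so these mostly measure compilation time
@time @eval f_broadcast!(out, A, B);
@time @eval f_loop!(out, A, B);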
FWIW, the code generated by LinearAlgebra.exp is pretty large:
julia> write("file.ll", sprint(io->code_llvm(io, LinearAlgebra.exp!, Tuple{Array{Float64, 2}};
                                     optimize=false, debuginfo=:none, dump_module=true)))
shell> ls -lah *.ll
-rw-r--r-- 1 tim tim 529K May 21 13:51 1.10.ll
-rw-r--r-- 1 tim tim 1.7M May 21 13:54 1.11.ll
-rw-r--r-- 1 tim tim 1.8M May 21 13:53 1.12-broadcast.ll
-rw-r--r-- 1 tim tim 548K May 21 13:53 1.12-loop.ll
Yes, https://github.com/JuliaLang/julia/pull/43200 makes it better:
-rw-r--r-- 1 kristoffercarlsson staff 1.8M May 21 14:44 exp_1.11.ll
-rw-r--r-- 1 kristoffercarlsson staff 491K May 21 14:36 exp_1.11_noinline.ll
but it doesn't seem to impact compilation time much:
# current
julia> @time @eval LinearAlgebra.exp!(rand(2,2))
1.988244 seconds (8.59 M allocations: 422.911 MiB, 4.09% gc time, 99.49% compilation time)
# no @inline
julia> @time @eval LinearAlgebra.exp!(rand(2,2))
2.040852 seconds (7.57 M allocations: 370.544 MiB, 3.47% gc time, 99.48% compilation time)
I guess a question is: is it reasonable for
function f(A::StridedMatrix{T}) where T
    CC = T[64764752532480000., 32382376266240000., 7771770303897600.,
           1187353796428800.,  129060195264000.,   10559470521600.,
           670442572800.,      33522128640.,       1323241920.,
           40840800.,          960960.,            16380.,
           182.,               1.]
    tmp1, tmp2 = similar(A), similar(A)
    tmp1 .= CC[14].*A .+ CC[12].*A .+ CC[10].*A
    tmp2 .= CC[8].*A .+ CC[6].*A .+ CC[4].*A
    tmp1 .= CC[13].*A .+ CC[11].*A .+ CC[9].*A
    tmp2 .= CC[7].*A .+ CC[5].*A .+ CC[3].*A
    return tmp1, tmp2
end

write("file.ll", sprint(io->code_llvm(io, f, Tuple{Array{Float64, 2}};
                                      optimize=false, debuginfo=:none, dump_module=true)))
to expand to 13k lines of LLVM IR?
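(For reference, a quick way to get that line count directly, without writing the file, is simply to count the lines in the string that code_llvm produces:)

julia> ir = sprint(io->code_llvm(io, f, Tuple{Array{Float64, 2}};
                                 optimize=false, debuginfo=:none, dump_module=true));

julia> count(==('\n'), ir)  # number of lines of unoptimized IR for f above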