JuliaInterpreter.jl
Performance opportunities
In my initial performance work, the big three improvements were local method tables (`framecode.methodtables`), reusing frame data (the `junk` mechanism), and getting rid of runtime dispatch internally within the interpreter (by manual union-splitting, i.e., `if isa(x, ConcreteType) ...`).
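For reference, manual union-splitting replaces a dynamically dispatched call with explicit branches over the concrete types that occur in practice. A minimal sketch (the `resolve_*` helper names here are hypothetical, not the interpreter's actual internals):

```julia
using Core: SSAValue, SlotNumber

# Hypothetical helpers standing in for the interpreter's real lookup paths:
resolve_ssa(x::SSAValue)        = (:ssa, x.id)
resolve_slot(x::SlotNumber)     = (:slot, x.id)
resolve_other(@nospecialize(x)) = (:literal, x)

# Manual union-splitting: explicit `isa` branches replace one dynamically
# dispatched call, so each branch can be statically dispatched.
function lookup_value(@nospecialize(x))
    if isa(x, SSAValue)
        return resolve_ssa(x)
    elseif isa(x, SlotNumber)
        return resolve_slot(x)
    else
        return resolve_other(x)
    end
end
```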
Some possible future improvements:
- [ ] Recent changes diving into `Core._apply`, `Core._apply_latest`, and `invoke` hurt performance. Perhaps some of that might be clawed back by moving more of this work into `optimize!`?
- [x] `ccall`s are slow. I just pushed a `teh/compiled_ccall` branch that contains an old attempt to create one compiled function per `ccall`. I was most interested in seeing whether it would circumvent #28 (it didn't seem to), but I don't think I looked at its runtime implications (especially on a `ccall`-heavy workload). I suspect that branch is pretty close to usable, if anyone wants to pick it up and run with it.
- [ ] Specifically look for calls to `iterate` and do something special, i.e., (1) determine whether `iterate` could possibly hit a breakpoint or error (if not, then you're allowed to play tricks), (2) see if you can determine whether the loop will be long, and if so, (3) pass the iterator and frame to a special `run_loop` function. That would have to compile for each iterator type, so it is worth doing only if the loop is long. But you would save the time spent diving in and out of `iterate`, and that could be quite helpful.
- [x] Apply similar treatment to `arrayref`s. Even our builtin handling of `arrayref` could be improved by separately treating small dimensions, e.g., `if nargs == 1 ... elseif nargs == 2 ... elseif nargs == 3 ... else <varargs version> end` (see the sketch after this list).
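For the `arrayref` item, here is a minimal sketch of the small-dimension special-casing (the function name is hypothetical; the real handling lives in the interpreter's builtin dispatch):

```julia
# Sketch: branch on the (statically known) number of indices so the common
# 1-/2-/3-dimensional cases avoid the varargs path entirely.
function arrayref_fast(A::Array, inds...)
    nargs = length(inds)
    if nargs == 1
        return A[inds[1]]
    elseif nargs == 2
        return A[inds[1], inds[2]]
    elseif nargs == 3
        return A[inds[1], inds[2], inds[3]]
    else
        return A[inds...]          # varargs fallback
    end
end
```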
Obviously spending some time with ProfileView will be useful. I did a ton of that at the beginning and fixed almost everything that I noticed. But I haven't done it in a while, and we may have regressed in places as we expanded our functionality.
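For anyone picking this up, a typical profiling session might look like this (assuming ProfileView is installed; the workload is just an example):

```julia
using JuliaInterpreter, Profile, ProfileView

@interpret sum(rand(100))            # warm up so compilation doesn't dominate
Profile.clear()
@profile @interpret sum(rand(10^4))
ProfileView.view()                   # inspect where the interpreter spends time
```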
I took a stab at `teh/compiled_ccall`, and where I'm stuck now is functions that look like

```julia
unsafe_convert(::Type{Ptr{T}}, a::Array{T}) where {T} = ccall(:jl_array_ptr, Ptr{T}, (Any,), a)
```

I try to generate something like

```julia
unsafe_convert(::Type{Ptr{T}}, a::Array{T}) where {T} =
    JuliaInterpreter.CompiledCalls.ccall_jl_array_ptr(a, T)

function JuliaInterpreter.CompiledCalls.ccall_jl_array_ptr(arg1, ::Type{T}) where {T}
    return ccall(:jl_array_ptr, Ptr{T}, (Any,), arg1)
end
```
The problem is the insertion of the `Ptr{T}`. As an example, here are two expressions that look the same:
```julia
julia> expr_from_interpreter
:(function ccall_jl_array_ptr(arg1, ::Type{T}) where T
      return ccall(:jl_array_ptr, Ptr{T}, (Any,), arg1)
  end)

julia> expr_copy_paste
:(function ccall_jl_array_ptr(arg1, ::Type{T}) where T
      return ccall(:jl_array_ptr, Ptr{T}, (Any,), arg1)
  end)
```
However, looking a bit closer:
```julia
julia> dump(expr_from_interpreter.args[2].args[2].args[1].args[3])
Ptr{T} <: Ref{T}

julia> dump(expr_copy_paste.args[2].args[2].args[1].args[3])
Expr
  head: Symbol curly
  args: Array{Any}((2,))
    1: Symbol Ptr
    2: Symbol T
```
So in the first case the type `Ptr{T}` (carrying a free typevar) has been interpolated as a value, and it doesn't "bind" to the `T` in the function definition. The result is:
```julia
julia> eval(expr_from_interpreter)
ccall_jl_array_ptr (generic function with 1 method)

julia> ccall_jl_array_ptr([1,2], Int64)
ERROR: ccall return type Ptr should have an element type (not Ptr{<:T})
Stacktrace:
 [1] ccall_jl_array_ptr(::Array{Int64,1}, ::Type{Int64}) at C:\Users\Kristoffer\Debugging\JuliaInterpreter\src\optimize.jl:223
 [2] top-level scope at none:0
```
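The binding problem can be reproduced independently of JuliaInterpreter; here is a standalone demo (the `f1`/`f2` names are made up for illustration):

```julia
tv   = TypeVar(:T)
PtrT = Ptr{tv}                     # a type carrying a *free* typevar; prints as Ptr{T}

ex_interp = :(f1(::Type{T}) where {T} = $PtrT)    # interpolated type object
ex_curly  = :(f2(::Type{T}) where {T} = Ptr{T})   # Expr(:curly, :Ptr, :T)

eval(ex_interp); eval(ex_curly)
f1(Int)    # returns Ptr{T} -- the free typevar never bound to the method's T
f2(Int)    # returns Ptr{Int64}
```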
Interesting. I'm not sure I fully understand, but can you recurse down and check for `TypeVar`s, and replace them with the corresponding expression? (I may be off-base here; if so, perhaps some kind of demo I can run?)
EDIT: all that would happen in `optimize!`, of course.
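Something along the lines of that question might look like this rough sketch (a hypothetical helper, handling only directly free typevars in a `DataType`'s parameters):

```julia
# Recursively rewrite free TypeVars back into syntax so that, after eval,
# they bind to the method's `where` parameters.
rebind_typevars(@nospecialize(x)) = x            # literals, QuoteNodes, etc.
rebind_typevars(tv::TypeVar) = tv.name           # TypeVar T  =>  :T
function rebind_typevars(t::DataType)
    any(p -> p isa TypeVar, t.parameters) || return t
    # Ptr{T} with a free T  =>  :(Ptr{T}), i.e. Expr(:curly, :Ptr, :T)
    return Expr(:curly, nameof(t), map(rebind_typevars, collect(t.parameters))...)
end
rebind_typevars(ex::Expr) = Expr(ex.head, map(rebind_typevars, ex.args)...)
```

With something like this, `eval(rebind_typevars(expr_from_interpreter))` should define a method whose `Ptr{T}` binds properly.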
What is your opinion about pushing things to `compiled_methods` that are very commonly used but unlikely to be of interest for breakpoints?
For example, `iterate(a::Array)` (with `Core._apply` now being stepped through) makes us have to do

```
[1] promote_typeof(x) at promotion.jl:263
[2] promote_typeof(x, xs) at promotion.jl:264
[3] -(a, b) at int.jl:797
[4] iterate(A, i) at array.jl:705
[5] iterate(A) at array.jl:705
```
Is it a slippery slope to start doing this?
We could temporarily pop them from `compiled_methods` when someone tries to set a breakpoint in one of them. I think it does make sense to compile these kinds of methods by default.
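Concretely, the idea might look something like this (the `which` signature is just an example):

```julia
using JuliaInterpreter

# Mark a hot, breakpoint-uninteresting method as compiled:
m = which(iterate, Tuple{Vector{Int}})
push!(JuliaInterpreter.compiled_methods, m)

# ...and remove it again if someone wants to set a breakpoint inside it:
delete!(JuliaInterpreter.compiled_methods, m)
```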
I worry about that slope. Where do we stop? Will packages suddenly fill with `@require JuliaInterpreter begin push!(...) end` blocks?
I do propose special handling of `iterate` above, one that would effectively inline the calls to `iterate` for all containers, not just `::Array`.
Also, if we start using compiled frames, one thought: I think there are only two classes of slow-to-interpret methods, ones that iterate and ones that recurse. If we can detect those specifically, we might save ourselves compile time.
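A crude loop detector could, for instance, scan lowered code for backwards jumps (a sketch, assuming a recent Julia where lowered code uses `Core.GotoNode`/`Core.GotoIfNot`):

```julia
# Heuristic: a method probably iterates if its lowered code contains a
# jump to an earlier statement index.
function has_loop(ci::Core.CodeInfo)
    for (i, stmt) in enumerate(ci.code)
        if stmt isa Core.GotoNode && stmt.label <= i
            return true
        elseif stmt isa Core.GotoIfNot && stmt.dest <= i
            return true
        end
    end
    return false
end

f(n) = (s = 0; for i in 1:n; s += i; end; s)
has_loop(code_lowered(f, Tuple{Int})[1])    # true
```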
Here is a fun one:
```julia
julia> using JuliaInterpreter, Plots

julia> @time @interpret plot(rand(5))
  5.703063 seconds (39.36 M allocations: 2.545 GiB, 6.56% gc time)
```

vs

```julia
julia> using JuliaInterpreter, CSV, Plots # note the CSV load

julia> @time @interpret plot(rand(5))
 16.160595 seconds (51.27 M allocations: 2.868 GiB, 3.46% gc time)
```
Is it all method-table searches? (Basically, `whichtt` or related?)
I haven't profiled yet, just observed it.
I think this is the problem:
```julia
julia> using BenchmarkTools

julia> c = Some{Any}("foo")
Some("foo")

julia> @code_warntype convert(Some{Any}, c)
Body::Some{Any}
1 ─ %1 = (Base.getfield)(x, :value)::Any
│   %2 = %new(Some{Any}, %1)::Some{Any}
└──      return %2

julia> @btime convert(Some{Any}, $c)
  4.299 ns (1 allocation: 16 bytes)
Some("foo")

julia> using CSV

julia> @code_warntype convert(Some{Any}, c)
Body::Some{Any}
1 ─ %1 = $(Expr(:static_parameter, 1))::Core.Compiler.Const(Any, false)
│   %2 = (Base.getfield)(x, :value)::Any
│   %3 = (Base.convert)(%1, %2)::Any
│   %4 = (Some{Any})(%3)::Some{Any}
└──      return %4

julia> @btime convert(Some{Any}, $c)
  923.077 ns (1 allocation: 16 bytes)
Some("foo")
```
With https://github.com/JuliaLang/julia/pull/31536 we are at
```julia
julia> @time @interpret plot(rand(5))
  2.930957 seconds (13.54 M allocations: 481.296 MiB, 2.31% gc time)
```

for the non-CSV case and

```julia
julia> @time @interpret plot(rand(5))
  5.765768 seconds (15.13 M allocations: 507.931 MiB, 1.52% gc time)
```

with CSV. So much better, but still significantly slower.
Changing `junk` from a `Set` back to a `Vector`, I get
```julia
julia> @time (p = @interpret(plot(rand(5))); @interpret(display(p)))
  9.845403 seconds (64.34 M allocations: 2.011 GiB, 3.19% gc time)
```

vs master

```julia
julia> @time (p = @interpret(plot(rand(5))); @interpret(display(p)))
 12.480055 seconds (66.37 M allocations: 2.219 GiB, 2.56% gc time)
```
We changed this when we found the bug where executing the same snippet twice managed to get two of the same frames into the `Vector`. However, perhaps we can be smarter here than just using a full-blown `IdSet`. We don't end up with that many frames in `junk`, so we should be able to exploit that. Also, can the same `Frame` ever end up in `junk` twice if we empty it when we create the "first" frame from toplevel?
:+1: In theory it seems possible to design the logic such that we never put the same frame in twice. In that case we can just use a `Vector` without any kind of checking. For development purposes, we might want to start with checking and an assertion, and then strip that out once we find it's no longer necessary.
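A development-mode version of that might look like the following (hypothetical names; the real `junk` storage lives inside the interpreter):

```julia
const junk = []    # recycled frames, kept as a plain Vector

function recycle!(frame)
    # Dev-only check: catch double-recycling while we shake out the logic;
    # delete the assertion once we trust that it can't happen.
    @assert !any(x -> x === frame, junk) "frame recycled twice"
    push!(junk, frame)
    return nothing
end
```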