JuliaInterpreter.jl
Performance opportunities
In my initial performance work, the big three improvements were local method tables (`framecode.methodtables`), reusing frame data (the `junk` mechanism), and getting rid of runtime dispatch internally within the interpreter (by manual union-splitting, i.e., `if isa(x, ConcreteType) ...`).
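For reference, manual union-splitting replaces a dynamically dispatched call with explicit branches over the concrete types that occur in practice. A minimal sketch (the `resolve_*` helper names here are hypothetical, not the interpreter's actual internals):

```julia
using Core: SSAValue, SlotNumber

# Hypothetical helpers standing in for the interpreter's real lookup paths:
resolve_ssa(x::SSAValue)        = (:ssa, x.id)
resolve_slot(x::SlotNumber)     = (:slot, x.id)
resolve_other(@nospecialize(x)) = (:literal, x)

# Manual union-splitting: explicit `isa` branches replace one dynamically
# dispatched call, so each branch can be statically dispatched.
function lookup_value(@nospecialize(x))
    if isa(x, SSAValue)
        return resolve_ssa(x)
    elseif isa(x, SlotNumber)
        return resolve_slot(x)
    else
        return resolve_other(x)
    end
end
```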
Some possible future improvements:
- [ ] Recent changes diving into `Core._apply`, `Core._apply_latest`, and `invoke` hurt performance. Perhaps some of that might be clawed back by moving more of this work into `optimize!`?
- [x] `ccall`s are slow. I just pushed a `teh/compiled_ccall` branch that contains an old attempt to create one compiled function per `ccall`. I was most interested in seeing whether it would circumvent #28 (it didn't seem to), but I don't think I looked at its runtime implications (especially on a `ccall`-heavy workload). I suspect that branch is pretty close to usable, if anyone wants to pick it up and run with it.
- [ ] Specifically look for calls to `iterate` and do something special, i.e., (1) determine whether `iterate` could possibly hit a breakpoint or error (if not, then you're allowed to play tricks), (2) see if you can determine whether the loop will be long, and if so, (3) pass the iterator and frame to a special `run_loop` function. That would have to compile for each iterator type, so it is worth doing only if the loop is long. But you would save the time spent diving in and out of `iterate`, and that could be quite helpful.
- [x] Apply similar treatment to `arrayref`s. Even our builtin handling of `arrayref` could be improved by separately treating small dimensions, e.g., `if nargs == 1 ... elseif nargs == 2 ... elseif nargs == 3 ... else <varargs version> end` (see the sketch after this list).
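For the `arrayref` item, here is a minimal sketch of the small-dimension special-casing (the function name is hypothetical; the real handling lives in the interpreter's builtin dispatch):

```julia
# Sketch: branch on the (statically known) number of indices so the common
# 1-/2-/3-dimensional cases avoid the varargs path entirely.
function arrayref_fast(A::Array, inds...)
    nargs = length(inds)
    if nargs == 1
        return A[inds[1]]
    elseif nargs == 2
        return A[inds[1], inds[2]]
    elseif nargs == 3
        return A[inds[1], inds[2], inds[3]]
    else
        return A[inds...]          # varargs fallback
    end
end
```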
Obviously spending some time with ProfileView will be useful. I did a ton of that at the beginning and fixed almost everything that I noticed. But I haven't done it in a while, and we may have regressed in places as we expanded our functionality.
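For anyone picking this up, a typical profiling session might look like this (assuming ProfileView is installed; the workload is just an example):

```julia
using JuliaInterpreter, Profile, ProfileView

@interpret sum(rand(100))            # warm up so compilation doesn't dominate
Profile.clear()
@profile @interpret sum(rand(10^4))
ProfileView.view()                   # inspect where the interpreter spends time
```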
I took a stab at `teh/compiled_ccall`, and where I'm stuck now is functions that look like

```julia
unsafe_convert(::Type{Ptr{T}}, a::Array{T}) where {T} = ccall(:jl_array_ptr, Ptr{T}, (Any,), a)
```

I try to generate something like

```julia
unsafe_convert(::Type{Ptr{T}}, a::Array{T}) where {T} =
    JuliaInterpreter.CompiledCalls.ccall_jl_array_ptr(a, T)

function JuliaInterpreter.CompiledCalls.ccall_jl_array_ptr(arg1, ::Type{T}) where {T}
    return ccall(:jl_array_ptr, Ptr{T}, (Any,), arg1)
end
```
The problem is the insertion of the `Ptr{T}`. As an example, here are two expressions that look the same:
```julia
julia> expr_from_interpreter
:(function ccall_jl_array_ptr(arg1, ::Type{T}) where T
      return ccall(:jl_array_ptr, Ptr{T}, (Any,), arg1)
  end)

julia> expr_copy_paste
:(function ccall_jl_array_ptr(arg1, ::Type{T}) where T
      return ccall(:jl_array_ptr, Ptr{T}, (Any,), arg1)
  end)
```
However, looking a bit closer:
```julia
julia> dump(expr_from_interpreter.args[2].args[2].args[1].args[3])
Ptr{T} <: Ref{T}

julia> dump(expr_copy_paste.args[2].args[2].args[1].args[3])
Expr
  head: Symbol curly
  args: Array{Any}((2,))
    1: Symbol Ptr
    2: Symbol T
```
So in the first case the type `Ptr{T}` (carrying a free typevar) has been interpolated as a value, and it doesn't "bind" to the `T` in the function definition. The result is:
```julia
julia> eval(expr_from_interpreter)
ccall_jl_array_ptr (generic function with 1 method)

julia> ccall_jl_array_ptr([1,2], Int64)
ERROR: ccall return type Ptr should have an element type (not Ptr{<:T})
Stacktrace:
 [1] ccall_jl_array_ptr(::Array{Int64,1}, ::Type{Int64}) at C:\Users\Kristoffer\Debugging\JuliaInterpreter\src\optimize.jl:223
 [2] top-level scope at none:0
```
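The binding problem can be reproduced independently of JuliaInterpreter; here is a standalone demo (the `f1`/`f2` names are made up for illustration):

```julia
tv   = TypeVar(:T)
PtrT = Ptr{tv}                     # a type carrying a *free* typevar; prints as Ptr{T}

ex_interp = :(f1(::Type{T}) where {T} = $PtrT)    # interpolated type object
ex_curly  = :(f2(::Type{T}) where {T} = Ptr{T})   # Expr(:curly, :Ptr, :T)

eval(ex_interp); eval(ex_curly)
f1(Int)    # returns Ptr{T} -- the free typevar never bound to the method's T
f2(Int)    # returns Ptr{Int64}
```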
Interesting. I'm not sure I fully understand, but can you recurse down and check for `TypeVar`s, and replace them with the corresponding expression? (I may be off-base here; if so, perhaps some kind of demo I can run?)
EDIT: all that would happen in `optimize!`, of course.
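Something along the lines of that question might look like this rough sketch (a hypothetical helper, handling only directly free typevars in a `DataType`'s parameters):

```julia
# Recursively rewrite free TypeVars back into syntax so that, after eval,
# they bind to the method's `where` parameters.
rebind_typevars(@nospecialize(x)) = x            # literals, QuoteNodes, etc.
rebind_typevars(tv::TypeVar) = tv.name           # TypeVar T  =>  :T
function rebind_typevars(t::DataType)
    any(p -> p isa TypeVar, t.parameters) || return t
    # Ptr{T} with a free T  =>  :(Ptr{T}), i.e. Expr(:curly, :Ptr, :T)
    return Expr(:curly, nameof(t), map(rebind_typevars, collect(t.parameters))...)
end
rebind_typevars(ex::Expr) = Expr(ex.head, map(rebind_typevars, ex.args)...)
```

With something like this, `eval(rebind_typevars(expr_from_interpreter))` should define a method whose `Ptr{T}` binds properly.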
What is your opinion about pushing things to `compiled_methods` that are very commonly used but unlikely to be of interest for breakpoints?
For example, `iterate(a::Array)` (with `Core._apply` now being stepped through) makes us have to do

```
[1] promote_typeof(x) at promotion.jl:263
[2] promote_typeof(x, xs) at promotion.jl:264
[3] -(a, b) at int.jl:797
[4] iterate(A, i) at array.jl:705
[5] iterate(A) at array.jl:705
```
Is it a slippery slope to start doing this?
We could temporarily pop them from `compiled_methods` when someone tries to set a breakpoint in one of them. I think it does make sense to compile these kinds of methods by default.
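Concretely, the idea might look something like this (the `which` signature is just an example):

```julia
using JuliaInterpreter

# Mark a hot, breakpoint-uninteresting method as compiled:
m = which(iterate, Tuple{Vector{Int}})
push!(JuliaInterpreter.compiled_methods, m)

# ...and remove it again if someone wants to set a breakpoint inside it:
delete!(JuliaInterpreter.compiled_methods, m)
```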
I worry about that slope. Where do we stop? Will packages suddenly fill with `@require JuliaInterpreter begin push!(...) end` blocks?
I do propose special handling of `iterate` above, one that would effectively inline the calls to `iterate` for all containers, not just `::Array`.
Also, if we start using compiled frames, one thought: I think there are only two classes of slow-to-interpret methods, ones that iterate and ones that recurse. If we can detect those specifically, we might save ourselves compile time.
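A crude loop detector could, for instance, scan lowered code for backwards jumps (a sketch, assuming a recent Julia where lowered code uses `Core.GotoNode`/`Core.GotoIfNot`):

```julia
# Heuristic: a method probably iterates if its lowered code contains a
# jump to an earlier statement index.
function has_loop(ci::Core.CodeInfo)
    for (i, stmt) in enumerate(ci.code)
        if stmt isa Core.GotoNode && stmt.label <= i
            return true
        elseif stmt isa Core.GotoIfNot && stmt.dest <= i
            return true
        end
    end
    return false
end

f(n) = (s = 0; for i in 1:n; s += i; end; s)
has_loop(code_lowered(f, Tuple{Int})[1])    # true
```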
Here is a fun one:
```julia
julia> using JuliaInterpreter, Plots

julia> @time @interpret plot(rand(5))
  5.703063 seconds (39.36 M allocations: 2.545 GiB, 6.56% gc time)
```

vs

```julia
julia> using JuliaInterpreter, CSV, Plots # note the CSV load

julia> @time @interpret plot(rand(5))
 16.160595 seconds (51.27 M allocations: 2.868 GiB, 3.46% gc time)
```
Is it all method-table searches? (Basically, `whichtt` or related?)
I haven't profiled yet, just observed it.
I think this is the problem:
```julia
julia> using BenchmarkTools

julia> c = Some{Any}("foo")
Some("foo")

julia> @code_warntype convert(Some{Any}, c)
Body::Some{Any}
1 ─ %1 = (Base.getfield)(x, :value)::Any
│   %2 = %new(Some{Any}, %1)::Some{Any}
└──      return %2

julia> @btime convert(Some{Any}, $c)
  4.299 ns (1 allocation: 16 bytes)
Some("foo")

julia> using CSV

julia> @code_warntype convert(Some{Any}, c)
Body::Some{Any}
1 ─ %1 = $(Expr(:static_parameter, 1))::Core.Compiler.Const(Any, false)
│   %2 = (Base.getfield)(x, :value)::Any
│   %3 = (Base.convert)(%1, %2)::Any
│   %4 = (Some{Any})(%3)::Some{Any}
└──      return %4

julia> @btime convert(Some{Any}, $c)
  923.077 ns (1 allocation: 16 bytes)
Some("foo")
```
With https://github.com/JuliaLang/julia/pull/31536 we are at
```julia
julia> @time @interpret plot(rand(5))
  2.930957 seconds (13.54 M allocations: 481.296 MiB, 2.31% gc time)
```

for the non-CSV case and

```julia
julia> @time @interpret plot(rand(5))
  5.765768 seconds (15.13 M allocations: 507.931 MiB, 1.52% gc time)
```

with CSV. So much better, but still significantly slower.
Changing `junk` from a `Set` back to a `Vector`, I get
```julia
julia> @time (p = @interpret(plot(rand(5))); @interpret(display(p)))
  9.845403 seconds (64.34 M allocations: 2.011 GiB, 3.19% gc time)
```

vs master

```julia
julia> @time (p = @interpret(plot(rand(5))); @interpret(display(p)))
 12.480055 seconds (66.37 M allocations: 2.219 GiB, 2.56% gc time)
```
We changed this when we found the bug where executing the same snippet twice managed to get two of the same frames into the `Vector`. However, perhaps we can be smarter here than just using a full-blown `IdSet`. We don't end up with that many frames in `junk`, so we should be able to exploit that. Also, can the same `Frame` ever end up in `junk` twice if we empty it when we create the "first" frame from toplevel?
:+1: In theory it seems possible to design the logic such that we never put the same frame in twice. In that case we can just use a `Vector` without any kind of checking. For development purposes, we might want to start with checking and an assertion, and then strip that out once we find it's no longer necessary.
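A development-mode version of that might look like the following (hypothetical names; the real `junk` storage lives inside the interpreter):

```julia
const junk = []    # recycled frames, kept as a plain Vector

function recycle!(frame)
    # Dev-only check: catch double-recycling while we shake out the logic;
    # delete the assertion once we trust that it can't happen.
    @assert !any(x -> x === frame, junk) "frame recycled twice"
    push!(junk, frame)
    return nothing
end
```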