WIP: a new serialization format for optimizing & executing lowered IR
This is a draft of a new tokenization of lowered IR. The main goal is to split off from the representation currently used in Base so that we are free to perform more significant transformations of the IR to enable more performance optimizations.
This PR supports serialization, printing, and very limited execution. Lots is manual, and lots is broken, but I wanted to post it to collect early feedback. (This is built on top of #307 so this PR contains lots of irrelevant changes. To start with you should focus just on two files, src/serializer.jl and the demo in test/serialization.jl, or just look at the commits individually.)
A demo
In `test/` there's a demo script, `serialization.jl`. If you run it you get this output:
```
1: call zero methlist listargs <vec of length 1> loadparameter 1
32: storeslot 3
41: loadslot 2
50: storessa 2
59: call iterate methlist listargs <vec of length 1> loadssa 2
90: storeslot 4
99: callbuiltin === fixedargs 2 loadslot 4 nothingtok
116: storessa 4
125: callintrinsic xor_int listargs <vec of length 1> loadssa 4
148: storessa 5
157: gotoifnot loadssa 5 444
175: loadslot 4
184: storessa 7
193: callbuiltin getfield fixedargs 2 loadssa 7 int 1
218: storeslot 5
227: callbuiltin getfield fixedargs 2 loadssa 7 int 2
252: storessa 9
261: call + methlist listargs <vec of length 2> loadslot 3 loadslot 5
301: storeslot 3
310: call iterate methlist listargs <vec of length 2> loadssa 2 loadssa 9
350: storeslot 4
359: callbuiltin === fixedargs 2 loadslot 4 nothingtok
376: storessa 12
385: callintrinsic xor_int listargs <vec of length 1> loadssa 12
408: storessa 13
417: gotoifnot loadssa 13 444
435: goto 175
444: returntok loadslot 3
99
```
The final `99` is a consequence of the last few lines of the script, which step forward until reaching the line that begins `99: callbuiltin ...` (builtins are not yet supported). This comes from the lowered code for `summer`, which looks like this:
```julia
julia> src
CodeInfo(
1 ─       s = Main.zero($(Expr(:static_parameter, 1)))
│   %2  = A
│         @_4 = Base.iterate(%2)
│   %4  = @_4 === nothing
│   %5  = Base.not_int(%4)
└──       goto #4 if not %5
2 ┄ %7  = @_4
│         a = Core.getfield(%7, 1)
│   %9  = Core.getfield(%7, 2)
│         s = s + a
│         @_4 = Base.iterate(%2, %9)
│   %12 = @_4 === nothing
│   %13 = Base.not_int(%12)
└──       goto #4 if not %13
3 ─       goto #2
4 ┄       return s
)
```
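For reference, a definition consistent with this lowered code would be something like the following (the actual `summer` used in the demo isn't reproduced in this post, so treat this as an assumed reconstruction):

```julia
# Assumed reconstruction of `summer`: zero(T) of the static parameter, then an
# iterate-based loop, matching the CodeInfo shown above.
function summer(A::AbstractArray{T}) where T
    s = zero(T)
    for a in A
        s += a
    end
    return s
end
```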
From this you should be able to learn a lot about the serialization format I've designed, but for the benefit of all I've reproduced below the extended comments that appear at the beginning of src/serializer.jl:
A brief description of the serialization format
This uses a simple format, conceptually implementing a machine with the following properties:

- a tape of instructions called `ser`
- a single implicit "register" called `ans`
- the ability to execute operations specific to a particular instruction token in `ser`. Executing these operations may consume (nondestructively) future tokens.
Operations are conceptually of 4 categories:

- loads (which fill `ans` from a variety of sources), encoded by `load*` or literal tokens
- stores (which put `ans` somewhere more permanent), encoded by `store*` tokens
- calls (for which the return value is stored in `ans`)
- control-flow
The implementation of calls is allowed to store data to named local variables or lists, thus increasing the temporary storage beyond `ans`.
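To make the machine concrete, here is a toy sketch in Julia (the token names, the encoding, and the `run` helper are all illustrative assumptions; the real tape is byte-encoded, not a `Vector{Any}`):

```julia
# Toy model of the tape machine: `ser` is the instruction tape, `pc` the
# current position on it, and `ans` the single implicit register.
function run(ser::Vector{Any}, slots::Vector{Any})
    pc = 1
    ans = nothing
    while pc <= length(ser)
        tok = ser[pc]
        if tok === :literal            # load: fill `ans` with a literal from the tape
            ans = ser[pc+1]; pc += 2
        elseif tok === :loadslot       # load: fill `ans` from a slot
            ans = slots[ser[pc+1]]; pc += 2
        elseif tok === :storeslot      # store: put `ans` somewhere more permanent
            slots[ser[pc+1]] = ans; pc += 2
        elseif tok === :call1          # call: apply a 1-argument function to `ans`
            ans = ser[pc+1](ans); pc += 2
        elseif tok === :returntok      # control flow
            return ans
        end
    end
    return ans
end

# `ans` flows: 3 -> stored in slot 1 -> reloaded -> incremented -> returned (4)
run(Any[:literal, 3, :storeslot, 1, :loadslot, 1, :call1, x -> x + 1, :returntok],
    Vector{Any}(undef, 1))
```

Note how the `call1` and `storeslot` handlers consume tokens ahead of `pc`, mirroring the "may consume (nondestructively) future tokens" property above.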
The serialization of the lowered IR

```julia
%4 = atan(@3, 2.4)
```

might look something like this on the tape:

```
call atan_idx methlist fixedargs 2 loadslot 3 float64 2.4
storessa 4
```
where

- `call` is an instruction token signaling that the next operation is a function call (in reality, there are multiple call-type tokens for intrinsics, builtins, generics via the interpreter, generics via ordinary compiled dispatch, `invokelatest`, `Core._apply`, etc.)
- `atan_idx` is a token representing `atan`. It is encoded as an integer index into a table of functions (the table is maintained by the serializer)
- `methlist` is a pointer to a local method table, a performance optimization used for avoiding full-blown dispatch (this also stores whether the method should be called via the interpreter or the compiled path)
- `fixedargs 2` indicates that this call should use the path optimized for a particular (small) number of arguments, in this case 2. An alternative is `listargs args`, which packs an arbitrary number of arguments into a literally-encoded `args::Vector{Any}` stored (via its pointer) in `ser`. Contrary to this (fictitious) example, currently only builtins exploit `fixedargs`, since they are the only ones that will typically need runtime dispatch.
- `loadslot 3` indicates that the next argument (the first argument) is to be loaded from the slots at index 3
- `float64 2.4` indicates that the next argument (the second argument) is a literal value of type `Float64`, encoded in `ser` itself
- after the second argument, the function call is executed and the result is stored in `ans`
- `storessa 4` indicates that `ans` should be placed in `%4`
Call sites that use listargs currently have their own private args vector sized appropriately for that particular call site. Having one per site is almost certainly overkill. A better format would be to have framecode construction figure out what sizes the method needs and then have the framedata allocate a pool of different sizes. This would support multithreaded interpretation while also decreasing the total storage size. This would probably be one of the most urgent changes to make.
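A sketch of what such a pool might look like (all names here are hypothetical, not part of this PR):

```julia
# Hypothetical per-frame pool: framecode construction records which listargs
# sizes the method needs; framedata then allocates one reusable vector per size
# instead of one per call site.
struct ArgsPool
    bysize::Dict{Int,Vector{Any}}
end
ArgsPool(sizes) = ArgsPool(Dict(n => Vector{Any}(undef, n) for n in sizes))

# A call site needing n arguments borrows the size-n vector from its frame's pool.
argsvec(pool::ArgsPool, n::Integer) = pool.bysize[n]
```

Because each frame owns its pool, two threads interpreting the same method in different frames would not share these vectors, which is what would make this compatible with multithreaded interpretation.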
Potential performance benefits
Don't even think about timing things yet, since it's not finished enough to be meaningful. We need to support builtins/intrinsics and implement recursive calls.
But the serialization format already has some potential performance benefits. For example, one of our hottest methods is `maybe_evaluate_builtin`, which gets called even when `f` is not a builtin. With the new format, we decide "at compile time" (framecode-construction time) whether `f` is a builtin or something else and then "dispatch" (a big if/then block) to the appropriate method.
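That compile-time classification could look something like this (the `CallKind`/`classify` names are made up for illustration):

```julia
# Classify the callee once, at framecode-construction time, so the per-statement
# execution loop can branch on a stored tag instead of re-testing `f` each time.
@enum CallKind CALL_BUILTIN CALL_INTRINSIC CALL_GENERIC

function classify(@nospecialize f)
    # check intrinsics first, since they may also satisfy `isa Core.Builtin`
    f isa Core.IntrinsicFunction && return CALL_INTRINSIC
    f isa Core.Builtin           && return CALL_BUILTIN
    return CALL_GENERIC
end
```

The resulting tag is what would be written into the tape as one of the call-type tokens (`call`, `callbuiltin`, `callintrinsic`) shown in the demo output.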
In the longer run, as mentioned above we may be able to perform optimizations that would be incompatible with the tools we rely on for handling lowered IR. For example, the section
```
175: loadslot 4
184: storessa 7
193: callbuiltin getfield fixedargs 2 loadssa 7 int 1
218: storeslot 5
227: callbuiltin getfield fixedargs 2 loadssa 7 int 2
252: storessa 9
```
might be simplified using a couple of new tokens as something like the following:
```
175: loadslot 4
184: storevec storessa 7 getfieldtok 1 storeslot 5 getfieldtok 2 storessa 9
```
and we might be able to implement an optimized method that avoids having to create new frames in the default cases.
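One way the hypothetical `storevec`/`getfieldtok` tokens could execute (purely a sketch of the idea, with invented names):

```julia
# Execute a fused `storevec` sequence: `ans` holds the iterate result; each
# (token, index) pair either stores the current value or replaces it with a
# field of `ans`, avoiding separate callbuiltin getfield instructions.
function exec_storevec!(slots::Vector{Any}, ssa::Vector{Any}, @nospecialize(ans), ops)
    val = ans                       # current value to store
    for (tok, idx) in ops
        if tok === :getfieldtok
            val = getfield(ans, idx)
        elseif tok === :storeslot
            slots[idx] = val
        elseif tok === :storessa
            ssa[idx] = val
        end
    end
    return nothing
end
```

Running it on the token sequence above would store the whole tuple in `%7`, its first field in slot 5, and its second field in `%9`, all without re-entering the main dispatch loop.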
I don't know exactly where we want to head with this, but this might at least illustrate some possibilities.
The future
Just getting this far required that I put slightly more time into this than I can afford, so consider this post to be an invitation for others to run with it if interested. I've named the branch serialization rather than teh/serialization to explicitly disavow ownership. I'm well aware others will have their own agenda too, so if no one grabs it then it can just wait until I have time for it again. But that could be a while.
It's also worth noting that JeffB mentioned at JuliaCon that he had been wondering about doing something similar, so at some point (once this has been developed a bit further) we should probably ping him and see what he thinks of the format. It might be nice to share one format (at least in non-optimized form) between base & this package.
EDITS
I'd now change several things about this, esp. not storing pointers inline with the code. Just use a single long "stack" for all frames and pass in an offset when you enter into a new frame.
Really cool. Should we swap out the `TypeMapEntry` stuff for the `DispatchableMethod` in here already? Seems easier and potentially faster?
Quite likely. If you're interested, feel free to make changes to the struct. The goal is to make things as easy as possible. One thought would also be to organize these differently: for a call with two arguments, rather than having a list like this
```
Tuple{Int,Int}     => meth1
Tuple{Int16,Int}   => meth1
Tuple{Float64,Int} => meth2
Tuple{Int,Float64} => meth3
```
one might instead consider a tree:
```
├── first arg is one of Union{Int,Int16}
│   ├── second arg isa Int     => meth1
│   └── second arg isa Float64 => meth3
└── first arg isa Float64
    └── second arg isa Int     => meth2
```
It's not obvious that this is a good idea, since I bet >90% of all call sites inside loops (which is basically all we care about) dispatch to a single method, and it seems likely to be hard to beat the list in that case.
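For concreteness, the flat list might be searched like this (a sketch; `lookup` and the `Symbol` stand-ins for methods are invented here):

```julia
# Local method table as an in-order list of signature => method pairs. Call
# sites inside loops mostly hit one method, so keeping the hot entry first
# makes the linear scan hard to beat.
const MethList = Vector{Pair{DataType,Symbol}}

function lookup(table::MethList, @nospecialize(argtypes))
    for (sig, meth) in table
        argtypes <: sig && return meth
    end
    return nothing   # miss: fall back to full-blown dispatch
end
```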
In general, writing this up was a good exercise in re-thinking "where should this bit of information go?" An example is the `compiled` flag in `DispatchableMethod`: currently we allow either `Compiled()` or a `TypeMapEntry` to be stored in the local method table, but really I think we want both. When a given function has some methods that should be run in interpreted mode and others that must be run in compiled mode, you'd like to be able to exploit a local method table to quickly figure out which method is applicable and then run it in the appropriate mode. (I'm actually not quite sure what happens in that case currently; it might be worth testing.)
Other good ideas that could potentially be "stolen" from this and applied to our current infrastructure include checking whether something is a builtin/intrinsic at the time of framecode construction, rather than having to call `maybe_evaluate_builtin` on each iteration of a loop. We can't write it into the serialized code, but we could add yet another array to the framecode. Builtins & intrinsics don't support multiple methods, so this is something that can be checked from the expression alone.
Should we define some "calling convention" for this serialized IR? So that the function that gets called knows how to look up the arguments without having to put them into a vector, pass it to the next frame, and push them into locals?
A bit related to the comment at https://github.com/JuliaDebug/JuliaInterpreter.jl/pull/308#issuecomment-516365401.
Is the reason for storing the ssastores explicitly that some ssavalues are not used (`is_used`)? Otherwise, they could potentially just be implicit, since every codeinfo statement stores to its corresponding ssaslot.
> Should we define some "calling convention" for this serialized IR? So that the function that gets called knows how to look up the arguments without having to put them into a vector, pass it to the next frame, and push them into locals? A bit related to the comment at #308 (comment).
Yes, I think if we do it the way outlined in that comment, that basically becomes the calling convention.
> Is the reason for storing the ssastores explicitly that some ssavalues are not used (`is_used`)? Otherwise, they could potentially just be implicit, since every codeinfo statement stores to its corresponding ssaslot.
One option here is to renumber them so that you only have as many SSAValues as you use. See https://github.com/JuliaDebug/JuliaInterpreter.jl/blob/f9ebc94b5d0c649a55621912494309837cf9e828/src/interpret.jl#L6 and https://github.com/JuliaDebug/JuliaInterpreter.jl/blob/f9ebc94b5d0c649a55621912494309837cf9e828/src/interpret.jl#L537-L542
It's probably a slowdown to check this for every single statement. It's also a slowdown to save results to ssavalues that won't be used. If you can rewrite the IR it seems you can have the best of both worlds.
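A minimal sketch of that renumbering (a hypothetical helper; a real implementation would also have to rewrite every `SSAValue` reference in the statements):

```julia
# Map original SSA indices to a compact numbering: unused values get 0 and no
# storage slot; used values get consecutive slots 1..n.
function renumber_ssa(used::AbstractVector{Bool})
    remap = zeros(Int, length(used))
    next = 0
    for i in eachindex(used)
        if used[i]
            next += 1
            remap[i] = next
        end
    end
    return remap, next   # `next` is the number of SSA slots actually needed
end
```

With this, the interpreter neither checks `is_used` per statement nor stores results that nothing reads: statements whose remapped index is 0 simply get no `storessa` token.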