Investigate lambda compilation performance
Lambdas might be a source of compiler performance problems. The parsing.combinators library makes heavy use of them and compiles impractically slowly, and the test module used lambdas internally and also suffered compiler performance problems until @agemogolk changed it not to use lambdas.
I have distmonad, which uses lambdas heavily. It takes 8 seconds to compile on my machine, and I tried to reduce that time.
Lambdas are currently converted as follows:
- This code:

```
foo(x) = y -> { println(x); println(y); };
main() {
    foo(1)("x");
    foo("x")(1);
    foo(1)(1);
    foo("x")("x");
}
```
- After `initializeLambda` in `lambdas.cpp`, it becomes this:

```
record LambdaInt(x: Int);
record LambdaStr(x: Str);
foo(x : Int) = LambdaInt(x);
foo(x : String) = LambdaStr(x);
overload call(closure : LambdaInt, y) { println(closure.x^); println(y); }
overload call(closure : LambdaStr, y) { println(closure.x^); println(y); }
```
- Finally, while `main` is being codegenned, each call is codegenned to two functions:

```
record LambdaInt(x: Int);
record LambdaStr(x: Str);
foo_int(x : Int) = LambdaInt(x);
foo_str(x : String) = LambdaStr(x);
call1(closure : LambdaInt, y : Int) { println_int(closure.x^); println_int(y); }
call2(closure : LambdaStr, y : Int) { println_str(closure.x^); println_int(y); }
call3(closure : LambdaInt, y : String) { println_int(closure.x^); println_str(y); }
call4(closure : LambdaStr, y : String) { println_str(closure.x^); println_str(y); }
main() {
    call3(foo_int(1), "x");
    call2(foo_str("x"), 1);
    call1(foo_int(1), 1);
    call4(foo_str("x"), "x");
}
```
The `'x'` -> `'closure.x^'` conversion needs AST modification. It happens in `convertFreeVars` in `lambdas.cpp`. I tried to get rid of the AST modification by adding a line to the call body, following @jckarter's advice:

```
overload call(Lambda closure, y) {
    forward x = ..unpackMultiValuedFreeVarAndDereference(closure);
    ...lambda body as is: println(x); println(y); ...
}
```
It didn't improve performance. I also tried to look up free variables only once per lambda by adding memoization somewhere here. That didn't help either; compilation still takes 8 seconds.
It seems that the `lambdas.cpp` module isn't the reason for poor lambda compilation performance. I'm not very experienced in profiling, but callgrind shows that `initializeLambda` is called 100 times during `distmonad` compilation, while `initCallable` is called 10000 times. But maybe these are only `distmonad` results, and the parsing combinators library would show different results. If someone could profile more accurately, that would be nice. AST modification can be disabled with the `-only-overloaded-lambdas` command line flag in this branch.
@galchinsky please could you post the `-timing` output for your distmonad example? I'm wondering whether the long compile time is down to the optimizer. Ta.
```
$ clay -timing main.clay
load time = 2307 ms
compile time = 8080 ms
optimization time = 7603 ms
codegen time = 1547 ms
$ clay -timing main.clay -only-overloaded-lambdas
load time = 2270 ms
compile time = 8463 ms
optimization time = 7981 ms
codegen time = 1643 ms
```
Ah ok, I'd assumed you meant 8 secs for the whole thing. I did a little bit of profiling on the compiler innards a while ago and found no significant issues with actual lambda processing. What I did find is that around 30%-90% of 'compile' time is spent evaluating predicates (based on some of the tests/examples), with the higher percentages coming from heavily generic code such as parsing/combinators/generic, remote/marshalling, and printing/repr. Maybe the poor performance of lambda-based code is related?
e.g.

```
> compiler/clay ../test/lib-clay/printing/repr/test.clay -timing
load time = 141 ms
- pred time = 37 ms
compile time = 1806 ms
- pred time = 1698 ms
optimization time = 2074 ms
codegen time = 665 ms
```
How can I turn on the `pred time` option in timing? I'd like to test distmonad.
Is predicate evaluation done by interpretation? If it is, maybe there's a way to JIT it?
`pred time` isn't an actual option; it's from my own hacky profiling branch. It's very easy to implement though: just wrap a `HiResTimer` around the predicate evaluation bit of `matchInvoke`. I can push the branch if you want to try merging it?
I've put up my 'profile' branch.
I haven't delved too deeply into the evaluator, maybe @jckarter would know whether JITing is an option.
Yep, I requested the branch.
A JIT could speed up some complicated predicates if they are interpreted now. I don't know about the usual cases, like a list of `CallDefined?`.
@kssreeram originally wrote the compile-time evaluator using the JIT, but the added overhead of compilation actually made it slower than the current interpreter at the time. IIRC his implementation was fairly naive, and if you were smarter about caching generated IR or machine code you might be able to improve performance over the interpreter. The current interpreter, however, has what I think are some relatively low-hanging fruit that could help with performance. For one, all values are allocated and refcounted from the heap, even those with stack extent. Allocating and deallocating variables with local extent off a special stack would reduce a bit of overhead there. Another thing that could help would be to partially compile by flattening the AST into a CFG-like structure and interpret the CFG instead of interpreting by walking the AST directly.
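The stack-extent idea could be sketched as a bump allocator. This is a hypothetical illustration in C++ (the compiler's implementation language); `EvalStack` and its methods are invented names, not the actual Clay evaluator API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: values with local extent get bump-allocated off a
// dedicated stack, instead of being heap-allocated and refcounted.
// A mark()/reset() pair frees everything allocated in between in O(1).
class EvalStack {
    std::vector<char> buf;
    std::size_t top = 0;
public:
    explicit EvalStack(std::size_t capacity) : buf(capacity) {}

    std::size_t mark() const { return top; }
    void reset(std::size_t m) { top = m; }

    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        // Round the top of the stack up to the requested alignment.
        std::size_t aligned = (top + align - 1) & ~(align - 1);
        assert(aligned + size <= buf.size() && "evaluator stack overflow");
        top = aligned + size;
        return buf.data() + aligned;
    }
};
```

The evaluator would take a mark on entry to a compile-time function and reset on exit, so locals never touch the refcounting machinery.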
```
$ clay -timing main.clay
load time = 2318 ms
- pred time = 514 ms
compile time = 7717 ms
- pred time = 7934 ms
optimization time = 6430 ms
codegen time = 1349 ms
```
A little surprising that compile pred time is greater than actual compile time. Maybe my overly simple timing mechanism is faulty . . .
Added very stupid memoization to `matchInvoke` (just put the string representation of expr+env into a `String -> Bool` map, to avoid `evaluateBool` calls with the same params: https://gist.github.com/4051770#L64), won 300 msec.

With memoization:

```
compile time = 7387 ms
- pred time = 7530 ms
```

Without:

```
compile time = 7669 ms
- pred time = 7887 ms
```

`matchInvoke` is called 12609 times. Loads from the map happen 3570 times out of 12609. Probably smarter caching could help.
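For illustration, the string-keyed memoization could look roughly like this in C++. Everything here is a hypothetical sketch of the approach in the gist; the real `matchInvoke` works on ASTs and environments, not plain strings, and `PredicateCache` is an invented name:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical sketch: cache boolean predicate results keyed by the printed
// representation of (expr, env), so evaluateBool isn't re-run for inputs
// the interpreter has already seen.
class PredicateCache {
    std::unordered_map<std::string, bool> cache;
public:
    std::size_t hits = 0, misses = 0;

    bool evaluate(const std::string& key,
                  const std::function<bool()>& evaluateBool) {
        auto it = cache.find(key);
        if (it != cache.end()) {
            ++hits;
            return it->second;      // skip the expensive interpreter call
        }
        ++misses;
        bool result = evaluateBool();
        cache.emplace(key, result);
        return result;
    }
};
```

This is only sound if the predicate's result is fully determined by the key; as discussed below, compile-time functions can in theory touch global state, so a real cache would need to be more careful.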
Questions: Is every function pure at compile time? Is the result of `Something?(CertainType)` cached, or reinterpreted in every predicate that has `Something?(T)`?
Technically no; compile-time functions can in theory access global state, though in practice this doesn't really work. You could think of compile-time functions as being pure in their inputs and their modified inputs and outputs; for instance, `inc(x)` will always modify `x` to the same value for any input `x`. There's already some equally lame caching in analysis for boolean predicates, so having the same thing for the evaluator would at least not make things worse.
@galchinsky the obvious flaw in the timing is a possible recursive call to `matchInvoke`. It may be significant.
I fixed the timing bug and my previous example now gives (not using your caching):
```
> compiler/clay ../test/lib-clay/printing/repr/test.clay -timing
load time = 130 ms
- pred time = 15 ms
compile time = 1773 ms
- pred time = 1016 ms
optimization time = 2097 ms
codegen time = 663 ms
```
Worst case now seems to be around 60% of compile time, more reasonable, but still quite significant.
It is ~4600 ms after fixing; caching reduces the time to ~4300 ms. 6%, not so bad. I made an RAII version of the timer.
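An RAII timer that also avoids the recursion double-counting flaw mentioned above might look like this. This is a sketch, not the actual branch: `PredTimer` is a hypothetical name, it assumes C++17 inline statics, and it is not thread-safe:

```cpp
#include <chrono>

// Hypothetical sketch: an RAII timer for the predicate-evaluation path.
// Only the outermost frame accumulates into the total, so recursive
// matchInvoke calls are not counted twice.
class PredTimer {
    using Clock = std::chrono::steady_clock;
    static inline long long totalNs = 0;
    static inline int depth = 0;          // current nesting level
    Clock::time_point start;
public:
    PredTimer() {
        if (depth++ == 0)
            start = Clock::now();         // only the outermost frame times
    }
    ~PredTimer() {
        if (--depth == 0)
            totalNs += std::chrono::duration_cast<std::chrono::nanoseconds>(
                           Clock::now() - start).count();
    }
    static long long totalNanoseconds() { return totalNs; }
};
```

Usage would be a single `PredTimer t;` declaration at the top of the predicate-evaluation block; the destructor records the elapsed time on every exit path, which is the point of making it RAII.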