Investigate lambda compilation performance
Lambdas might be a source of compiler performance problems. The parsing.combinators library makes heavy use of them and compiles impractically slowly, and the test module used lambdas internally and also suffered compiler performance problems until @agemogolk changed it not to use lambdas.
I have distmonad, which uses lambdas heavily. It takes 8 seconds to compile on my machine, and I tried to reduce that time.
Lambdas are currently converted as follows:
- This code:

```
foo(x) = y -> { println(x); println(y); };
main() {
    foo(1)("x");
    foo("x")(1);
    foo(1)(1);
    foo("x")("x");
}
```
- After `initializeLambda` in `lambdas.cpp`, it becomes this:

```
record LambdaInt(x: Int);
record LambdaStr(x: Str);
foo(x : Int) = LambdaInt(x);
foo(x : String) = LambdaStr(x);
overload call(closure : LambdaInt, y) { println(closure.x^); println(y); }
overload call(closure : LambdaStr, y) { println(closure.x^); println(y); }
```
- Finally, while `main` is being codegenned, each call is codegenned to two functions:

```
record LambdaInt(x: Int);
record LambdaStr(x: Str);
foo_int(x : Int) = LambdaInt(x);
foo_str(x : String) = LambdaStr(x);
call1(closure : LambdaInt, y : Int) { println_int(closure.x^); println_int(y); }
call2(closure : LambdaStr, y : Int) { println_str(closure.x^); println_int(y); }
call3(closure : LambdaInt, y : String) { println_int(closure.x^); println_str(y); }
call4(closure : LambdaStr, y : String) { println_str(closure.x^); println_str(y); }
main() {
    call3(foo_int(1), "x");
    call2(foo_str("x"), 1);
    call1(foo_int(1), 1);
    call4(foo_str("x"), "x");
}
```
The `'x'` -> `'closure.x^'` conversion needs AST modification. It happens in `convertFreeVars` in `lambdas.cpp`. I tried to get rid of the AST modification by adding a line to the call body, following @jckarter's advice:

```
overload call(Lambda closure, y) {
    forward x = ..unpackMultiValuedFreeVarAndDereference(closure);
    ...lambda body as is: println(x); println(y); ...
}
```
It didn't improve performance. I also tried to look up free variables only once per lambda by adding memoization somewhere here. That didn't help either; compilation still takes 8 seconds.
It seems that the `lambdas.cpp` module isn't the reason for poor lambda compilation performance. I'm not very experienced in profiling, but callgrind shows that `initializeLambda` is called 100 times during `distmonad` compilation, while `initCallable` is called 10000 times. But maybe these are only `distmonad` results, and the parsing combinators library would show different results. If someone could profile more accurately, that would be nice. AST modification can be disabled with the `-only-overloaded-lambdas` command line flag in this branch.
@galchinsky please could you post the `-timing` output for your distmonad example? I'm wondering whether the long compile time is down to the optimizer. Ta.
```
$ clay -timing main.clay
load time = 2307 ms
compile time = 8080 ms
optimization time = 7603 ms
codegen time = 1547 ms
$ clay -timing main.clay -only-overloaded-lambdas
load time = 2270 ms
compile time = 8463 ms
optimization time = 7981 ms
codegen time = 1643 ms
```
Ah ok, I'd assumed you meant 8 secs for the whole thing. I did a little bit of profiling on the compiler innards a while ago and found no significant issues with actual lambda processing. What I did find is that around 30%-90% of 'compile' time is spent evaluating predicates (based on some of the tests/examples), with the higher percentages coming from heavily generic code such as parsing/combinators/generic, remote/marshalling, and printing/repr. Maybe the poor performance of lambda-based code is related?
e.g.

```
> compiler/clay ../test/lib-clay/printing/repr/test.clay -timing
load time = 141 ms
- pred time = 37 ms
compile time = 1806 ms
- pred time = 1698 ms
optimization time = 2074 ms
codegen time = 665 ms
```
How can I turn on the `pred time` option in timing? I'd like to test distmonad.
Is predicate evaluation done by interpretation? If it is, maybe there's a way to JIT it?
`pred time` isn't an actual option; it's from my own hacky profiling branch. It's very easy to implement though: just wrap a `HiResTimer` around the predicate evaluation bit of `matchInvoke`. I can push the branch if you want to try merging it?
I've put up my 'profile' branch.
I haven't delved too deeply into the evaluator, maybe @jckarter would know whether JITing is an option.
Yep, I requested the branch.
A JIT could speed up some complicated predicates if they are interpreted now. I don't know about the usual cases, like a list of `CallDefined?`.
@kssreeram originally wrote the compile-time evaluator using the JIT, but the added overhead of compilation actually made it slower than the current interpreter at the time. IIRC his implementation was fairly naive, and if you were smarter about caching generated IR or machine code you might be able to improve performance over the interpreter. The current interpreter, however, has what I think are some relatively low-hanging fruit that could help with performance. For one, all values are allocated and refcounted from the heap, even those with stack extent. Allocating and deallocating variables with local extent off a special stack would reduce a bit of overhead there. Another thing that could help would be to partially compile by flattening the AST into a CFG-like structure and interpret the CFG instead of interpreting by walking the AST directly.
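The stack-extent idea could be sketched as a bump allocator. This is a hypothetical illustration in C++ (the compiler's implementation language); `EvalStack` and its methods are invented names, not the actual Clay evaluator API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: values with local extent get bump-allocated off a
// dedicated stack, instead of being heap-allocated and refcounted.
// A mark()/reset() pair frees everything allocated in between in O(1).
class EvalStack {
    std::vector<char> buf;
    std::size_t top = 0;
public:
    explicit EvalStack(std::size_t capacity) : buf(capacity) {}

    std::size_t mark() const { return top; }
    void reset(std::size_t m) { top = m; }

    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        // Round the top of the stack up to the requested alignment.
        std::size_t aligned = (top + align - 1) & ~(align - 1);
        assert(aligned + size <= buf.size() && "evaluator stack overflow");
        top = aligned + size;
        return buf.data() + aligned;
    }
};
```

The evaluator would take a mark on entry to a compile-time function and reset on exit, so locals never touch the refcounting machinery.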
```
$ clay -timing main.clay
load time = 2318 ms
- pred time = 514 ms
compile time = 7717 ms
- pred time = 7934 ms
optimization time = 6430 ms
codegen time = 1349 ms
```
A little surprising that compile pred time is greater than actual compile time. Maybe my overly simple timing mechanism is faulty . . .
Added very stupid memoization to `matchInvoke` (just put the string representation of expr+env into a `String -> Bool` map, to avoid `evaluateBool` calls with the same params: https://gist.github.com/4051770#L64), won 300 msec.

With memoization:

```
compile time = 7387 ms
- pred time = 7530 ms
```

Without:

```
compile time = 7669 ms
- pred time = 7887 ms
```

`matchInvoke` is called 12609 times. Loads from the map happen 3570 times out of 12609. Probably smarter caching could help.
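For illustration, the string-keyed memoization could look roughly like this in C++. Everything here is a hypothetical sketch of the approach in the gist; the real `matchInvoke` works on ASTs and environments, not plain strings, and `PredicateCache` is an invented name:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical sketch: cache boolean predicate results keyed by the printed
// representation of (expr, env), so evaluateBool isn't re-run for inputs
// the interpreter has already seen.
class PredicateCache {
    std::unordered_map<std::string, bool> cache;
public:
    std::size_t hits = 0, misses = 0;

    bool evaluate(const std::string& key,
                  const std::function<bool()>& evaluateBool) {
        auto it = cache.find(key);
        if (it != cache.end()) {
            ++hits;
            return it->second;      // skip the expensive interpreter call
        }
        ++misses;
        bool result = evaluateBool();
        cache.emplace(key, result);
        return result;
    }
};
```

This is only sound if the predicate's result is fully determined by the key; as discussed below, compile-time functions can in theory touch global state, so a real cache would need to be more careful.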
Questions: Is every function pure at compile time? Is the result of `Something?(CertainType)` cached, or reinterpreted in every predicate that has `Something?(T)`?
Technically no; compile-time functions can in theory access global state, though in practice this doesn't really work. You could think of compile-time functions as being pure in their inputs and their modified inputs and outputs; for instance, `inc(x)` will always modify `x` to the same value for any input `x`. There's already some equally lame caching in analysis for boolean predicates, so having the same thing for the evaluator would at least not make things worse.
@galchinsky the obvious flaw in the timing is a possible recursive call to `matchInvoke`. It may be significant.
I fixed the timing bug and my previous example now gives (not using your caching):
```
> compiler/clay ../test/lib-clay/printing/repr/test.clay -timing
load time = 130 ms
- pred time = 15 ms
compile time = 1773 ms
- pred time = 1016 ms
optimization time = 2097 ms
codegen time = 663 ms
```
Worst case now seems to be around 60% of compile time, more reasonable, but still quite significant.
It is ~4600 ms after fixing; caching reduces the time to ~4300 ms. 6%, not so bad. I made an RAII version of the timer.
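An RAII timer that also avoids the recursion double-counting flaw mentioned above might look like this. This is a sketch, not the actual branch: `PredTimer` is a hypothetical name, it assumes C++17 inline statics, and it is not thread-safe:

```cpp
#include <chrono>

// Hypothetical sketch: an RAII timer for the predicate-evaluation path.
// Only the outermost frame accumulates into the total, so recursive
// matchInvoke calls are not counted twice.
class PredTimer {
    using Clock = std::chrono::steady_clock;
    static inline long long totalNs = 0;
    static inline int depth = 0;          // current nesting level
    Clock::time_point start;
public:
    PredTimer() {
        if (depth++ == 0)
            start = Clock::now();         // only the outermost frame times
    }
    ~PredTimer() {
        if (--depth == 0)
            totalNs += std::chrono::duration_cast<std::chrono::nanoseconds>(
                           Clock::now() - start).count();
    }
    static long long totalNanoseconds() { return totalNs; }
};
```

Usage would be a single `PredTimer t;` declaration at the top of the predicate-evaluation block; the destructor records the elapsed time on every exit path, which is the point of making it RAII.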