
JIT compilation speed between MIR and LLVM

xujuntwt95329 opened this issue on Apr 17, 2020 · 21 comments

Hi,

I'm greatly attracted by the 100x faster compilation speed shown in your post, as we need a fast and lightweight JIT compiler in our wasm-micro-runtime. It seems the speed comparison covers the entire compilation period (including startup and many other phases)? Do you have any comparison data on the time from IR to machine code between MIR (MIR --> machine code) and LLVM (LLVM IR --> machine code)?

Recently I have done some work to integrate MIR into wasm-micro-runtime. I use llvm2mir to convert LLVM IR to MIR, and most of my cases work well. However, I find that the compilation speed of MIR is not noticeably better than LLVM's:

| compiler framework | compilation time (ms) |
| --- | --- |
| LLVM | 23339 |
| LLVM (with FastISel) | 13040 |
| MIR | 21979 |

MIR is only 1.06x faster than LLVM, and even slower than LLVM with the FastISel option.

The llvmir->mir conversion time is not included in these numbers.

So I've got a few questions:

  1. The data I get from my case is far from "100x faster than GCC". I wonder whether my numbers are plausible, or whether I did something wrong to cause this. I think the most time-consuming part of compilation is the optimization passes, so I was expecting MIR codegen (including optimization) to run nearly 100x faster than LLVM.
  2. As far as I know, wasmtime, which uses Cranelift as its JIT compiler, can finish compilation for this case in 200 ms. Is there any chance of getting such high compilation speed in MIR?

Thanks!

xujuntwt95329 · Apr 17 '20 14:04

> So I've got a few questions:
>
> 1. The data I get from my case is far from `"100x faster than GCC"`. I wonder whether my numbers are plausible, or whether I did something wrong to cause this.

Thanks for reporting this and starting this issue.

No, I don't think you did anything wrong.

It would be interesting to see your test case; then I can say more. It would also be useful to have this test (with MIR and LLVM IR) for optimizing the MIR-generator algorithms.

My major motivation for the MIR project is to implement a Ruby tier-1 JIT compiler. Most Ruby methods (including Ruby blocks) are small, just a few lines, and each translates into 10-20 lines of C. The current Ruby JIT works by translating Ruby method bytecode into C and then calling GCC/LLVM to get machine code. It is very slow because of the very big startup time of GCC/LLVM.

So the case I focus on is small code. I use the sieve program to compare compilation speed between GCC and the MIR-generator, and that is where I get the 100x speedup.

According to your data, your code is probably 300 times bigger than the sieve I use. At that size, startup time is not important, so you are comparing raw compilation speed. For raw compilation speed, I think MIR will have worse numbers. Why? Because the MIR-generator was designed for compiling small functions (the most common case) and to have very simple optimization algorithms (trading speed for small source code). The same LLVM/GCC optimization should always be faster, because their developers (including me) spend a huge amount of time making them faster (on small and huge programs alike), making them work for every edge case, etc. For example, GCC's RA is 40K lines of C vs. 500 lines in the MIR-generator. GCC's RA uses many different structures (e.g., several complicated representations of the conflict graph) to make it scalable and fast on huge functions.

> I think the most time-consuming part of compilation is the optimization passes, so I was expecting MIR codegen (including optimization) to run nearly 100x faster than LLVM.

For compilation of small code, the bottleneck is not the optimizations. I'd recommend reading my blog post: https://developers.redhat.com/blog/2020/01/20/mir-a-lightweight-jit-compiler-project/

or seeing my presentation at RubyKaigi 2019.

> 2. As far as I know, wasmtime, which uses `Cranelift` as its JIT compiler, can finish compilation for this case in 200 ms. Is there any chance of getting such high compilation speed in MIR?

I think that to achieve this speed on your test case, some MIR optimizations would have to be changed to be more scalable and fast on big functions (although that would increase the MIR-generator source code significantly). Maybe I do something stupid in the optimization of big functions that can be fixed easily. In any case, having your test case (and knowing what optimizations you used for LLVM) would help me analyze what should be done to improve MIR-generator speed on bigger code.

To be honest, I have not worked seriously on improving MIR-generator speed and generated-code performance yet. That is serious work, and I am just at the stage of writing an initial version of working code.

vnmakarov · Apr 17 '20 18:04

> 2. As far as I know, wasmtime, which uses `Cranelift` as its JIT compiler, can finish compilation for this case in 200 ms. Is there any chance of getting such high compilation speed in MIR?

200 ms means that MIR would have to be sped up 100 times. It is hard for me to say how you measured this and what your test case actually is. Unless I am doing something stupid, I don't think MIR can be sped up 100 times. Also, I don't think Cranelift can be that much faster than MIR and LLVM (w/o optimizations). I can probably speed up MIR 3-4 times (2 times by running only RA and code selection, and another 2 times by speeding up code selection).

I am also interested in what optimizations you used for LLVM. Is it the full set (LLVM can switch on hundreds of optimization passes)? My answer in the previous post assumes that you used LLVM without optimizations, only with different code-selection algorithms.

By the way, did you use the MIR textual representation as input, or did you start the MIR-generator from MIR in memory?

vnmakarov · Apr 17 '20 22:04

Many thanks for your detailed and wonderful explanation.

> It would be interesting to see your test case; then I can say more. It would also be useful to have this test (with MIR and LLVM IR) for optimizing the MIR-generator algorithms.

I've tried to save the textual LLVM IR and MIR, but they are too huge to upload, so I found a smaller one: test_case.zip

> According to your data, your code is probably 300 times bigger than the sieve I use. At that size, startup time is not important, so you are comparing raw compilation speed. For raw compilation speed, I think MIR will have worse numbers.

Yes, our IR is converted from wasm, so the size of the functions is determined by the source-level language. Besides, to keep the sandbox mechanism for wasm, we need to insert some blocks into the IR to do some checks. I tried outputting the LLVM IR bitcode of my case; it's more than 11 MB, which should be much bigger than sieve.
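To illustrate why these checks inflate the IR, here is a rough C-level equivalent of the guard a sandboxed linear-memory load needs (a hypothetical sketch; the actual LLVM IR WAMR emits differs):

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical trap handler; a real runtime would unwind to the embedder. */
    extern void wasm_trap_out_of_bounds (void);

    /* Every wasm load/store gets a guard like this, i.e. an extra
       compare-and-branch block per memory access in the generated IR. */
    static inline uint32_t sandboxed_load_i32 (uint8_t *linear_mem,
                                               uint64_t mem_size,
                                               uint32_t addr) {
      if ((uint64_t) addr + sizeof (uint32_t) > mem_size)
        wasm_trap_out_of_bounds ();
      uint32_t v;
      memcpy (&v, linear_mem + addr, sizeof v); /* unaligned-safe load */
      return v;
    }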

So, as per my understanding, the reasons MIR isn't much faster than LLVM on my case are:

  1. The functions in my IR are so big that startup time is only a small part of the total compilation time, which weakens MIR's startup-time advantage.
  2. Although there are fewer passes in MIR, the passes in LLVM run much faster than MIR's, especially on big functions, so MIR's speedup over LLVM is not as high as expected.

Is my understanding correct and complete?

> 200 ms means that MIR would have to be sped up 100 times. It is hard for me to say how you measured this and what your test case actually is. Unless I am doing something stupid, I don't think MIR can be sped up 100 times.

This speed is really amazing. Maybe I can check this later.

> I am also interested in what optimizations you used for LLVM. Is it the full set (LLVM can switch on hundreds of optimization passes)? My answer in the previous post assumes that you used LLVM without optimizations, only with different code-selection algorithms.

This is my approach:

    LLVM JIT:  wasm bytecode  -->  llvmir  -->  llvm codegen
    MIR JIT:   wasm bytecode  -->  llvmir  -->  mir  -->  mir codegen

As for the LLVM IR, we use unoptimized LLVM IR for both the LLVM and MIR codegen, because the optimized IR causes a register-conflict problem in MIR's codegen. For the codegen itself, we use LLVMCodeGenLevelAggressive for LLVM, which should be the highest level.
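For reference, this is roughly how that codegen level is selected through the LLVM C API (a minimal sketch; the triple and CPU strings are placeholders):

    #include <llvm-c/TargetMachine.h>

    /* Create a target machine at the highest codegen level; the other
       LLVMCodeGenOptLevel values are None, Less, and Default. */
    LLVMTargetMachineRef make_target_machine (LLVMTargetRef target) {
      return LLVMCreateTargetMachine (target,
                                      "x86_64-unknown-linux-gnu", /* placeholder triple */
                                      "generic", "",              /* CPU, features */
                                      LLVMCodeGenLevelAggressive,
                                      LLVMRelocDefault,
                                      LLVMCodeModelJITDefault);
    }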

> By the way, did you use the MIR textual representation as input, or did you start the MIR-generator from MIR in memory?

I use the MIR in memory:

    mir_ctx = MIR_init ();                  /* create a fresh MIR context */
    mir_module = llvm2mir (mir_ctx, comp_ctx->module /* LLVM Module */);
    MIR_load_module (mir_ctx, mir_module);  /* register the module with the context */
    // ...

Thanks again!

xujuntwt95329 · Apr 18 '20 15:04

> I've tried to save the textual LLVM IR and MIR, but they are too huge to upload, so I found a smaller one: test_case.zip

Thank you for your test case. I really appreciate it. It is interesting; I did not expect that MIR would be used for such big code. It is a 500K-line module with about 2-3K lines per function.

> Yes, our IR is converted from wasm, so the size of the functions is determined by the source-level language. Besides, to keep the sandbox mechanism for wasm, we need to insert some blocks into the IR to do some checks. I tried outputting the LLVM IR bitcode of my case; it's more than 11 MB, which should be much bigger than sieve.

> So, as per my understanding, the reasons MIR isn't much faster than LLVM on my case are:
>
> 1. The functions in my IR are so big that startup time is only a small part of the total compilation time, which weakens MIR's startup-time advantage.

Yes. That is the reason why MIR is not faster than LLVM here.

> 2. Although there are fewer passes in MIR, the passes in LLVM run much faster than MIR's, especially on big functions, so MIR's speedup over LLVM is not as high as expected.

I would say that the compilation speed of MIR should be somewhere between LLVM -O0 and LLVM -O1. To be more specific, the same set of optimizations in LLVM should be a bit faster than MIR's, because LLVM is more refined. But I have not worked on improving MIR generation speed yet; maybe after that I can get closer to the speed of LLVM's optimizations.

Also, some of my optimizations probably do not scale well to bigger functions, and I should work on algorithmic improvements there. A lot of compiler optimizations are not O(n); sometimes they are O(n^2) or even worse. The test case you sent me could help me with this work.

> Is my understanding correct and complete?

I believe so.

>> 200 ms means that MIR would have to be sped up 100 times. It is hard for me to say how you measured this and what your test case actually is. Unless I am doing something stupid, I don't think MIR can be sped up 100 times.
>
> This speed is really amazing. Maybe I can check this later.

I don't know the internals of Cranelift, but I guess such speed can be achieved by direct translation (no CFG, no optimizations, only local RA, and no code selection), the way GNU Lightning works. But in that case the generated-code performance will be only about 1/3 of GCC/LLVM with -O2.

vnmakarov · Apr 19 '20 02:04

> Thank you for your test case. I really appreciate it. It is interesting; I did not expect that MIR would be used for such big code. It is a 500K-line module with about 2-3K lines per function.

I'm glad to know that my case is useful to you.

> I don't know the internals of Cranelift, but I guess such speed can be achieved by direct translation (no CFG, no optimizations, only local RA, and no code selection), the way GNU Lightning works. But in that case the generated-code performance will be only about 1/3 of GCC/LLVM with -O2.

In fact, I don't know much about compiler backends, but I agree with you: such fast compilation speed must come at the expense of generated-code performance.

Thanks again for your detailed answers and for such a great project!

xujuntwt95329 · Apr 19 '20 07:04

> I'm glad to know that my case is useful to you.

I did some profiling on your test case. There is definitely room for improvement. For example, for simplicity I use simple (dense) bitmaps. For bigger functions these can get big. Compilers like GCC usually use some form of sparse bitmap to deal efficiently with big functions, and rarely use simple bitmaps.
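For a sense of what "simple bitmaps" means here, a minimal dense-bitmap sketch (not MIR's actual bitmap code): with N pseudo-registers, every set operation walks all N/64 words no matter how few bits are set, which is exactly what sparse representations avoid.

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct { size_t nwords; uint64_t *words; } bitmap_t;

    bitmap_t *bitmap_create (size_t nbits) {
      bitmap_t *b = malloc (sizeof (bitmap_t));
      b->nwords = (nbits + 63) / 64;
      b->words = calloc (b->nwords, sizeof (uint64_t)); /* one bit per var */
      return b;
    }

    void bitmap_set (bitmap_t *b, size_t i) { b->words[i / 64] |= UINT64_C (1) << (i % 64); }
    int bitmap_test (const bitmap_t *b, size_t i) { return (b->words[i / 64] >> (i % 64)) & 1; }

    /* Dataflow operations such as live_out |= live_in touch every word: */
    int bitmap_ior (bitmap_t *dst, const bitmap_t *src) {
      int changed = 0;
      for (size_t w = 0; w < dst->nwords; w++) {
        uint64_t old = dst->words[w];
        dst->words[w] = old | src->words[w];
        changed |= dst->words[w] != old;
      }
      return changed;
    }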

I am also thinking about implementing much faster code generation (no CFG construction, no global analysis at all, only local RA). It is an interesting task.

I don't know when it will be done, maybe not even in the first release. But speeding up the MIR-generator is interesting work, and I'll definitely work on it.

vnmakarov · Apr 20 '20 14:04

> I don't know when it will be done, maybe not even in the first release. But speeding up the MIR-generator is interesting work, and I'll definitely work on it.

Nice to hear this; looking forward to future releases of MIR!

By the way, I've tested wasmtime (Cranelift) on my case, and I found that it uses multi-threading to speed up compilation. On my PC with a 16-core CPU, its total time is ~200 ms, but when I limit it to 1 CPU core, the total time is 1.507 s, nearly 8x slower than the 16-core run. I wonder whether it would be possible to speed up MIR by using threads? Of course, the 1.507 s on a single core is still much faster than the MIR and LLVM approaches.

And I'm a little confused about the slow compilation speed on my case. Although MIR passes may run slower on big functions, the total number of optimization passes should be much smaller than LLVM's, so there should still be a lot of headroom for MIR. Do you think there may be other factors affecting the compilation speed?

Thanks

xujuntwt95329 · Apr 20 '20 15:04

Hi,

I think that many projects that use LLVM have it linked in as a library, so there is no startup time. LLVM codegen is quite fast, as it has been tuned endlessly. I don't have directly comparable figures, as in Ravi the LLVM IR is generated directly but the MIR IR is generated via intermediate C code, so MIR has an extra step. However, in Ravi I use LLVM -O2 optimizations, set up via the PassManager. This means almost the same passes that C/C++ code would go through. Even so, LLVM is not slow.

Personally I think MIR should aim to be at least as fast as LLVM in linked-in mode (i.e., not via Clang invocation); I think there is no point trying to be faster. Instead, MIR should focus on:

a) being a very small library, as it is today
b) achieving 70% of -O2 performance in the generated code

Just my thoughts.

Regards, Dibyendu

dibyendumajumdar · Apr 20 '20 17:04

I should add that most projects have background compilation, so compilation speed is usually not an issue anyway. LLVM now has built-in multi-threaded lazy compilation on demand.

dibyendumajumdar · Apr 20 '20 17:04

> By the way, I've tested wasmtime (Cranelift) on my case, and I found that it uses multi-threading to speed up compilation. On my PC with a 16-core CPU, its total time is ~200 ms, but when I limit it to 1 CPU core, the total time is 1.507 s, nearly 8x slower than the 16-core run. I wonder whether it would be possible to speed up MIR by using threads?

MIR does not use threads, but its code can be adapted for parallel compilation in multiple threads. You just need to use a different context for each thread.
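A minimal sketch of what per-thread contexts could look like, reusing the llvm2mir/MIR_load_module calls from the snippet above; MIR_gen_init, MIR_link, and MIR_set_gen_interface are real MIR entry points, but their exact signatures have varied between MIR versions, and the work splitting here is hypothetical:

    #include <pthread.h>
    #include <llvm-c/Core.h>
    #include "mir.h"
    #include "mir-gen.h"
    #include "llvm2mir.h" /* assumed header of the llvm2mir converter */

    /* Hypothetical per-thread unit of work: one LLVM module chunk. */
    typedef struct { LLVMModuleRef llvm_module; } work_t;

    static void *compile_worker (void *arg) {
      work_t *w = arg;
      MIR_context_t ctx = MIR_init (); /* private context: no locking needed */
      MIR_module_t m = llvm2mir (ctx, w->llvm_module);
      MIR_load_module (ctx, m);
      MIR_gen_init (ctx);
      MIR_link (ctx, MIR_set_gen_interface, NULL); /* generate machine code */
      /* ... record the generated function addresses before tearing down ... */
      MIR_gen_finish (ctx);
      MIR_finish (ctx);
      return NULL;
    }

Each chunk would then be handed to pthread_create with compile_worker. The catch, raised below, is that code generated in one context cannot directly call functions generated in another.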

> Of course, the 1.507 s on a single core is still much faster than the MIR and LLVM approaches.

I see. So Cranelift is 14 times faster than MIR right now on your test. I guess implementing the faster MIR code generation I mentioned could achieve the same speed.

> And I'm a little confused about the slow compilation speed on my case. Although MIR passes may run slower on big functions, the total number of optimization passes should be much smaller than LLVM's, so there should still be a lot of headroom for MIR. Do you think there may be other factors affecting the compilation speed?

Yes, LLVM has a lot of optimizations (I'd guess about a hundred), but a big part of the time is spent in a few of them (e.g., RA and code selection). For example, RA in GCC consumes >20% of compilation time on a test I frequently use (a 500K-line Fortran program containing 7K subroutines).

vnmakarov · Apr 20 '20 18:04

> I should add that most projects have background compilation, so compilation speed is usually not an issue anyway. LLVM now has built-in multi-threaded lazy compilation on demand.

Thank you for the info about LLVM, Dibyendu.

vnmakarov · Apr 20 '20 18:04

> MIR does not use threads, but its code can be adapted for parallel compilation in multiple threads. You just need to use a different context for each thread.

Yes, we can use multiple threads to compile multiple contexts, but each context has to have its own MIR_module, which means I have to divide all my MIR functions into different modules. If func1 in module1 calls func2 in module2, how does the codegen know the actual address of func2 when module2 is in another context?

> I should add that most projects have background compilation, so compilation speed is usually not an issue anyway. LLVM now has built-in multi-threaded lazy compilation on demand.

Thanks @dibyendumajumdar for the information. In our runtime, we use LLVM to compile all the wasm functions before execution, so JIT compilation speed is important, as it influences the startup time of the application.

xujuntwt95329 · Apr 23 '20 05:04

Hi, re LLVM: I think you need to use the ORC v2 APIs to get multi-threading and lazy compilation. https://www.youtube.com/watch?v=MOQG5vkh9J8

dibyendumajumdar · Apr 23 '20 09:04

> Yes, we can use multiple threads to compile multiple contexts, but each context has to have its own MIR_module, which means I have to divide all my MIR functions into different modules. If func1 in module1 calls func2 in module2, how does the codegen know the actual address of func2 when module2 is in another context?

Currently there is no way to use functions generated from a different context. In a JIT environment this means that JITted code can only call JITted functions from the same context, already-existing run-time functions, or the interpreter.
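A small sketch of the second option (calling already-existing run-time functions): MIR_link takes an import-resolver callback that maps import names to addresses; the run-time helper named here is hypothetical.

    #include <string.h>
    #include "mir.h"
    #include "mir-gen.h"

    extern void runtime_trap (void); /* hypothetical pre-existing run-time function */

    /* Map an unresolved import name to the address of a run-time function. */
    static void *import_resolver (const char *name) {
      if (strcmp (name, "runtime_trap") == 0) return (void *) runtime_trap;
      return NULL; /* unknown import */
    }

    static void link_module (MIR_context_t ctx) {
      MIR_link (ctx, MIR_set_gen_interface, import_resolver);
    }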

In the future I am planning to permit calling JITted code from a different context. That requires implementing some thread-safe data structures. I am also planning to implement some MIR speculative insns (with some support for this in c2mir) which will collect run-time data; depending on the data, different JITted code will be generated. This too requires thread-safe data collection, as JITted code can be executed by different threads, so I was planning to implement both at the same time. Speculative insns will be necessary for implementing dynamic languages with better performance.

vnmakarov · Apr 23 '20 23:04

I'd like to give an update on this issue. Some details here are mostly notes for myself.

First of all, the MIR-generator can be used at different optimization levels. The current default optimization level is -O2. It is probably better to make -O1 the default, since on my x86-64 benchmarks going from -O1 to -O2 improves code performance by only about 3% (geomean) while slowing down the generator more than 2x.
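For embedders, the level is selectable per context; a sketch, assuming the MIR_gen_set_optimize_level hook (its exact signature has changed across MIR versions):

    #include "mir.h"
    #include "mir-gen.h"

    void init_gen_O1 (MIR_context_t ctx) {
      MIR_gen_init (ctx);
      MIR_gen_set_optimize_level (ctx, 1); /* -O1: RA, combiner, live analysis, DCE */
    }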

At -O1 the generator runs only RA, the combiner (code selection), live analysis (3 times), and dead-code elimination (several times). Most of the time (70%) is spent in live analysis and RA, in approximately equal parts.

So I removed one live-analysis pass that is unnecessary at -O1.

I also sped up RA almost 2x by improving the bitmap code and the major loop in assignment, and by adding live-range compression. I implemented sparse bitmaps, which should work better for big functions (2K MIR insns in the test case), but the result was the opposite: the code is more complicated, and this complication probably outweighs the bitmap sparseness. Maybe for much bigger functions they will be beneficial. In any case, I threw away the sparse-bitmap code (it can still be found in the git repository).

I've tried a couple of simple RA implementations without live analysis (one based on an approximation of the live-analysis computation, and another based on local RA), but the resulting code performance was so bad (sometimes 3 times worse) that I don't think it is worth having, even though the generator speed would improve about 3x by speeding up RA by 70% (local RA takes 1.6B insns vs. 2.7B insns for the current RA on the given test case) and removing live analysis completely. If I can find or design a good RA without live analysis, this approach could become viable.

So currently the generator speed on the given test case (in billions of executed insns) at different optimization levels, with the geomean code-performance improvement over the previous level on my benchmarks (make c2mir-bench), looks like:

| level | generator insns executed | code perf gain vs. previous level |
| ----- | ------------------------ | --------------------------------- |
| -O0 | 8.738B | |
| -O1 | 10.864B | +27% |
| -O2 | 25.178B | +3% |
| -O3 | 43.673B | +1% |

Probably the benchmarks I use (mostly from the old Computer Language Shootout) are not good enough, and I need better ones.

I should also look at improving live analysis. I've checked that the data-flow solver processes each BB about 1.3 times on the test case, which means it works OK (the right BB traversal order is used).
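To make that metric concrete, here is a toy backward liveness solver over a hardcoded four-block CFG (an illustration only, not MIR's solver). Seeding the worklist so that successors tend to be processed first lets most blocks stabilize on the first visit; re-visits come only from blocks whose live_in actually changed.

    #include <stdint.h>
    #include <stdio.h>

    #define N 4
    /* live_in[b]  = use[b] | (live_out[b] & ~def[b])
       live_out[b] = OR of live_in over the successors of b */
    static const int succ[N][2] = {{1, -1}, {2, 3}, {1, -1}, {-1, -1}};
    static const int pred[N][2] = {{-1, -1}, {0, 2}, {1, -1}, {1, -1}};
    static const uint64_t use[N] = {0x1, 0x2, 0x6, 0x2};
    static const uint64_t def[N] = {0x2, 0x4, 0x1, 0x0};

    int main (void) {
      uint64_t live_in[N] = {0}, live_out[N] = {0};
      int on_list[N], list[64], head = 0, tail = 0, visits = 0;
      for (int b = N - 1; b >= 0; b--) { list[tail++] = b; on_list[b] = 1; }
      while (head < tail) {
        int b = list[head++]; on_list[b] = 0; visits++;
        uint64_t out = 0;
        for (int s = 0; s < 2; s++)
          if (succ[b][s] >= 0) out |= live_in[succ[b][s]];
        uint64_t in = use[b] | (out & ~def[b]);
        live_out[b] = out;
        if (in != live_in[b]) { /* changed: predecessors must be re-examined */
          live_in[b] = in;
          for (int p = 0; p < 2; p++)
            if (pred[b][p] >= 0 && !on_list[pred[b][p]]) {
              list[tail++] = pred[b][p]; on_list[pred[b][p]] = 1;
            }
        }
      }
      printf ("average visits per block: %.2f\n", (double) visits / N); /* 1.25 here */
      return 0;
    }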

I can probably improve live analysis by excluding local vars from it. I don't know what the results will be, as I'd need a pass to identify local vars, and I don't know what bitmap sizes I get on the test case. In any case, I think it is worth trying.

vnmakarov · May 09 '20 16:05

Hi, I thought I would share the following timings. These results show the time taken to compile and execute thousands of Ravi/Lua functions in the Ravi test suite. For MIR, the C2MIR front-end is used. Both the MIR and LLVM backends are run with -O2. I am using the LLVM PassManager to set up the optimization passes; I believe these are the same as those used by front-ends such as Clang.

  • MIR

        real	1m58.523s
        user	1m57.161s
        sys	0m1.340s

  • LLVM 10 ORC v2 API

        real	2m26.194s
        user	2m25.308s
        sys	0m0.852s

There is a mix of compilation and execution time here, as the overall result depends on both factors.

dibyendumajumdar · May 25 '20 14:05

Dibyendu, thank you for the numbers.

Actually, I did not expect that C2MIR would be faster than LLVM. As I wrote, the C2MIR design was not about compilation speed but more about simplicity.

As for the MIR code generator, I've just added a fast generator. Its speed on the test sent by Xu is 4-5 times faster than generation with -O1/-O2. The generated-code performance is about the same as Clang/GCC with -O0 (sometimes better, sometimes worse).

vnmakarov · May 25 '20 16:05

> As for the MIR code generator, I've just added a fast generator. Its speed on the test sent by Xu is 4-5 times faster than generation with -O1/-O2. The generated-code performance is about the same as Clang/GCC with -O0 (sometimes better, sometimes worse).

Yes, I noticed that you now have a fast generator. I have not tried it yet, but I suspect the generated-code performance will be suboptimal?

dibyendumajumdar · May 25 '20 16:05

> Yes, I noticed that you now have a fast generator. I have not tried it yet, but I suspect the generated-code performance will be suboptimal?

The performance of the code generated by the fast gen will be 2-3 times worse (as it is for GCC -O0 vs. GCC -O1/-O2).

vnmakarov · May 25 '20 17:05

> As for the MIR code generator, I've just added a fast generator. Its speed on the test sent by Xu is 4-5 times faster than generation with -O1/-O2. The generated-code performance is about the same as Clang/GCC with -O0 (sometimes better, sometimes worse).

This is awesome! We will test it for WAMR in our environment. Thanks for the great project! @vnmakarov

Kevin0626 · May 25 '20 23:05

@vnmakarov Is it possible to support parallel fast gen with multi-threading?

xwang98 · Nov 19 '21 23:11