js_of_ocaml
[PERF] Dominance frontier calculation appears to be O(n^2)
A large jsoo application I've encountered at work spends about 9 minutes compiling with js_of_ocaml using direct-compilation (separate compilation is much faster). Profiling the compiler shows that almost all of this time is spent in generate.ml, specifically computing dominance frontiers.
I believe that the issue is that although dominance frontier calculation can be done for the whole program in one pass, the implementation in generate.ml will compute the frontier for each node individually, resulting in no opportunity to share results.
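For context, the standard whole-program approach (the Cooper–Harvey–Kennedy algorithm) computes the frontiers of all nodes in a single sweep over the join points, given precomputed immediate dominators, so no per-node work is repeated. A minimal sketch of that algorithm (this is my own illustration, not jsoo's code; the array-based `idom`/`preds` representation is invented):

```ocaml
(* Sketch of the Cooper-Harvey-Kennedy dominance-frontier algorithm.
   Not jsoo's actual code: nodes are ints 0..n-1, [idom.(v)] is the
   immediate dominator of [v], [preds.(v)] lists v's CFG predecessors. *)
let dominance_frontiers ~idom ~preds =
  let n = Array.length idom in
  let df = Array.make n [] in
  for b = 0 to n - 1 do
    (* Only join points (two or more predecessors) contribute entries. *)
    if List.length preds.(b) >= 2 then
      List.iter
        (fun p ->
          (* Walk up from each predecessor to b's immediate dominator,
             adding b to the frontier of every node along the way. *)
          let runner = ref p in
          while !runner <> idom.(b) do
            if not (List.mem b df.(!runner)) then
              df.(!runner) <- b :: df.(!runner);
            runner := idom.(!runner)
          done)
        preds.(b)
  done;
  df
```

For a diamond CFG (0 → 1, 0 → 2, 1 → 3, 2 → 3, with 0 as entry), this yields frontier {3} for nodes 1 and 2 and empty frontiers elsewhere, as expected.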
I suspect there's something super-linear going on here, because a from-scratch separate-compilation build of the same project spends ~20x less time on dominance frontier calculation.
Can you share some code reproducing the issue? What's the size of the bytecode file? Have you tried --disable inline?
Can you share some code reproducing the issue?
Sadly, I cannot.
What's the size of the bytecode file?
112MB
Have you tried --disable inline?
I have not, though I can try it out and see what happens. I should point out that it's using direct-compilation, but doesn't use any of the other "optimization" flags.
From debugging this a few weeks ago, I remember that the primary issue was that a region of the dominator tree is effectively a singly-linked list (a node in the dominator tree has a single child, which has a single child, which has a single child, and so on), and that most of the time is spent traversing it. Is it possible that this is the list of constants/globals, or maybe the toplevel statements?
Have you tried --disable inline?
It appears to make no difference
remember that the primary issue was that a region of the dominator tree is effectively just a singly-linked list
Because of this, I think the dominator tree depth was somewhere on the order of ~100,000
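To see why a chain-shaped dominator tree hurts a per-node computation: if every node's frontier query walks its idom chain independently, a chain of depth n costs about n²/2 hops in total. A toy cost model (my own illustration, not jsoo's code):

```ocaml
(* Rough cost model: each of [n] nodes in a chain-shaped dominator
   tree independently walks up to the root.  The total number of
   idom hops is n*(n-1)/2, i.e. quadratic in the chain depth. *)
let chain_walk_steps n =
  let steps = ref 0 in
  for start = 0 to n - 1 do
    let runner = ref start in
    while !runner > 0 do
      incr steps;             (* one hop toward the root *)
      runner := !runner - 1   (* idom of node k in a chain is k-1 *)
    done
  done;
  !steps
```

At n = 1000 this is 499,500 hops; at the observed depth of ~100,000 it would be on the order of 5 × 10⁹ hops, which is consistent with minutes spent in frontier calculation.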
THE PLOT THICKENS
Passing --sourcemap makes basically every pass in the compiler dramatically slower, including codegen.
With --sourcemap
parsing js: 0.19
Start parsing...
parsing: 13.50
Start Optimizing...
tail calls: 0.13
flow analysis 1: 1.52
flow analysis 2: 1.07
flow analysis 3: 0.25
flow analysis 4: 1.23
flow analysis 5: 4.42
flow analysis: 8.49
dead code elim.: 1.48
inlining: 0.73
dead code elim.: 1.61
tail calls: 0.08
phi-simpl. 1: 0.48
phi-simpl. 2: 0.56
phi-simpl.: 1.24
flow analysis 1: 0.30
flow analysis 2: 0.33
flow analysis 3: 0.21
flow analysis 4: 0.38
flow analysis 5: 0.85
flow analysis: 2.06
dead code elim.: 1.07
inlining: 0.31
dead code elim.: 0.79
flow analysis 1: 0.14
flow analysis 2: 0.20
flow analysis 3: 0.18
flow analysis 4: 0.27
flow analysis 5: 0.53
flow analysis: 1.32
dead code elim.: 1.14
inlining: 0.34
dead code elim.: 0.85
phi-simpl. 1: 0.14
phi-simpl. 2: 0.20
phi-simpl.: 0.53
flow analysis 1: 0.39
flow analysis 2: 0.39
flow analysis 3: 0.39
flow analysis 4: 0.42
flow analysis 5: 0.47
flow analysis: 2.07
free vars: 0.45
dead code elim.: 0.81
optimizations : 27.86
Start Generation...
// 10 minutes pass and then it stack overflows
Without --sourcemap
parsing js: 0.19
Start parsing...
parsing: 0.84
Start Optimizing...
tail calls: 0.07
flow analysis 1: 0.43
flow analysis 2: 0.60
flow analysis 3: 0.08
flow analysis 4: 0.40
flow analysis 5: 0.43
flow analysis: 1.94
dead code elim.: 0.71
inlining: 0.41
dead code elim.: 0.81
tail calls: 0.08
phi-simpl. 1: 0.46
phi-simpl. 2: 0.39
phi-simpl.: 1.00
flow analysis 1: 0.24
flow analysis 2: 0.29
flow analysis 3: 0.10
flow analysis 4: 0.19
flow analysis 5: 0.34
flow analysis: 1.16
dead code elim.: 0.54
inlining: 0.29
dead code elim.: 0.57
flow analysis 1: 0.35
flow analysis 2: 0.25
flow analysis 3: 0.08
flow analysis 4: 0.11
flow analysis 5: 0.16
flow analysis: 0.94
dead code elim.: 0.50
inlining: 0.25
dead code elim.: 0.58
phi-simpl. 1: 0.18
phi-simpl. 2: 0.12
phi-simpl.: 0.65
flow analysis 1: 0.26
flow analysis 2: 0.15
flow analysis 3: 0.08
flow analysis 4: 0.10
flow analysis 5: 0.18
flow analysis: 0.78
free vars: 0.30
dead code elim.: 0.52
optimizations : 13.46
Start Generation...
code gen.: 10.80
Start Linking...
linking: 0.53
Start Optimizing js...
share constant: 0.74
simpl: 0.31
clean: 0.26
shortten vars: 1.28
optimizing: 3.24
Start Coloring...
coloring: 5.79
Start Checks...
checks: 1.21
Start Writing file...
write: 0.28
compilation: 36.35
The same is true for jsoo link
with --sourcemap: 5.1s
without --sourcemap: 0.5s
without --sourcemap and all input .js files don't have a sourcemap comment: 0.08s
This one doesn't surprise me as much; most of what I want jsoo-link to do is stitch up sourcemaps correctly, but I wasn't expecting a full order of magnitude difference.
without --sourcemap: 0.5s
without --sourcemap and all input .js files don't have a sourcemap comment: 0.08s
#1313 should reduce this difference.
with --sourcemap: 5.1s
without --sourcemap: 0.5s
It would be interesting to get numbers about the size of the debug info. Can you check the impact of #1313 on these timings? I suspect it won't change much.
Passing --sourcemap makes basically every pass in the compiler dramatically slower, including codegen.
Passing --sourcemap forces the debug info to be read and used. You can achieve the same thing with --debug-info or --pretty.
Could you report timings when using --debug-info and --pretty separately?
I suspect that https://github.com/ocsigen/js_of_ocaml/blob/9ea685219b3210f73dcf48c66a19b79ed8a3fa19/compiler/lib/parse_bytecode.ml#L400 is what cost you the most (triggered by --debug-info and --source-map, but not --pretty). It creates lots of "artificial" blocks for the purpose of holding more debug-info.
The change was introduced in https://github.com/ocsigen/js_of_ocaml/commit/08124c05888dc6cd33d4fa261dcd9803e6864b77.
https://github.com/ocsigen/js_of_ocaml/blob/9ea685219b3210f73dcf48c66a19b79ed8a3fa19/compiler/lib/generate.ml#L1359 is where we map the pc of a block to its location info.
If my understanding is correct, we should be able to decouple how we track debug info from how we split blocks. Maybe @vouillon has more to say about this.
Can you check the impact of https://github.com/ocsigen/js_of_ocaml/pull/1313 on these timings? I suspect it won't change much.
I'll try it out
For what it's worth, I think my original diagnosis that dominance frontier calculation is quadratic is still correct. This recent development with debug info shows that every pass gets ~2x slower, but codegen time grows quadratically with code size.
Here is a minimal repro case.
500 consecutive function calls
$ ocaml <(echo "for i = 0 to 500 do print_endline \"print_char '.';\"; done") > a.ml && ocamlc -g a.ml -o a.byte && js_of_ocaml --debuginfo --debug times a.byte
...
Start Generation...
code gen.: 0.03
...
compilation: 0.17
5_000 consecutive function calls
$ ocaml <(echo "for i = 0 to 5000 do print_endline \"print_char '.';\"; done") > a.ml && ocamlc -g a.ml -o a.byte && js_of_ocaml --debuginfo --debug times a.byte
...
Start Generation...
code gen.: 3.14
...
compilation: 3.56
10_000 consecutive function calls
$ ocaml <(echo "for i = 0 to 10000 do print_endline \"print_char '.';\"; done") > a.ml && ocamlc -g a.ml -o a.byte && js_of_ocaml --debuginfo --debug times a.byte
...
Start Generation...
code gen.: 14.13
...
compilation: 14.97
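The code-gen timings above fit the quadratic hypothesis. As a quick back-of-the-envelope check (the helper below is mine, not part of jsoo), the scaling exponent between two measurements (n₁, t₁) and (n₂, t₂) is log(t₂/t₁) / log(n₂/n₁):

```ocaml
(* Estimate the scaling exponent from two (size, time) measurements:
   log(t2/t1) / log(n2/n1).  An exponent near 2.0 indicates
   quadratic behaviour. *)
let exponent (n1, t1) (n2, t2) =
  log (t2 /. t1) /. log (float_of_int n2 /. float_of_int n1)
```

With the numbers above, `exponent (5000, 3.14) (10000, 14.13)` gives about 2.17, and `exponent (500, 0.03) (10000, 14.13)` gives about 2.05, both close to 2.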