js_of_ocaml icon indicating copy to clipboard operation
js_of_ocaml copied to clipboard

[PERF] Dominance frontier calculation appears to be O(n^2)

Open TyOverby opened this issue 3 years ago • 6 comments

A large jsoo application I've encountered at work spends about 9 minutes compiling with js_of_ocaml using direct-compilation (separate compilation is much faster). Profiling the compiler shows that almost all of this time is spent in generate.ml, specifically computing dominance frontiers.

I believe that the issue is that although dominance frontier calculation can be done for the whole program in one pass, the implementation in generate.ml will compute the frontier for each node individually, resulting in no opportunity to share results.

TyOverby avatar Sep 02 '22 02:09 TyOverby

I suspect that there’s something super linear going on here because a from-scratch separate-compilation build of the same project spends ~20x less time doing dominance frontier calculation

TyOverby avatar Sep 02 '22 02:09 TyOverby

Can you share some code reproducing the issue ? What's the size of the bytecode file ? Have you tried --disable inline ?

hhugo avatar Sep 02 '22 08:09 hhugo

Can you share some code reproducing the issue?

Sadly I can not

What's the size of the bytecode file?

112MB

Have you tried --disable inline?

I have not, though I can try it out and see what happens. I should point out that it's using direct-compilation, but doesn't use any of the other "optimization" flags.

TyOverby avatar Sep 02 '22 13:09 TyOverby

From debugging this a few weeks ago, I remember that the primary issue was that a region of the dominator tree is effectively just a singly-linked list (a node in the dominator tree has a single child, which has a single child, which has a single child, etc...) that it spends most of its time traversing. Is it possible that this is the list of constants/globals or maybe the toplevel statements?

TyOverby avatar Sep 02 '22 13:09 TyOverby

Have you tried --disable inline?

I have not, though I can try it out and see what happens. I should point out that it's using direct-compilation, but doesn't use any of the other "optimization" flags.

It appears to make no difference

TyOverby avatar Sep 02 '22 13:09 TyOverby

remember that the primary issue was that a region of the dominator tree is effectively just a singly-linked list

Because of this, I think the dominator tree depth was somewhere on the order of ~100,000

TyOverby avatar Sep 02 '22 13:09 TyOverby

THE PLOT THICKENS

Passing --sourcemap makes basically every pass in the compiler dramatically slower, including codegen.

With --sourcemap

  parsing js: 0.19
Start parsing...
  parsing: 13.50
Start Optimizing...
  tail calls: 0.13
    flow analysis 1: 1.52
    flow analysis 2: 1.07
    flow analysis 3: 0.25
    flow analysis 4: 1.23
    flow analysis 5: 4.42
  flow analysis: 8.49
  dead code elim.: 1.48
  inlining: 0.73
  dead code elim.: 1.61
  tail calls: 0.08
    phi-simpl. 1: 0.48
    phi-simpl. 2: 0.56
  phi-simpl.: 1.24
    flow analysis 1: 0.30
    flow analysis 2: 0.33
    flow analysis 3: 0.21
    flow analysis 4: 0.38
    flow analysis 5: 0.85
  flow analysis: 2.06
  dead code elim.: 1.07
  inlining: 0.31
  dead code elim.: 0.79
    flow analysis 1: 0.14
    flow analysis 2: 0.20
    flow analysis 3: 0.18
    flow analysis 4: 0.27
    flow analysis 5: 0.53
  flow analysis: 1.32
  dead code elim.: 1.14
  inlining: 0.34
  dead code elim.: 0.85
    phi-simpl. 1: 0.14
    phi-simpl. 2: 0.20
  phi-simpl.: 0.53
    flow analysis 1: 0.39
    flow analysis 2: 0.39
    flow analysis 3: 0.39
    flow analysis 4: 0.42
    flow analysis 5: 0.47
  flow analysis: 2.07
  free vars: 0.45
  dead code elim.: 0.81
 optimizations : 27.86
Start Generation...
// 10 minutes pass and then it stack overflows

Without --sourcemap

  parsing js: 0.19
Start parsing...
  parsing: 0.84
Start Optimizing...
  tail calls: 0.07
    flow analysis 1: 0.43
    flow analysis 2: 0.60
    flow analysis 3: 0.08
    flow analysis 4: 0.40
    flow analysis 5: 0.43
  flow analysis: 1.94
  dead code elim.: 0.71
  inlining: 0.41
  dead code elim.: 0.81
  tail calls: 0.08
    phi-simpl. 1: 0.46
    phi-simpl. 2: 0.39
  phi-simpl.: 1.00
    flow analysis 1: 0.24
    flow analysis 2: 0.29
    flow analysis 3: 0.10
    flow analysis 4: 0.19
    flow analysis 5: 0.34
  flow analysis: 1.16
  dead code elim.: 0.54
  inlining: 0.29
  dead code elim.: 0.57
    flow analysis 1: 0.35
    flow analysis 2: 0.25
    flow analysis 3: 0.08
    flow analysis 4: 0.11
    flow analysis 5: 0.16
  flow analysis: 0.94
  dead code elim.: 0.50
  inlining: 0.25
  dead code elim.: 0.58
    phi-simpl. 1: 0.18
    phi-simpl. 2: 0.12
  phi-simpl.: 0.65
    flow analysis 1: 0.26
    flow analysis 2: 0.15
    flow analysis 3: 0.08
    flow analysis 4: 0.10
    flow analysis 5: 0.18
  flow analysis: 0.78
  free vars: 0.30
  dead code elim.: 0.52
 optimizations : 13.46
Start Generation...
  code gen.: 10.80
Start Linking...
  linking: 0.53
Start Optimizing js...
    share constant: 0.74
    simpl: 0.31
    clean: 0.26
    shortten vars: 1.28
  optimizing: 3.24
Start Coloring...
  coloring: 5.79
Start Checks...
  checks: 1.21
Start Writing file...
  write: 0.28
compilation: 36.35

TyOverby avatar Oct 12 '22 22:10 TyOverby

The same is true for jsoo link

with --sourcemap: 5.1s without --sourcemap: 0.5s without --sourcemap and all input .js files don't have a sourcemap comment: 0.08s

TyOverby avatar Oct 12 '22 22:10 TyOverby

This one doesn't surprise me as much; most of what I want jsoo-link to do is stitch-up sourcemaps correctly, but I wasn't expecting a full order of magnitude

TyOverby avatar Oct 12 '22 22:10 TyOverby

without --sourcemap: 0.5s without --sourcemap and all input .js files don't have a sourcemap comment: 0.08s

#1313 should reduce this difference.

with --sourcemap: 5.1s without --sourcemap: 0.5s

It would be interesting to get numbers about the size of the debug info. Can you check the impact of #1313 on these timings ? I suspect it won't change much.

hhugo avatar Oct 13 '22 10:10 hhugo

Passing --sourcemap makes basically every pass in the compiler dramatically slower, including codegen.

passing --sourcemap forces the debug info to be read and used. You can achieve the same thing with --debug-info or --pretty.

Could you report timings when using --debug-info and --pretty separately ?

I suspect that https://github.com/ocsigen/js_of_ocaml/blob/9ea685219b3210f73dcf48c66a19b79ed8a3fa19/compiler/lib/parse_bytecode.ml#L400 is what cost you the most (triggered by --debug-info and --source-map, but not --pretty). It creates lots of "artificial" blocks for the purpose of holding more debug-info. The change was Introduced in https://github.com/ocsigen/js_of_ocaml/commit/08124c05888dc6cd33d4fa261dcd9803e6864b77https://github.com/ocsigen/js_of_ocaml/commit/08124c05888dc6cd33d4fa261dcd9803e6864b77.

https://github.com/ocsigen/js_of_ocaml/blob/9ea685219b3210f73dcf48c66a19b79ed8a3fa19/compiler/lib/generate.ml#L1359 is where we map the pc of a block to its location info.

If my understanding is correct, we should be able to disconnect how we track debug infos from how we split blocks. Maybe @vouillon has more to say about this.

hhugo avatar Oct 13 '22 11:10 hhugo

Can you check the impact of https://github.com/ocsigen/js_of_ocaml/pull/1313 on these timings ? I suspect it won't change much.

I'll try it out

TyOverby avatar Oct 13 '22 12:10 TyOverby

For what it's worth, I think that my original diagnosis that dominance frontier calculation is quadratic is still correct. This recent development with debuginfo shows that every pass gets ~2x slower, but codegen gets slower quadratically with respect to code size.

TyOverby avatar Oct 13 '22 12:10 TyOverby

Here is a minimal repro case.

500 consecutive function calls

$ ocaml <(echo "for i = 0 to 500  do print_endline \"print_char '.';\"; done") > a.ml && ocamlc -g a.ml -o a.byte && js_of_ocaml --debuginfo --debug times a.byte
...
Start Generation...
  code gen.: 0.03
...
compilation: 0.17

5_000 consecutive function calls

$ ocaml <(echo "for i = 0 to 5000  do print_endline \"print_char '.';\"; done") > a.ml && ocamlc -g a.ml -o a.byte && js_of_ocaml --debuginfo --debug times a.byte
...
Start Generation...
  code gen.: 3.14
...
compilation: 3.56

10_000 consecutive function calls

$ ocaml <(echo "for i = 0 to 10000  do print_endline \"print_char '.';\"; done") > a.ml && ocamlc -g a.ml -o a.byte && js_of_ocaml --debuginfo --debug times a.byte
...
Start Generation...
  code gen.: 14.13
...
compilation: 14.97

hhugo avatar Oct 18 '22 12:10 hhugo