llvm icon indicating copy to clipboard operation
llvm copied to clipboard

Performance

Open mewmew opened this issue 7 years ago • 4 comments

This issue is intended to profile the performance of the llir/llvm library, measure it against the official LLVM distribution and evaluate different methods for improving the performance.

This is a continuation of https://github.com/mewmew/l/issues/6

The benchmark suite is at https://github.com/decomp/testdata. Specifically, the LLVM IR assembly of these projects are used in the benchmark:

Below follows a first evaluation of using concurrency to speed up parsing. The evaluation is based on a very naiive implementation of concurrency, just to get some initial runtime numbers. It is based on llir/llvm@30113964203b046c328dd2bc2da99fee63a5403d of the development branch, and subsets of the following patch has been applied https://gist.github.com/mewmew/d127b562fdd8f560222b4ded739861a7

Official LLVM results

For comparison, below are the runtime results of the opt tool from the official LLVM distribution (using opt -verify foo.ll).

Coreutils

real 8.18
user 7.22
sys 0.88

SQLite

real 1.90
user 1.73
sys 0.13

llir/llvm results

Coreutils

no concurrency

total time for file "testdata/coreutils/testdata/yes.ll": 55.744113ms
real 11.54
user 14.70
sys 0.16

concurrent translateTopLevelEntities

  • 4 go-routines with waitgroup in translateTopLevelEntities
total time for file "testdata/coreutils/testdata/yes.ll": 53.49785ms
real 10.28
user 16.06
sys 0.15

concurrent translateGlobals

  • 2 go-routines with waitgroup in translateGlobals (for global and function definitions)
total time for file "testdata/coreutils/testdata/yes.ll": 55.567134ms
real 9.83
user 17.18
sys 0.17

concurrent translateTopLevelEntities and translateGlobals

  • 4 go-routines with waitgroup in translateTopLevelEntities
  • 2 go-routines with waitgroup in translateGlobals (for global and function definitions)
total time for file "testdata/coreutils/testdata/yes.ll": 58.474581ms
real 9.23
user 18.08
sys 0.16

SQLite3

no concurrency

total time for file "shell.ll": 3.147106433s
real 3.18
user 3.86
sys 0.32

concurrent translateTopLevelEntities

  • 4 go-routines with waitgroup in translateTopLevelEntities
total time for file "shell.ll": 2.848574349s
real 2.88
user 4.67
sys 0.32

concurrent translateGlobals

  • 2 go-routines with waitgroup in translateGlobals (for global and function definitions)
total time for file "testdata/sqlite/testdata/shell.ll": 2.86919391s
real 2.90
user 4.90
sys 0.32

concurrent translateTopLevelEntities and translateGlobals

  • 4 go-routines with waitgroup in translateTopLevelEntities
  • 2 go-routines with waitgroup in translateGlobals (for global and function definitions)
total time for file "shell.ll": 2.897873366s
real 2.93
user 4.79
sys 0.33

mewmew avatar Nov 09 '18 03:11 mewmew

The following comparison is based on rev 9c2e8ca9680265961af125c012a0ecde0042ce7d.

llir is 1.85x slower than LLVM to parse shell.ll of SQLite.

llir is 1.55x slower than LLVM to parse vdir.ll of Coreutils.

This puts us within 2x of LLVM, which should be good enough for the v0.3.0 release.

It would be great to improve this in the future! And there is definitely some low hanging fruit to be found. @pwaller has gotten a ~10% speed increase by making changes to Textmapper, so that the parser uses values instead of pointers for subtype (%interface) AST nodes, which puts less preasure on the Go garbage collector.

@pwaller, would you mind creating an issue to track the Textmapper work above so we don't loose it? Do you have the diff somewhere of what was needed? :)

Edit: There is also work to reduce allocations and improve performance of LLVM IR output, by defining a WriteTo methodl (see #41).

The specifics for calculating the llir vs LLVM slowdown factor is presented below.

SQLite

LLVM

average real time: 1.69

$ time opt -verify < sqlite/test/shell.ll

real 1.69
user 1.58
sys 0.09

$ time opt -verify < sqlite/test/shell.ll

real 1.68
user 1.54
sys 0.12

$ time opt -verify < sqlite/test/shell.ll

real 1.70
user 1.57
sys 0.10

llir

average real time: 3.12

$ time l-tm sqlite/test/shell.ll
asm: parsing into AST took: 1.318785166s
asm: index AST top-level entities took: 105.619707ms
asm: type resolution took: 1.635337ms
asm: create IR top-level entities took: 49.509083ms
asm: translate AST to IR took: 1.630427198s
asm: add IR definitions to IR module took: 50.121644ms
total time for file "sqlite/test/shell.ll": 3.193751643s

real 3.22
user 3.67
sys 0.25

$ time l-tm sqlite/test/shell.ll
asm: parsing into AST took: 1.332296208s
asm: index AST top-level entities took: 105.263225ms
asm: type resolution took: 1.654263ms
asm: create IR top-level entities took: 49.985311ms
asm: translate AST to IR took: 1.628654783s
asm: add IR definitions to IR module took: 49.276926ms
total time for file "sqlite/test/shell.ll": 3.205750564s

real 3.23
user 3.64
sys 0.29

$ time l-tm sqlite/test/shell.ll
asm: parsing into AST took: 1.333703175s
asm: index AST top-level entities took: 109.893044ms
asm: type resolution took: 1.653571ms
asm: create IR top-level entities took: 51.090025ms
asm: translate AST to IR took: 1.327419262s
asm: add IR definitions to IR module took: 49.214236ms
total time for file "sqlite/test/shell.ll": 2.899078532s

real 2.92
user 3.74
sys 0.28

Coreutils

LLVM

average real time: 0.22

$ time opt -verify < coreutils/test/vdir.ll

real 0.24
user 0.20
sys 0.03

$ time opt -verify < coreutils/test/vdir.ll

real 0.22
user 0.19
sys 0.02

$ time opt -verify < coreutils/test/vdir.ll

real 0.21
user 0.20
sys 0.01

llir

average real time: 0.34

$ time l-tm coreutils/test/vdir.ll
asm: parsing into AST took: 183.7889ms
asm: index AST top-level entities took: 7.405762ms
asm: type resolution took: 135.637µs
asm: create IR top-level entities took: 4.598706ms
asm: translate AST to IR took: 147.208001ms
asm: add IR definitions to IR module took: 3.592691ms
total time for file "coreutils/test/vdir.ll": 355.543323ms

real 0.36
user 0.54
sys 0.03

$ time l-tm coreutils/test/vdir.ll
asm: parsing into AST took: 168.837893ms
asm: index AST top-level entities took: 9.344586ms
asm: type resolution took: 179.538µs
asm: create IR top-level entities took: 4.867541ms
asm: translate AST to IR took: 125.596136ms
asm: add IR definitions to IR module took: 3.579648ms
total time for file "coreutils/test/vdir.ll": 322.248448ms

real 0.32
user 0.40
sys 0.04

$ time l-tm coreutils/test/vdir.ll
asm: parsing into AST took: 166.491659ms
asm: index AST top-level entities took: 9.805071ms
asm: type resolution took: 244.146µs
asm: create IR top-level entities took: 11.506275ms
asm: translate AST to IR took: 130.762989ms
asm: add IR definitions to IR module took: 3.573189ms
total time for file "coreutils/test/vdir.ll": 332.843371ms

real 0.33
user 0.38
sys 0.05

mewmew avatar Dec 07 '18 21:12 mewmew

As we've reached the hand-wavy ~2x slowdown target for the v0.3.0 release, I've changed the milestone to future, as we will constantly seek to improve the performance of llir. It would be really interesting also to explore using concurrency to spin up the cores of multicore machines. I tried a naive approach for this and it seems to be a valid approach for bringing wall clock run time improvements. We just need to figure out how to do it in a less naive way, and also to find the trade-off factors for when we should spin up Go-routines and how many to run. We can define threshold limits for these.

Challenge: beat LLVM wall time performance using concurrency

Anyone interested in experimenting a bit, have a look at asm/translate.go which defines the high-level AST to IR translation order. In it, a few good candidates for concurrent execution have been identified:

Order of translation

Note: step 3 and the substeps of 4a can be done concurrently. Note: the substeps of 4a can be done concurrently. Note: the substeps of 4b can be done concurrently. Note: steps 5-7 can be done concurrently. Note: the substeps of 8 can be done concurrently.

  1. Index AST top-level entities.

  2. Resolve IR type definitions.

    a. Index type identifiers and create scaffolding IR type definitions (without bodies).

    b. Translate AST type definitions to IR.

  3. Translate AST comdat definitions to IR.

Note: step 3 and the substeps of 4a can be done concurrently.

  1. Resolve remaining IR top-level entities.

    a. Index top-level identifiers and create scaffolding IR top-level declarations and definitions (without bodies but with types).

    Note: the substeps of 4a can be done concurrently.

    1. Index global identifiers and create scaffolding IR global declarations and definitions, indirect symbol definitions, and function declarations and definitions (without bodies but with types).

    2. Index attribute group IDs and create scaffolding IR attribute group definitions (without bodies).

    3. Index metadata names and create scaffolding IR named metadata definitions (without bodies).

    4. Index metadata IDs and create scaffolding IR metadata definitions (without bodies).

    b. Translate AST top-level declarations and definitions to IR.

    Note: the substeps of 4b can be done concurrently.

    1. Translate AST global declarations and definitions, indirect symbol definitions, and function declarations and definitions to IR.

    2. Translate AST attribute group definitions to IR.

    3. Translate AST named metadata definitions to IR.

    4. Translate AST metadata definitions to IR.

Note: steps 5-7 can be done concurrenty.

  1. Translate use-list orders.

  2. Translate basic block specific use-list orders.

  3. Fix basic block references in blockaddress constants.

  4. Add IR top-level declarations and definitions to the IR module in order of occurrence in the input.

    Note: the substeps of 8 can be done concurrently.

    a. Add IR type definitions to the IR module in alphabetical order.

    b. Add IR comdat definitions to the IR module in alphabetical order.

    c. Add IR global variable declarations and definitions, indirect symbol definitions, and function declarations and definitions to the IR module in order of occurrence in the input.

    d. Add IR attribute group definitions to the IR module in numeric order.

    e. Add IR named metadata definitions to the IR module in order of occurrence in the input.

    f. Add IR metadata definitions to the IR module in numeric order.

mewmew avatar Dec 07 '18 21:12 mewmew

It would be great to improve this in the future! And there is definitely some low hanging fruit to be found. @pwaller has gotten a ~10% speed increase by making changes to Textmapper, so that the parser uses values instead of pointers for subtype (%interface) AST nodes, which puts less preasure on the Go garbage collector.

Filed: https://github.com/inspirer/textmapper/issues/29

pwaller avatar Dec 09 '18 15:12 pwaller

Moving performance related note from TODO.md to this issue.

  • figure out how to remove backtracking table in lexer (ref: http://textmapper.org/documentation.html#backtracking-and-invalid-tokens)

mewmew avatar Dec 29 '19 00:12 mewmew