
Make compilation faster :rocket:

kripken opened this issue 2 years ago • 31 comments

I did some investigation on the speed of compilation. This is getting important because of things like j2cl output, which is a 50MB real-world file.

Slowest passes: (in -O3; note that some passes run more than once)

[PassRunner]   running pass: precompute-propagate...           14.23460 seconds.
[PassRunner]   running pass: dae...                            10.91160 seconds.
[PassRunner]   running pass: heap2local...                     10.44350 seconds.
[PassRunner]   running pass: precompute-propagate...            9.07261 seconds.
[PassRunner]   running pass: inlining...                        8.02837 seconds.
[PassRunner]   running pass: once-reduction...                  8.02484 seconds.
[PassRunner]   running pass: vacuum...                          5.16162 seconds.
[PassRunner]   running pass: vacuum...                          3.64088 seconds.
[PassRunner]   running pass: vacuum...                          3.58191 seconds.
[PassRunner]   running pass: vacuum...                          3.61897 seconds.

perf reports that almost 15% of time is spent in malloc/free - quite a lot. I counted how many mallocs (note: not the size of them) are done in each pass; here are the heaviest (note: timings here include the overhead of counting mallocs atomically across threads, so the timings matter less):

    once-reduction...                 9.51466 seconds.  141558196 mallocs. 
    precompute-propagate...           14.7426 seconds.   61611533 mallocs. 
    dae...                            11.3789 seconds.   53792948 mallocs. 
    precompute-propagate...           9.69919 seconds.   39531493 mallocs. 
    heap2local...                     10.5116 seconds.   38582302 mallocs. 
    local-cse...                      4.70765 seconds.   33454468 mallocs. 
    cfp...                            2.43619 seconds.   32292682 mallocs. 
    ssa-nomerge...                    4.92099 seconds.   26378117 mallocs. 
    simplify-locals-nostructure...    6.03281 seconds.   22663794 mallocs. 
    vacuum...                         5.10813 seconds.   23880789 mallocs. 
    inlining...                       107.781 seconds.   20940365 mallocs. 
    dce...                            3.84722 seconds.   20025273 mallocs. 
    local-subtyping...                3.44374 seconds.   19067529 mallocs. 

and vacuum runs 3 more times:

    vacuum...                         3.71648 seconds.   16648663 mallocs. 
    vacuum...                         3.86917 seconds.   16302767 mallocs. 
    vacuum...                         3.89855 seconds.   16162915 mallocs. 

Help in speeding things up would be very welcome!

kripken avatar Sep 17 '21 23:09 kripken

I think precompute-propagate could also be sped up if the interpreter avoided exceptions (see https://github.com/WebAssembly/binaryen/issues/2917#issuecomment-715985374) and used a monadic Expect / Result instead.
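A minimal sketch of that monadic-result idea, assuming nothing about Binaryen's actual interpreter types (Trap, Result, and divS here are all illustrative):

```cpp
#include <cstdint>
#include <string>
#include <variant>

// Illustrative stand-ins, not Binaryen's actual types.
struct Trap {
  std::string reason;
};

template<typename T>
using Result = std::variant<T, Trap>;

// Signal a trap by returning a value, instead of throwing and unwinding.
Result<int32_t> divS(int32_t lhs, int32_t rhs) {
  if (rhs == 0) {
    return Trap{"integer division by zero"};
  }
  return lhs / rhs;
}

// Callers inspect the tag instead of catching:
//   if (auto* trap = std::get_if<Trap>(&result)) { /* propagate */ }
```

The potential win is that signaling "cannot compute" never touches the unwinder.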

MaxGraey avatar Sep 18 '21 06:09 MaxGraey

Another option is to rewrite the simple constant propagation algorithm as Sparse Conditional Constant Propagation (the SCCP pass in LLVM), which should be much faster.

MaxGraey avatar Sep 18 '21 17:09 MaxGraey

@MaxGraey Good idea about precompute-propagate. I checked that now: running precompute instead (which still uses exceptions, but avoids the propagation using LocalGraph) takes just 2-3 seconds. So it looks like exceptions are not a significant factor there.

kripken avatar Sep 19 '21 17:09 kripken

Hi there.

I have some statistics for a 500MB wasm under Windows 10 (sadly, 2x slower than Linux), built with -Os -flto -fomit-frame-pointer:

[PassRunner] running pass: duplicate-function-elimination... 32.0065 seconds.
[PassRunner] running pass: simplify-locals-nostructure... 31.395 seconds.
[PassRunner] running pass: simplify-locals... 34.4849 seconds.
[PassRunner] running pass: dae-optimizing... 33.4361 seconds.
[PassRunner] running pass: inlining-optimizing... 48.7857 seconds.
[PassRunner] running pass: optimize-stack-ir... 27.0179 seconds.
[PassRunner] running pass: asyncify... 1022.96 seconds.
[PassRunner] passes took 1428.02 seconds.

Most passes run single-threaded...

-flto=thin occasionally breaks execution:

wasm-ld: warning: function signature mismatch: ... defined as () -> void in ... defined as (i32, i32, i32, i32, i32) -> void in lto.tmp

With -Os -flto=thin -fomit-frame-pointer:

[PassRunner] running pass: duplicate-function-elimination... 32.5 seconds.
[PassRunner] running pass: simplify-locals-nostructure... 37.1433 seconds.
[PassRunner] running pass: simplify-locals... 42.461 seconds.
[PassRunner] running pass: dae-optimizing... 116.364 seconds.
[PassRunner] running pass: inlining-optimizing... 67.1076 seconds.
[PassRunner] running pass: optimize-stack-ir... 32.2354 seconds.
[PassRunner] running pass: asyncify... 1547.16 seconds.
[PassRunner] passes took 2106.44 seconds.

andrewevstyukhin avatar Sep 28 '21 20:09 andrewevstyukhin

Thanks for the data, @andrewevstyukhin - it looks like by far the largest factor is asyncify:

[PassRunner] running pass: asyncify... 1022.96 seconds.

I think that is because it flattens the IR (which is very costly by itself), does the transform, and then optimizes afterwards to unflatten it. Asyncify will eventually be replaced by stack switching, so this will get resolved that way. But it could be improved a lot meanwhile, I think - currently Flatten generates many more locals than it needs to.

kripken avatar Sep 28 '21 22:09 kripken

Yes, asyncify is heavy. It also doubles the wasm size, though the entropy doesn't grow nearly as much:

1.39.7: wasm-ld 45 357 52 (8 541 334 in 7z) -> wasm-opt 67 428 723 (8 579 733 in 7z)
2.0.14: wasm-ld 44 989 373 (8 374 314 in 7z) -> wasm-opt 69 631 668 (8 717 509 in 7z)

Best time is 481.723 seconds (asyncify 185.695 seconds, 39%).

C++ assertions really hurt recent versions in "release" builds:

1.39.7: passes took 627.178 seconds (running pass: asyncify... 238.017 seconds)
2.0.14: passes took 1428.02 seconds (running pass: asyncify... 1022.96 seconds)

andrewevstyukhin avatar Sep 29 '21 07:09 andrewevstyukhin

asyncify: I wonder, could we have an attribute to mark functions in C++ code, in addition to ASYNCIFY_ADD? That would avoid writing extern "C" and be handier than manual ASYNCIFY_ADVISE.

[asyncify] invoke_iiii is an import that can change the state
[asyncify] std::__2::__function::__value_func<void\20\28bool\29>::~__value_func\28\29 can change the state due to initial scan
[asyncify] bool\20boost::spirit::karma::string_inserterboost::spirit::unused_type\2c\20boost::spirit::unused_type::call<boost::spirit::karma::detail::output_iterator<std::__2::back_insert_iterator<std::__2::basic_string<char\2c\20std::__2::char_traits\2c\20std::2::allocator\20>\20>\2c\20mpl::int<0>\2c\20boost::spirit::unused_type>\2c\20char\20const*>\28boost::spirit::karma::detail::output_iterator<std::__2::back_insert_iterator<std::__2::basic_string<char\2c\20std::__2::char_traits\2c\20std::2::allocator\20>\20>\2c\20mpl::int<0>\2c\20boost::spirit::unused_type>&\2c\20char\20const*\29 can change the state due to

I can try to make an empty function with EMSCRIPTEN_KEEPALIVE, include that function in ASYNCIFY_ADD, and then add calls...

I found that -s ASYNCIFY_ONLY=['*'] breaks execution at legalstub$dynCall_iij.

andrewevstyukhin avatar Oct 01 '21 08:10 andrewevstyukhin

@andrewevstyukhin I know there has been some discussion on the LLVM side about adding a way to annotate things in the source so that they show up in a wasm custom section. We could use that for asyncify, and also things like noinline. But I don't think anyone is working on it atm. I don't follow the LLVM side super-closely, though.

kripken avatar Oct 06 '21 20:10 kripken

I have doubts about annotations nowadays... Last week I got good enough settings just using ASYNCIFY_REMOVE + ASYNCIFY_ADD in 2.0.14. What would be helpful from my point of view:

  1. ASYNCIFY_ONLY and the previous ASYNCIFY_WHITELIST don't work with a big codebase. The simple setting * breaks calls.
  2. Templates<> and even spaces are impossible to pass on the command line - maybe @ could help for reading settings from a file.
  3. std::function and [](){} can only become configurable with a response file. They are handy things from modern C++.
  4. Maybe multiple passes of remove & add could allow precise configuration in hard cases.
  5. ASYNCIFY_IGNORE_INDIRECT breaks a big codebase; I'm still unsure why. I want to better configure dynCalls and ignore the disabled emscripten_sleep in SDL2.
  6. A robust pattern: remove _*,A*-Z*,a*-z* but keep d*,e*,i*,l* untouched to support internal emscripten things (dynCall*, emscripten_, invoke_, legal*).

I wish to use C++20 coroutines instead of ASYNCIFY in a future codebase. Asyncify consumes 40% of wasm-opt time.

andrewevstyukhin avatar Oct 06 '21 22:10 andrewevstyukhin

@andrewevstyukhin

For the user interface stuff, maybe open a new issue on this in emscripten?

I wish to use C++20 coroutines instead of ASYNCIFY in future codebase.

That is definitely recommended - Asyncify causes a large amount of overhead, and we hope to not need it once people can use either C++20 coroutines, or a future wasm stack switching proposal.

kripken avatar Oct 12 '21 00:10 kripken

Note: the output above, stuff like

[PassRunner]   running pass: precompute-propagate...           14.23460 seconds.

is emitted by BINARYEN_PASS_DEBUG=1 wasm-opt --no-validation. Pass-debug mode enables the timing, and disabling validation avoids it also validating after each pass (which can be slow, and is not what we're focused on in this issue).
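For reference, a full invocation then looks like this (input/output file names illustrative):

```sh
BINARYEN_PASS_DEBUG=1 wasm-opt --no-validation -O3 input.wasm -o output.wasm
```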

kripken avatar Apr 27 '22 15:04 kripken

FWIW, I did an evening of profiling for our use case (i.e. linking the main .wasm of the Wonderland Engine runtime) and found that using the following structure instead of the 6 std::set in "EffectsAnalyzer"...

fake_set (a std::vector with the interface of a set)

Disclaimer: this is very ugly hacky code, just to prove a point.

```cpp
#include <vector>

namespace std {

// Hack: a vector that mimics the std::set interface. Unlike std::set on the
// MSVC STL, it does no heap allocation in the default constructor; insert,
// erase and count are linear, which is fine for the tiny sets seen here.
template<typename T>
struct fake_set : vector<T> {

    fake_set() {}

    void insert(const T& e) {
        this->push_back(e);
    }

    void erase(const T& e) {
        // Walk backwards so erasing doesn't shift unvisited elements.
        for (int i = int(this->size()) - 1; i >= 0; --i) {
            auto it = this->begin() + i;
            if (*it == e) std::vector<T>::erase(it);
        }
    }

    size_t count(const T& k) const {
        size_t c = 0;
        for (const auto& e : *this) {
            if (e == k) ++c;
        }
        return c;
    }
};

} // namespace std
```

... yields an almost 25-30% improvement for our use case on Windows, built with MSVC 2019 in the RelWithDebInfo config:

$ ./run-wasm-opt.sh
wasm-opt:

real    0m30.060s
user    0m0.000s
sys     0m0.015s
wasm-opt-fake_set:

real    0m21.832s
user    0m0.015s
sys     0m0.015s

Squareys@[...] /c/Repos/binaryen (main)
$ diff output-wasm-opt.wasmstripped output-wasm-opt-fake_set.wasmstripped

Squareys@[...] /c/Repos/binaryen (main)
$

Explanation: the sets are usually empty (or seem to have few elements), and while EffectsAnalyzer is always constructed on the stack (e.g. for SimplifyLocals), std::set in the MSVC STL seems to always do a heap allocation in its default constructor, which takes about 90% of the time in that pass.

This is, of course, very scrappy "benchmarking" (if one can even call it that), with no testing apart from the output diff. So take it with many grains of salt, but I think the main takeaway could still be that red-black-tree based containers (e.g. std::map/std::set) allocate in the default constructor, and for few elements it might be worth using a simpler container.

Similar optimizations might apply in other locations of the code.

The execution times were checked as follows:

```sh
echo "wasm-opt:"
time ./build-ninja/bin/wasm-opt --strip-dwarf --post-emscripten -O3 --low-memory-unused --zero-filled-memory --strip-debug --strip-producers WonderlandRuntime-loader-physx.wasm -o WonderlandRuntime-loader-physx.wasmstripped --mvp-features --enable-mutable-globals
mv WonderlandRuntime-loader-physx.wasmstripped output-wasm-opt.wasmstripped

echo "wasm-opt-fake_set:"
time ./wasm-opt-fake_set --strip-dwarf --post-emscripten -O3 --low-memory-unused --zero-filled-memory --strip-debug --strip-producers WonderlandRuntime-loader-physx.wasm -o WonderlandRuntime-loader-physx.wasmstripped --mvp-features --enable-mutable-globals
mv WonderlandRuntime-loader-physx.wasmstripped output-wasm-opt-fake_set.wasmstripped
```
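As an aside, whether a given STL's std::set default constructor allocates can be checked with a tiny standalone probe like this one (not part of the measurements above; it just counts calls to the global operator new):

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <new>
#include <set>

static std::size_t allocs = 0;

// Count every allocation the STL makes through the global operator new.
void* operator new(std::size_t n) {
  ++allocs;
  if (void* p = std::malloc(n)) {
    return p;
  }
  throw std::bad_alloc();
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

int main() {
  std::size_t before = allocs;
  { std::set<int> s; } // default-construct and destroy an empty set
  std::printf("allocations for an empty std::set: %zu\n", allocs - before);
}
```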

Squareys avatar Jul 03 '22 21:07 Squareys

Could you try small_set in EffectsAnalyzer instead of your custom fake_set?

MaxGraey avatar Jul 04 '22 05:07 MaxGraey

@MaxGraey small_set calls the default constructor of std::set, so it wouldn't remove the heap allocation 🤔

(That the heap allocation is the culprit was found through a sampling profiler ("Superluminal"); it's not the count() or insert() functions that are taking the time here. In most cases the set is empty. The difference with std::vector is that it does not allocate in the default constructor for 0 elements.)

Squareys avatar Jul 04 '22 13:07 Squareys

Hmm, in this case I guess it makes sense to use small_sets instead of std::sets, but init/alloc them lazily (by wrapping them in std::unique_ptr, for example) in EffectsAnalyzer.

MaxGraey avatar Jul 04 '22 13:07 MaxGraey

@MaxGraey Definitely, that would probably also improve other code that already uses small_set 👍

There was also a bunch of heap allocations around BasicBlock, btw; I think the factory function there always allocates on the heap.

Is the change to small_set, and the use of small_set in EffectsAnalyzer, something that would be accepted as a PR? (@kripken?) I can't promise to get to it too soon, but since the change could be fairly high impact, I think it'd be great to do (happy to sponsor it otherwise, if someone's up for that).

Squareys avatar Jul 04 '22 15:07 Squareys

Also, instead of std::set<Index> or small_set<Index>, it would probably be better to use bit-packed sets like std::bitset / dynamic_bitset, which should be faster. Perhaps create a small_bitset.
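A sketch of that bit-packing idea for a dense set of local indices (the fixed capacity of 256 is just for illustration; a real small_bitset would need to grow):

```cpp
#include <bitset>

// One bit per local index: no nodes, no heap allocation, O(1) operations.
std::bitset<256> usedLocals;

void markUsed(unsigned index) { usedLocals.set(index); }
bool isUsed(unsigned index) { return usedLocals.test(index); }
```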

MaxGraey avatar Jul 04 '22 16:07 MaxGraey

@Squareys The MSVC STL does an allocation in the default constructor of std::set? That's surprising...

It might be worth measuring your code on another STL, just as a comparison point. But, even if it just speeds up MSVC that might be enough of a reason to do this - 25%+ is a lot!

As for how to do this, I am a little worried about the fake-set approach since it has linear behavior on large sizes. Those large sizes are not happening on your testcase I guess, but in general they are possible. Some possible options:

  • A limited-size fake-set that aborts if the size grows beyond a point. That would only be usable on things that have a known maximum size, or where the optimizer knows that beyond some size it will bail out anyhow.
  • Wrap the std::set in SmallSet with std::optional. Then its constructor is only called when actually used. But, that might have a cost. (A sketch of this option follows the list.)
  • Avoid using the system STL. We could just bundle libc++ and get uniform behavior everywhere.
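A minimal sketch of the second option, wrapping the spill-to-heap set in std::optional so its constructor (and any allocation it does) only runs past the inline capacity (LazySmallSet and its members are illustrative, not the actual SmallSet internals):

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <set>

template<typename T, std::size_t N>
struct LazySmallSet {
  std::array<T, N> fixed{};            // inline storage, no heap allocation
  std::size_t used = 0;
  std::optional<std::set<T>> flexible; // constructed only on overflow

  void insert(const T& x) {
    if (flexible) {
      flexible->insert(x);
    } else if (used < N) {
      for (std::size_t i = 0; i < used; i++) {
        if (fixed[i] == x) return; // already present
      }
      fixed[used++] = x;
    } else {
      // Spill: move the inline elements into a real set, then insert.
      flexible.emplace(fixed.begin(), fixed.begin() + used);
      flexible->insert(x);
      used = 0;
    }
  }

  std::size_t count(const T& x) const {
    if (flexible) {
      return flexible->count(x);
    }
    for (std::size_t i = 0; i < used; i++) {
      if (fixed[i] == x) return 1;
    }
    return 0;
  }
};
```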

kripken avatar Jul 06 '22 17:07 kripken

The MSVC STL does an allocation in the default constructor of std::set? That's surprising...

@kripken Yes, the stack trace is as follows, with links to the Microsoft STL repository:

I honestly have trouble reading the style of the code there, though, so please point it out in case I'm misinterpreting things. One way or another: the allocations were found through profiling with Superluminal, so they are certainly happening.

I'll attempt to contribute the SmallSet change; it should be fairly straightforward. An alternative could be a custom allocator that stores the first allocated value as a member and copies that into the first larger allocation, but that's really complex compared to the SmallSet.

Squareys avatar Aug 06 '22 19:08 Squareys

I found another interesting bottleneck, where all threads seem to be blocked by I/O in one of the threads, a millisecond or more at a time, alternating between all threads being saturated and this entirely blocked state:

[profiler screenshot]

Interestingly, the "I/O" here is trap("unreachable"), which throws an exception, and the catching seems to cause a massive stall in the entire application. I'm unsure if there's a more hidden reason an exception needs to be thrown here; since it's converted to a Flow(NONCONSTANT_FLOW) later anyway, returning that instead should be totally valid and sufficient as far as I can tell.

I'll try to apply this change and let you know how that goes.

Squareys avatar Aug 07 '22 11:08 Squareys

Yeah, exceptions definitely cause perf penalties. See this comment.

MaxGraey avatar Aug 07 '22 11:08 MaxGraey

Replacing the trap() (that throws the exception) with return Flow(NONCONSTANT_FLOW) seems to work perfectly fine and removes all those stalls.
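A sketch of the shape of that change; the Flow and Name stand-ins below just make the snippet self-contained and only mimic Binaryen's real types:

```cpp
#include <string>
#include <utility>

// Stand-ins so the sketch compiles on its own; Binaryen's Flow/Name differ.
struct Name {
  std::string str;
};
struct Flow {
  Flow() = default;
  explicit Flow(Name breaking) : breakTo(std::move(breaking)) {}
  Name breakTo; // non-empty marks a special, non-value result
};
static const Name NONCONSTANT_FLOW{"nonconstant"}; // illustrative marker

// Before: trap("unreachable") threw an exception that was caught far up the
// stack, only to be converted into this same marker.
// After: return the marker directly; no unwinding on this hot path.
Flow visitUncomputable() {
  return Flow(NONCONSTANT_FLOW);
}
```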

Squareys avatar Aug 07 '22 12:08 Squareys

Finally, it seems like there are a couple of chunks of 1 second or more where all the threads wait for work to complete on one of the workers. I'm investigating whether there's a way to make that faster or balance the workload better: [profiler screenshot]

It looks like a significant portion of time is spent in a heap allocation in doVisitLocalGet in LocalGraph.cpp while emplacing an element in the self->locations map.

=> Allocating LocalGraph::locations upfront in the constructor (and changing it from std::map to std::unordered_map in the process; unsure if that is actually valid to do), sized to the number of gets and sets expected, seems to reduce these stalls by ~30%. I did this by creating a CountAll analogous to FindAll, running it on func->body for set and get in MergeLocals, and passing the count via a new parameter to LocalGraph to reserve locations.
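A sketch of such a CountAll, modeled on Binaryen's FindAll walker (the header path and base-class names are from memory, so double-check them against the codebase):

```cpp
#include "wasm-traversal.h" // assumed location of PostWalker et al.

namespace wasm {

// Count nodes of a given type without materializing a vector of pointers,
// so the result can be used to reserve hashmap capacity upfront.
template<typename T>
struct CountAll {
  size_t count = 0;

  CountAll(Expression* ast) {
    struct Counter
      : public PostWalker<Counter, UnifiedExpressionVisitor<Counter>> {
      size_t count = 0;
      void visitExpression(Expression* curr) {
        if (curr->is<T>()) {
          count++;
        }
      }
    };
    Counter counter;
    counter.walk(ast);
    count = counter.count;
  }
};

} // namespace wasm

// Usage sketch, reserving before LocalGraph fills its locations map:
//   locations.reserve(CountAll<LocalGet>(func->body).count +
//                     CountAll<LocalSet>(func->body).count);
```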

Happy to open a PR for this and the exception change, if you determine it valuable.

Squareys avatar Aug 07 '22 12:08 Squareys

Replacing the trap() (that throws the exception) with return Flow(NONCONSTANT_FLOW) seems to work perfectly fine and removes all those stalls.

I think that's a good idea. One thing that will need to be done though is to handle it in the fuzzer, I think: we need to log such traps (so we can compare them). But I think that might be a small addition in execution-results.h. A PR would be great!

The LocalGraph change sounds very interesting as well! I'd need to read the PR carefully but I think it can work.

kripken avatar Aug 08 '22 16:08 kripken

Amazing, I'll have time to create those PRs tomorrow evening CEST then!

Squareys avatar Aug 08 '22 20:08 Squareys

While not throwing an exception saves a few milliseconds and makes the profiles look much nicer, it's not really significant compared to the 16000-20000 milliseconds the entire wasm-opt run is taking. I wonder if it's worth risking potentially breaking something here. I'll create a PR, but feel free to discard it if you see any risk.

Squareys avatar Aug 09 '22 15:08 Squareys

And the preallocation, while in theory it should be an improvement, is not significant enough to escape the variance in the timing results 🤔

Squareys avatar Aug 09 '22 16:08 Squareys

After some more hours of investigation, I found another thing: Function in wasm.h uses 7 std::set and std::unordered_map members, and for my case it gets allocated 4799 times (potentially more often; I counted by incrementing a static counter in the destructor, not sure if that's thread safe).

That's around 4799 function instantiations * 7 heap allocations * 0.15 ms/allocation = 5 seconds of cost for the default constructors alone (parallelized on multiple threads, though).

Printing the sizes of the sets and maps in the destructor shows that they are always empty? I saw some uses in the code though, so I'm not sure what's going on here. They might just never be hit for this binary.

Anyway, the option of replacing the Microsoft STL seems more and more useful to me 😅

Squareys avatar Aug 09 '22 17:08 Squareys

In https://github.com/WebAssembly/binaryen/issues/4261 the idea came up to use a more efficient hashmap, which could be smaller and faster than the default stdlib one that we use atm. That would be great to look into.

kripken avatar Oct 11 '23 16:10 kripken

#6042 has some notes on LocalGraph improvements that could help a particularly large testcase.

kripken avatar Oct 23 '23 20:10 kripken