Reduce threading overhead
Our threading overhead seems significant. When I measure a fixed, purely computational workload (replacing the body of a pass like precompute with some busywork, then measuring with `time`), the user time is the same with BINARYEN_CORES=1 (use 1 core) as when running normally on all cores. That makes sense, since `user` adds up the total actual work across cores, and the work is the same; it also shows there isn't much synchronization overhead slowing us down in that artificial case.
But that's not what happens when running real passes: there, the multi-core user time can be much higher, see e.g. https://github.com/WebAssembly/binaryen/pull/2733#issuecomment-611246791, and I see similar things locally, with user time 2-3x larger when using 8 threads.
This may be a large speedup opportunity. One possibility is that we often have many tiny functions, and maybe switching between them is costly. Or maybe there is contention on locks (see that last link, though this happens even after that PR, which should have gotten rid of it).
The thread-pool code that runs passes on functions is here: https://github.com/WebAssembly/binaryen/blob/dc5a503c4d54dc71ab46535c1966540785562dd7/src/passes/pass.cpp#L591
Some TODOs along this vein:
- [ ] Store function parameter types separately rather than all together as a tuple
- [ ] Investigate the performance impact of having `Type` contain a `SmallVec` rather than an index
As multivalue becomes common, we will also want to:
- [ ] Investigate thread-local caching of commonly accessed types
Measuring with `perf stat` after https://github.com/WebAssembly/binaryen/pull/2745, things look a lot better. The user time is still higher than expected, but after investigating with perf I suspect that number might be slightly misleading (or maybe perf is wrong...).
It does seem, though, that we gain almost nothing from using all of the reported system cores versus half of them. My guess is that hyperthreading doesn't really help us, since we are very CPU-bound (no I/O to wait on, and we are cache-friendly, with small data structures and running as many passes as possible on a single function before moving on to the next). But I'm not sure we can do anything about that.
FWIW, still seeing huge overhead with the v105 release:
```shell
$ getconf _NPROCESSORS_ONLN
72
$ unset BINARYEN_CORES
$ time ./binaryen-version_105/bin/wasm-opt -O2 test -o test.wasm

real    0m21.541s
user    1m5.779s
sys     19m53.180s
$ export BINARYEN_CORES=1
$ time ./binaryen-version_105/bin/wasm-opt -O2 test -o test.wasm

real    0m8.487s
user    0m8.292s
sys     0m0.199s
$ du -h test test.wasm
2.4M    test
1.4M    test.wasm
```