Massive memory growth for -O3, -O4, -Os and -Oz in ssa-nomerge
Hi. I am planning to apply wasm-opt to my Go project as part of https://github.com/Jacalz/hegelmote/pull/30. Go WASM files are generally quite big, and our toolkit is not very optimised for WASM yet, so I wanted to optimise the binary. To see which level made the most sense, I ran all of the optimisation levels to compare how long they take and how they affect binary size. Unfortunately, -O1 and -O2 were the only usable ones: -O3, -O4, -Os and -Oz all consume more and more memory until all 32 GB are full. I have attached the file below if you wish to try to reproduce it.
Results:
- Baseline: 37.3 MB
- O1: 35.7 MB, 29.85 s
- O2: 35.7 MB, 32.78 s
- O3: DNF
- O4: DNF
- Os: DNF
- Oz: DNF
Running:
wasm-opt Hegelmote.wasm --enable-bulk-memory-opt -O3 -o O3.wasm
Potentially related to https://github.com/WebAssembly/binaryen/issues/3646.
Reference file: hegelmote-wasm.zip
I think this problem likely applies to any app built with Fyne: https://github.com/fyne-io/fyne
Looking at the output with BINARYEN_PASS_DEBUG=1 in the environment, ssa-nomerge is very slow and uses over 5 GB even when running on a single core. The function it happens on is github.com_yuin_goldmark_util.map.init.0. That must be a massive function, but the module is difficult to print; for example, wabt errors out:
$ ~/Dev/wabt/build/wasm2wat Hegelmote.wasm --enable-all -o a.wat
error: label stack exceeds max nesting depth
0b40c8c: error: OnBlockExpr callback failed
We could add arbitrary limits somehow (as discussed in https://github.com/WebAssembly/binaryen/issues/3646) but I'm not sure what the best ones would be.
Using just -O2 might be best for now. Note that you can add more passes manually: e.g. -O2 --precompute-propagate -O2 runs that one pass after -O2 (it is normally run only in -O3), and then optimizes again after it as well. Other passes you can try are optimize-added-constants-propagate, merge-locals, etc.; see pass.cpp, which has checks like options.optimizeLevel >= 3 (i.e. -O3 and above).
Thanks for the detailed analysis of the issue. Nice find regarding that function; I started digging into the package (github.com/yuin/goldmark is the markdown parser we are using in the toolkit), and it made me notice that one of the latest releases contained this commit: https://github.com/yuin/goldmark/commit/6951ff3bd1825c85dc9a0197ff9b33825588c26f
I updated to the latest version of the dependency, and all of the optimisation levels in wasm-opt now work as expected again. Should this be closed, then, or does it perhaps make sense to fix the memory leak (or whatever it was)? It made my computer crash last time, so I feel like it is something that shouldn't be happening.
Updated binary for reference: hegelmote-wasm-fixed.zip
Thanks for the update, good to know things work now.
I agree things shouldn't crash like that, but I'm not sure what we can do. The problem isn't a memory leak, but very large memory usage - we could put some arbitrary limit, but other people might have more than enough memory and not want to be limited that way.
Ideally the OS would not crash on an OOM like this, but as a Linux user I've seen that happen now and then, unfortunately, in various apps...
Let's keep this open for more ideas here. I'll rename.
Happy to help. Maybe I should clarify that it wasn't directly a crash, but more that the OOM daemon (or whatever Fedora uses) wasn't killing it, so everything ground to a halt as more and more swap space was requested. I think I saw 8 GB of swap being used on top of the 32 GB of main memory :)
I agree that setting a hard limit on memory usage is a bad idea. If it isn't a memory leak, is it an algorithm that is O(n^2) in memory complexity, or something like that? I feel like using 32 GB+ to process a ~37 MB file is a bit excessive however you look at it. Perhaps there is a way to rewrite that pass to not allocate as much memory?
Yeah, I've seen similar OOM behavior on Ubuntu, sometimes forcing a reboot...
I suspect it is O(n^2) in the worst case, yes. The pass computes the "influences" of sets on gets, that is, which gets read from which set. Commonly there are few such interactions, but if there are many, it is quadratic.
In terms of raw numbers, that function has 6,125 locals, 46,041 gets, and 24,492 sets, so it is a very extreme case...
In theory the pass could be rewritten to work "online", and not compute all locals first, but it does consider interactions as well (merges) so I'm not sure offhand how. But let's keep the issue open for more investigation.
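To make the quadratic behavior concrete, here is a toy model in Python. This is not Binaryen's actual implementation or data structures, just a hypothetical sketch of the "influences" idea described above: each local.get records which local.sets may reach it, and when many sets can reach many gets, the table grows as sets × gets.

```python
from collections import defaultdict

def compute_influences(events):
    """events: sequence of ('set', local) or ('get', local).

    To model branchy code where earlier sets stay reachable (e.g. sets
    inside ifs whose outcome is unknown), every prior set of a local is
    kept live rather than being overwritten.

    Returns: get event index -> set of set event indices that may reach it.
    """
    live_sets = defaultdict(set)   # local -> indices of live sets
    influences = defaultdict(set)  # get index -> reaching set indices
    for i, (kind, local) in enumerate(events):
        if kind == 'set':
            live_sets[local].add(i)
        else:
            influences[i] |= live_sets[local]
    return influences

# 100 conditional sets followed by 100 gets of the same local:
# the table ends up holding 100 * 100 = 10,000 entries.
events = [('set', 0)] * 100 + [('get', 0)] * 100
table = compute_influences(events)
total = sum(len(s) for s in table.values())
print(total)  # 10000
```

At this function's scale (24,492 sets and 46,041 gets), a similarly dense table could in the worst case approach a billion entries, which would plausibly account for tens of gigabytes.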
A point of interest:
$ bin/wasm-opt ~/Downloads/hegelmote-wasm/Hegelmote.wasm --extract-function=github.com_yuin_goldmark_util.map.init.0 --metrics --no-validation -O1 --metrics --ssa-nomerge --metrics
extracting github.com_yuin_goldmark_util.map.init.0
Metrics
total
[exports] : 1
[funcs] : 1
[globals] : 2
[imports] : 16943
[memories] : 1
[memory-data] : 17782489
[tables] : 0
[tags] : 0
[total] : 323974
[vars] : 6125
Binary : 10719
Block : 15308
Break : 6122
Call : 6122
Const : 132271
GlobalGet : 6124
GlobalSet : 4595
If : 3061
Load : 13775
LocalGet : 46041
LocalSet : 24492
Loop : 1
Return : 1
Store : 23079
Switch : 1
Unary : 32261
Unreachable : 1
Metrics
total
[exports] : 1
[funcs] : 1
[globals] : 1 -1
[imports] : 7 -16936
[memories] : 1
[memory-data] : 17782470 -19
[tables] : 0
[tags] : 0
[total] : 293359 -30615
[vars] : 2 -6123
Binary : 10719
Block : 9186 -6122
Break : 6122
Call : 6122
Const : 132271
GlobalGet : 6123 -1
GlobalSet : 4595
If : 3061
Load : 13775
LocalGet : 41450 -4591
LocalSet : 22959 -1533
Loop : 1
Return : 1
Store : 23079
Switch : 1
Unary : 13893 -18368
Unreachable : 1
That focuses on this one function ("extracting" it) by turning everything else into imports. -O1 then removes almost all vars: only 2 remain, 6,123 are removed, but we do not remove nearly as many gets and sets, and the speed of the later ssa-nomerge is no better. So the problem is not the number of locals, but that there are very many gets and sets, with lots of overlap between them, in a huge CFG (10K blocks, 3K ifs).
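That last point, that removing locals doesn't help, can be illustrated with another toy model (again hypothetical, not Binaryen's code): the (set, get) interaction count depends on how many sets and gets there are and how they overlap, not on how many distinct locals they use, so coalescing locals can only keep or increase the overlap.

```python
from collections import defaultdict

def interaction_pairs(events):
    """Count (set, get) pairs on the same local where the set precedes the get."""
    sets_seen = defaultdict(int)
    pairs = 0
    for kind, local in events:
        if kind == 'set':
            sets_seen[local] += 1
        else:
            pairs += sets_seen[local]
    return pairs

# Two locals, used independently:
before = [('set', 0), ('get', 0), ('set', 1), ('get', 1)]
# The same code after coalescing both locals into local 0:
after = [(kind, 0) for kind, _ in before]

print(interaction_pairs(before))  # 2
print(interaction_pairs(after))   # 3: coalescing increased the overlap
```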