binaryen icon indicating copy to clipboard operation
binaryen copied to clipboard

Massive memory growth for -O3, -O4, -Os and -Oz, in ssa-nomerge

Open Jacalz opened this issue 7 months ago • 7 comments

Hi. I am planning on applying wasm-opt to my Go project as part of https://github.com/Jacalz/hegelmote/pull/30. The Go WASM files are generally quite big and our toolkit is not super optimised for WASM yet so I wanted to optimise it. As I was interested in seeing which level made the most sense, I aimed to run all of the levels to see how long they take and how they affect the binary size. What I unfortunately found out was that O1 and O2 were the only ones that were usable as -O3, -O4, -Os and -Oz all seem to use memory exponentially until all 32 GB are full. I have attached the file below if you wish to try and reproduce it.

Results:

  • Baseline: 37.3 MB
  • O1: 35.7 MB, 29.85 s
  • O2: 35.7 MB, 32.78 secs
  • O3: DNF
  • O4: DNF
  • Os: DNF
  • Oz: DNF

Running:

wasm-opt Hegelmote.wasm --enable-bulk-memory-opt -O3 -o O3.wasm

Potentially related to https://github.com/WebAssembly/binaryen/issues/3646.

Reference file: hegelmote-wasm.zip

Jacalz avatar Jun 07 '25 21:06 Jacalz

I think this problem likely applies to any app built with Fyne: https://github.com/fyne-io/fyne

Jacalz avatar Jun 08 '25 21:06 Jacalz

Looking at the output with BINARYEN_PASS_DEBUG=1 in the env, ssa-nomerge is very slow, and uses over 5 GB even when using a single core. The function that happens on is github.com_yuin_goldmark_util.map.init.0. That must be a massive function but the module is difficult to print, e.g. wabt errors:

$ ~/Dev/wabt/build/wasm2wat Hegelmote.wasm --enable-all -o a.wat
error: label stack exceeds max nesting depth
0b40c8c: error: OnBlockExpr callback failed

We could add arbitrary limits somehow (as discussed in https://github.com/WebAssembly/binaryen/issues/3646) but I'm not sure what the best ones would be.

Using just -O2 might be best for now. Note that you can add more passes manually, e.g. -O2 --precompute-propagate -O2 runs one pass after -O2 (that pass is normally run only in -O3), and optimize after it as well. Other passes you can try are optimize-added-constants-propagate, merge-locals, etc., see pass.cpp which has checks like options.optimizeLevel >= 3 (-O3+).

kripken avatar Jun 09 '25 18:06 kripken

Thanks for the detailed analysis of the issue. Nicely found regarding that function; I started digging into the package (github.com/yuin/goldmark is the markdown parser we are using in the toolkit) and it made me notice that one of the latest releases contained this commit: https://github.com/yuin/goldmark/commit/6951ff3bd1825c85dc9a0197ff9b33825588c26f

I updated to the latest version of the dependency and all of the build levels in wasm-opt now work as expected again. Should this be closed then or does it perhaps make sense to fix the memory leak or whatever it was? It made my computer crash last time so I kind of feel like it is something that shouldn’t be happening.

Updated binary for reference: hegelmote-wasm-fixed.zip

Jacalz avatar Jun 09 '25 20:06 Jacalz

Thanks for the update, good to know things work now.

I agree things shouldn't crash like that, but I'm not sure what we can do. The problem isn't a memory leak, but very large memory usage - we could put some arbitrary limit, but other people might have more than enough memory and not want to be limited that way.

Ideally the OS would not crash on an OOM like this, but as a Linux user I've seen that happen now and then, unfortunately, in various apps...

Let's keep this open for more ideas here. I'll rename.

kripken avatar Jun 09 '25 20:06 kripken

Happy to help. Maybe I should clarify that it wasn't directly a crash but more like the oom-daemon (or whatever Fedora uses) not killing it so everything grinded to a halt as more and more swap space was requested. I think I saw 8GB swap being used on top of the 32GB main memory :)

I agree that setting a limit on memory usage is a bad idea. If it isn't a memory leak, is it an algorithm that is O(n^2) in memory complexity or something like that? I feel like using 32GB+ to parse a ~37MB file is a bit excessive how ever you look at it. Perhaps there is a way to rewrite that pass to not allocate as much memory?

Jacalz avatar Jun 09 '25 21:06 Jacalz

Yeah, I've seen similar OOM behavior on Ubuntu, sometimes forcing a reboot...

I suspect it is O(n^2) in the worst case, yes. The pass computes the "influences" of sets on gets, that is, which gets read from which set. Commonly there are few such interactions, but if there are many, it is quadratic.

In terms of raw numbers, that function has 6,125 locals, 46,041 gets, and 24,492 sets, so it is a very extreme case...

In theory the pass could be rewritten to work "online", and not compute all locals first, but it does consider interactions as well (merges) so I'm not sure offhand how. But let's keep the issue open for more investigation.

kripken avatar Jun 10 '25 16:06 kripken

A point of interest:

$ bin/wasm-opt ~/Downloads/hegelmote-wasm/Hegelmote.wasm --extract-function=github.com_yuin_goldmark_util.map.init.0 --metrics --no-validation -O1 --metrics --ssa-nomerge --metrics
extracting github.com_yuin_goldmark_util.map.init.0
Metrics
total
 [exports]      : 1       
 [funcs]        : 1       
 [globals]      : 2       
 [imports]      : 16943   
 [memories]     : 1       
 [memory-data]  : 17782489
 [tables]       : 0       
 [tags]         : 0       
 [total]        : 323974  
 [vars]         : 6125    
 Binary         : 10719   
 Block          : 15308   
 Break          : 6122    
 Call           : 6122    
 Const          : 132271  
 GlobalGet      : 6124    
 GlobalSet      : 4595    
 If             : 3061    
 Load           : 13775   
 LocalGet       : 46041   
 LocalSet       : 24492   
 Loop           : 1       
 Return         : 1       
 Store          : 23079   
 Switch         : 1       
 Unary          : 32261   
 Unreachable    : 1       
Metrics
total
 [exports]      : 1       
 [funcs]        : 1       
 [globals]      : 1             -1
 [imports]      : 7         -16936
 [memories]     : 1       
 [memory-data]  : 17782470     -19
 [tables]       : 0       
 [tags]         : 0       
 [total]        : 293359    -30615
 [vars]         : 2          -6123
 Binary         : 10719   
 Block          : 9186       -6122
 Break          : 6122    
 Call           : 6122    
 Const          : 132271  
 GlobalGet      : 6123          -1
 GlobalSet      : 4595    
 If             : 3061    
 Load           : 13775   
 LocalGet       : 41450      -4591
 LocalSet       : 22959      -1533
 Loop           : 1       
 Return         : 1       
 Store          : 23079   
 Switch         : 1       
 Unary          : 13893     -18368
 Unreachable    : 1       

That focuses on this one function ("extracting" it) by turning the rest into imports. -O1 then removes almost all vars: only 2 remain, 6,123 are removed, but we do not remove that many gets and sets. And the speed on the later ssa-nomerge is no better. So the problem is not the number of locals, but that there are very many gets and sets, with lots of overlap between them, in a huge CFG (10K blocks, 3K ifs).

kripken avatar Jun 10 '25 16:06 kripken