binaryen
binaryen copied to clipboard
wasm-opt: empty function is not inlined/removed
The code example:
#include <stdint.h>
#include <stdio.h>
struct a {
int8_t b
};
struct c {
int16_t b
} d = {13015};
int32_t g_3[4];
uint16_t e = 57582;
struct c f;
int32_t *const g(int32_t *, int32_t, int32_t *);
static struct a func_14(int32_t *, int32_t *, int32_t *, struct c, int32_t *);
int64_t h() {
int32_t i;
int64_t m = 113;
int32_t j = &g_3[2];
g((func_14(g_3, i, 0, d, g_3), i), m, j);
}
int32_t *const g(int32_t *, int32_t k, int32_t *l) {
int32_t r;
*l = k + e;
*l = (func_14(0, 0, r, f, g_3), 0);
}
struct a func_14(int32_t *n, int32_t *o, int32_t *p, struct c q, int32_t *t) {}
void main() {
printf("hello\n");
int s = 0;
h();
printf("g_3: %d %d %d %d\n", g_3[0], g_3[1], g_3[2], g_3[3]);
}
Version of wasm-opt:
wasm-opt version 109 (version_109-8-gbab21052c)
wasm-opt option:
wasm-opt --post-emscripten -O3 --low-memory-unused --zero-filled-memory --strip-producers -g --mvp-features --converge --inlining-optimizing --local-cse --code-folding --licm --rse --precompute-propagate --optimize-added-constants-propagate
func_14 after wasm-opt optimzation passes:
(func $func_14 (type 7) (param i32 i32 i32 i32 i32) (result i32)
(local i32)
global.get $__stack_pointer
i32.const 32
i32.sub
local.tee 5
local.get 3
i32.store16 offset=16
local.get 5
local.get 0
i32.store offset=12
local.get 5
local.get 1
i32.store offset=8
local.get 5
local.get 2
i32.store offset=4
local.get 5
local.get 4
i32.store
local.get 5
i32.load8_u offset=24)
Problem Description:
I try to compile the above code with emcc -O0 -g
and then optimize the generated wasm code with wasm-opt
, and I expect the empty function func_14
will be inlined/optimized. wasm-opt
seems to fail to remove or inline this function even when --inlining-optimizing
is enabled. Not sure if this is a feature of wasm optimizer or an under-optimization problem.
func_14
return struct which emulated via shadow stack and linear memory. In wasm store
/ load
operations can't be removed due to side effects (may potentially produce a OOB trap).
Thanks for answering!
In wasm
store
/load
operations can't be removed due to side effects
This is interesting to me. Does this mean all store
/load
instructions in wasm code are actually non-optimizable? Or it may cause additional risk once these instructions are optimized?
Also, as for the func_14
in the above code example, $__stack_pointer
is read but not written. In my understanding, the value is stored outside the stack and never used. The return value of this function is also not used:
...
call $func_14
local.get 2
i32.load offset=20
local.set 12
...
call $func_14
local.get 5
i32.load offset=16
local.set 18
...
Besides, If the C source code is compiled with emcc -O3
, func_14
will be inlined and optimized. Thus, I think there may be potential optimization opportunities with this function?
Also, as for the func_14 in the above code example, $__stack_pointer is read but not written
The result's of global.get $__stack_pointer
used to calculate value which stores in i32.store16
. It's just one pulling the other. But the problem here is also that the function itself is quite large and cannot be inlined and therefore result cannot be wrapped in "drop" afterwards, which prevents the optimizer from removing / cleanup some dead code.
Ah btw in all call sites result of func_14
always use
Does this mean all store/load instructions in wasm code are actually non-optimizable?
Dead loads and stores could be eliminated probably with --traps-never-happen
. But which may produce some other issues
Not sure if I understood you correctly, but I think the result of func_14
is not used in all call sites, e.g., *l = (func_14(0, 0, r, f, g_3), 0);
is equivalent to func_14(0, 0, r, f, g_3); *l = 0;
.
I understand that the function inlining depends on some heuristics and the size of a function may prevent it from being inlined. But since all instructions in func_14
have no effect on the result of execution and thus could be optimized safely, I wonder if it is possible to improve the wasm optimizer by combining more advanced heuristics/techniques across different optimization passes (dead code elimination or OptimizeInstructions)?
I see. It's *l = (func_14(0, 0, r, f, g_3), 0)
. It seemed to me that *l = func_14(0, 0, r, f, g_3)
.
Also, how does wasm-opt defines whether an instruction is dead or not? #4947 as shown in this example, some instructions that are considered dead code by traditional C compilers may not be optimized away by wasm-opt.
https://github.com/WebAssembly/binaryen/issues/4947 as shown in this example, some instructions that are considered dead code by traditional C compilers may not be optimized away by wasm-opt.
LLVM doesn't use global explicitly. It stores such global in linear memory and uses load
/ store
ops instead global.get
/ global.set
. See: https://godbolt.org/z/r96vW9c5d
LLVM doesn't use global explicitly.
Yes, I have noticed this. I am not sure if it is possible, but maybe the optimization upper bound of wasm-opt could be improved with techniques like data-flow analysis or others? So that such load
/ store
could also be optimized?
As the wasm language is emphasized on near native performance, users might expect that the wasm code being run can also be optimized to a near-optimal state.
In general, many languages do dead code elimination and other hight level transformation which highly depends on lang context before translating to LLVM IR. For example, using own Middle IR (MIR). LLVM also has infrastructure for this. I'ts called MLIR
I think those stores are what LLVM emits in debug mode for locals. In -O0
locals are stored on the stack and not in wasm registers. That is why we can see in that code that it allocates some stack space and then writes each of the parameters into memory there.
Binaryen can't remove such stores atm. It is difficult since unlike LLVM we don't have a separate stack from the heap. We also don't have the undefined behavior LLVM can depend on (that an overflow of a pointer from the heap can't reach the stack, etc.). In other words, from Binaryen's point of view those are just stores to memory, and for all it knows that memory might be read by something else later.
Running LLVM with -O1
would avoid this as it would do mem2reg
which converts stack allocations like this into registers which then become wasm locals. Binaryen has very good optimizations for those.
Note that with wasm GC this will get a lot more optimizable (since there is a lot less aliasing). But without wasm GC, in general Binaryen simply can do a lot less than LLVM, because it operates on wasm IR which is lower than LLVM IR. (But with wasm GC, wasm IR is effectively higher-level.)
We have experimented with dead store elimination in Binaryen (https://github.com/WebAssembly/binaryen/pull/3858) which might help things like this, but it requires global analysis and is not guaranteed to work in all cases. It hasn't been useful enough to finish and land so far.