binaryen icon indicating copy to clipboard operation
binaryen copied to clipboard

wasm-opt: empty function is not inlined/removed

Open Anonymity20202 opened this issue 1 year ago • 11 comments

The code example:

#include <stdint.h>
#include <stdio.h>
struct a {
  int8_t b
};
struct c {
  int16_t b
} d = {13015};
int32_t g_3[4];
uint16_t e = 57582;
struct c f;
int32_t *const g(int32_t *, int32_t, int32_t *);
static struct a func_14(int32_t *, int32_t *, int32_t *, struct c, int32_t *);
int64_t h() {
  int32_t i;
  int64_t m = 113;
  int32_t j = &g_3[2];
  g((func_14(g_3, i, 0, d, g_3), i), m, j);
}
int32_t *const g(int32_t *, int32_t k, int32_t *l) {
  int32_t r;
  *l = k + e;
  *l = (func_14(0, 0, r, f, g_3), 0);
}
struct a func_14(int32_t *n, int32_t *o, int32_t *p, struct c q, int32_t *t) {}
void main() {
  printf("hello\n");
  int s = 0;
  h();
  printf("g_3: %d %d %d %d\n", g_3[0], g_3[1], g_3[2], g_3[3]);
}

Version of wasm-opt:

wasm-opt version 109 (version_109-8-gbab21052c)

wasm-opt option: wasm-opt --post-emscripten -O3 --low-memory-unused --zero-filled-memory --strip-producers -g --mvp-features --converge --inlining-optimizing --local-cse --code-folding --licm --rse --precompute-propagate --optimize-added-constants-propagate

func_14 after wasm-opt optimzation passes:

  (func $func_14 (type 7) (param i32 i32 i32 i32 i32) (result i32)
    (local i32)
    global.get $__stack_pointer
    i32.const 32
    i32.sub
    local.tee 5
    local.get 3
    i32.store16 offset=16
    local.get 5
    local.get 0
    i32.store offset=12
    local.get 5
    local.get 1
    i32.store offset=8
    local.get 5
    local.get 2
    i32.store offset=4
    local.get 5
    local.get 4
    i32.store
    local.get 5
    i32.load8_u offset=24)

Problem Description: I try to compile the above code with emcc -O0 -g and then optimize the generated wasm code with wasm-opt, and I expect the empty function func_14 will be inlined/optimized. wasm-opt seems to fail to remove or inline this function even when --inlining-optimizing is enabled. Not sure if this is a feature of wasm optimizer or an under-optimization problem.

Anonymity20202 avatar Aug 20 '22 12:08 Anonymity20202

func_14 return struct which emulated via shadow stack and linear memory. In wasm store / load operations can't be removed due to side effects (may potentially produce a OOB trap).

MaxGraey avatar Aug 20 '22 12:08 MaxGraey

Thanks for answering!

In wasm store / load operations can't be removed due to side effects

This is interesting to me. Does this mean all store/load instructions in wasm code are actually non-optimizable? Or it may cause additional risk once these instructions are optimized?

Also, as for the func_14 in the above code example, $__stack_pointer is read but not written. In my understanding, the value is stored outside the stack and never used. The return value of this function is also not used:

    ...
    call $func_14
    local.get 2
    i32.load offset=20
    local.set 12
    ...
    call $func_14
    local.get 5
    i32.load offset=16
    local.set 18
    ...

Besides, If the C source code is compiled with emcc -O3, func_14 will be inlined and optimized. Thus, I think there may be potential optimization opportunities with this function?

Anonymity20202 avatar Aug 20 '22 13:08 Anonymity20202

Also, as for the func_14 in the above code example, $__stack_pointer is read but not written

The result's of global.get $__stack_pointer used to calculate value which stores in i32.store16. It's just one pulling the other. But the problem here is also that the function itself is quite large and cannot be inlined and therefore result cannot be wrapped in "drop" afterwards, which prevents the optimizer from removing / cleanup some dead code.

Ah btw in all call sites result of func_14 always use

MaxGraey avatar Aug 20 '22 14:08 MaxGraey

Does this mean all store/load instructions in wasm code are actually non-optimizable?

Dead loads and stores could be eliminated probably with --traps-never-happen. But which may produce some other issues

MaxGraey avatar Aug 20 '22 16:08 MaxGraey

Not sure if I understood you correctly, but I think the result of func_14is not used in all call sites, e.g., *l = (func_14(0, 0, r, f, g_3), 0); is equivalent to func_14(0, 0, r, f, g_3); *l = 0;. I understand that the function inlining depends on some heuristics and the size of a function may prevent it from being inlined. But since all instructions in func_14 have no effect on the result of execution and thus could be optimized safely, I wonder if it is possible to improve the wasm optimizer by combining more advanced heuristics/techniques across different optimization passes (dead code elimination or OptimizeInstructions)?

Anonymity20202 avatar Aug 21 '22 08:08 Anonymity20202

I see. It's *l = (func_14(0, 0, r, f, g_3), 0). It seemed to me that *l = func_14(0, 0, r, f, g_3).

MaxGraey avatar Aug 21 '22 08:08 MaxGraey

Also, how does wasm-opt defines whether an instruction is dead or not? #4947 as shown in this example, some instructions that are considered dead code by traditional C compilers may not be optimized away by wasm-opt.

Anonymity20202 avatar Aug 21 '22 08:08 Anonymity20202

https://github.com/WebAssembly/binaryen/issues/4947 as shown in this example, some instructions that are considered dead code by traditional C compilers may not be optimized away by wasm-opt.

LLVM doesn't use global explicitly. It stores such global in linear memory and uses load / store ops instead global.get / global.set. See: https://godbolt.org/z/r96vW9c5d

MaxGraey avatar Aug 21 '22 08:08 MaxGraey

LLVM doesn't use global explicitly.

Yes, I have noticed this. I am not sure if it is possible, but maybe the optimization upper bound of wasm-opt could be improved with techniques like data-flow analysis or others? So that such load / store could also be optimized? As the wasm language is emphasized on near native performance, users might expect that the wasm code being run can also be optimized to a near-optimal state.

Anonymity20202 avatar Aug 21 '22 09:08 Anonymity20202

In general, many languages do dead code elimination and other hight level transformation which highly depends on lang context before translating to LLVM IR. For example, using own Middle IR (MIR). LLVM also has infrastructure for this. I'ts called MLIR

MaxGraey avatar Aug 21 '22 09:08 MaxGraey

I think those stores are what LLVM emits in debug mode for locals. In -O0 locals are stored on the stack and not in wasm registers. That is why we can see in that code that it allocates some stack space and then writes each of the parameters into memory there.

Binaryen can't remove such stores atm. It is difficult since unlike LLVM we don't have a separate stack from the heap. We also don't have the undefined behavior LLVM can depend on (that an overflow of a pointer from the heap can't reach the stack, etc.). In other words, from Binaryen's point of view those are just stores to memory, and for all it knows that memory might be read by something else later.

Running LLVM with -O1 would avoid this as it would do mem2reg which converts stack allocations like this into registers which then become wasm locals. Binaryen has very good optimizations for those.

Note that with wasm GC this will get a lot more optimizable (since there is a lot less aliasing). But without wasm GC, in general Binaryen simply can do a lot less than LLVM, because it operates on wasm IR which is lower than LLVM IR. (But with wasm GC, wasm IR is effectively higher-level.)

We have experimented with dead store elimination in Binaryen (https://github.com/WebAssembly/binaryen/pull/3858) which might help things like this, but it requires global analysis and is not guaranteed to work in all cases. It hasn't been useful enough to finish and land so far.

kripken avatar Aug 22 '22 16:08 kripken