binaryen [inlining] increasing accuracy of the code size assessment model

Old code size measure is simply based on operation count in callee. It will miss the opportunity like

(func (param $ptr i32) (param $value i32)
  local.get  $ptr
  local.get $value
  i32.store
)

which is always happened for variable setter in OO language.

In this PR, we want to introduce inlining specific measurer to ignore this pattern for better inlining result. The new measurer will ignore all occurrences of local.get that occur in sequence. The behavior of the new measurer will be consistent with that of the old measurer in:

When parameters are not retrieved in sequence, a new local variable will be introduced after inlining. At this point, to prevent an increase in code size,
produce operand stack with global.get or xxx.const
update the value of local with local.set

In a word, the new measure want to match this pattern: The function body only contain local.get $param_n and trivial consumer operation like xxx.add xxx.store if etc.

Jun 10 '25 14:06 HerrCai0907

Interesting idea. Do you have measurements on some benchmark perhaps? I'm not sure I have a good guess about this without data.

Jun 11 '25 17:06 kripken

I use assemblyscript bootstrap as bench wasm-opt flags: --shrink-level 2 --all-features -Oz

wasm-opt	size (Bytes)
main@2b989ae	791963
after PR	792066 (+103)
after PR but treat store as invalid	791896 (-67)

The store will use at least 3 bytes which is 1 byte larger than call, so inline it will increase 1 byte in WASM size. But it is depends on the frequency... But if we only keep the normal algorithm operation, It is definitely profitable.

Jun 12 '25 13:06 HerrCai0907

And I think it is worth to inline if we want a balance in both shrink level and optimize level. It will not increase code size much and inlining can reduce the cost of function call and introduce more optimization opportunity later.

Jun 12 '25 13:06 HerrCai0907

I just realized that I'm fixing the same issue with #7669 and #7670. (note: #7670 is based on #7669, it includes changes in #7669)

My fix is different: I don't change code size measurements, instead improve the existing "trivial call" heuristic.

@HerrCai0907 it would be interesting to see how much different #7670 makes on the same benchmark you run above, and whether it optimizes the same way.

(Note: I need to rebase #7670, which I'll do now.)

Jun 23 '25 13:06 osa1

I think the PRs linked above should be "safer" in the sense that they won't cause accidentally inlining too much, because they don't change size calculations.

But even with those merged I think this PR may make sense. local.gets in the right order never increases call site sizes, and if the function size is an indicative of how large an inlined call site can get, then local.gets in the right order shouldn't contribute to the sizes.

Assuming we merge my PRs, I would make these changes in this PR:

Remove the part that makes the function size 0. With my PRs we already inline these functions regardless of their sizes.
Ignore ordered local.gets, but:
- For unused arguments, add 1 to the code size, for the potential drop introduced at the call sites.
- For constants between local.gets, add sizes of the constant instructions to the size.

(If we see a non-constant instruction between two local.gets, the rest of the argument need to be count as normal because new locals may need for each of the arguments after the non-constant instruction, to maintain evaluation order.)

So in a function like:

(func $f (param $x1) (param $x2) (param $x3)
  (call $g
    (i32.const 123)
    (local.get $x2)
    (i32.const 456)
    (local.get $x3)))

We count i32.const instructions, we add 1 for the missing x1 (accounting for potential drops at the call sites), and ignore x2 and x3.

This can potentially cover code like: f calls g but with an extra constant argument passed:

void f(arg1, arg2, arg3) {
    return g(arg1, arg2, arg3, 0);
}

This kind of thing can be generated when at the source code you have default arguments which some call sites omit, and the default value is a constant integer like 0 above.

In other words, consider "size" as the worst case code size contribution to call sites, rather than the size of the function being inlined.

(We could even consider the number of extra locals required when the function locals are not used in order, but I'm not bright enough to immediately see what kind of algorithm would be needed here.. and we're already probably getting very little benefit with all of this after the other two PRs linked above)

Jun 25 '25 13:06 osa1