[inlining] increasing accuracy of the code size assessment model
Old code size measure is simply based on operation count in callee. It will miss the opportunity like
(func (param $ptr i32) (param $value i32)
local.get $ptr
local.get $value
i32.store
)
which is always happened for variable setter in OO language.
In this PR, we want to introduce inlining specific measurer to ignore this pattern for better inlining result.
The new measurer will ignore all occurrences of local.get that occur in sequence.
The behavior of the new measurer will be consistent with that of the old measurer in:
- When parameters are not retrieved in sequence, a new local variable will be introduced after inlining. At this point, to prevent an increase in code size,
- produce operand stack with
global.getorxxx.const - update the value of local with
local.set
In a word, the new measure want to match this pattern: The function body only contain local.get $param_n and trivial consumer operation like xxx.add xxx.store if etc.
Interesting idea. Do you have measurements on some benchmark perhaps? I'm not sure I have a good guess about this without data.
I use assemblyscript bootstrap as bench
wasm-opt flags: --shrink-level 2 --all-features -Oz
| wasm-opt | size (Bytes) |
|---|---|
| main@2b989ae | 791963 |
| after PR | 792066 (+103) |
| after PR but treat store as invalid | 791896 (-67) |
The store will use at least 3 bytes which is 1 byte larger than call, so inline it will increase 1 byte in WASM size. But it is depends on the frequency... But if we only keep the normal algorithm operation, It is definitely profitable.
And I think it is worth to inline if we want a balance in both shrink level and optimize level. It will not increase code size much and inlining can reduce the cost of function call and introduce more optimization opportunity later.
I just realized that I'm fixing the same issue with #7669 and #7670. (note: #7670 is based on #7669, it includes changes in #7669)
My fix is different: I don't change code size measurements, instead improve the existing "trivial call" heuristic.
@HerrCai0907 it would be interesting to see how much different #7670 makes on the same benchmark you run above, and whether it optimizes the same way.
(Note: I need to rebase #7670, which I'll do now.)
I think the PRs linked above should be "safer" in the sense that they won't cause accidentally inlining too much, because they don't change size calculations.
But even with those merged I think this PR may make sense. local.gets in the right order never increases call site sizes, and if the function size is an indicative of how large an inlined call site can get, then local.gets in the right order shouldn't contribute to the sizes.
Assuming we merge my PRs, I would make these changes in this PR:
- Remove the part that makes the function size 0. With my PRs we already inline these functions regardless of their sizes.
- Ignore ordered
local.gets, but:- For unused arguments, add 1 to the code size, for the potential
dropintroduced at the call sites. - For constants between
local.gets, add sizes of the constant instructions to the size.
- For unused arguments, add 1 to the code size, for the potential
(If we see a non-constant instruction between two local.gets, the rest of the argument need to be count as normal because new locals may need for each of the arguments after the non-constant instruction, to maintain evaluation order.)
So in a function like:
(func $f (param $x1) (param $x2) (param $x3)
(call $g
(i32.const 123)
(local.get $x2)
(i32.const 456)
(local.get $x3)))
We count i32.const instructions, we add 1 for the missing x1 (accounting for potential drops at the call sites), and ignore x2 and x3.
This can potentially cover code like: f calls g but with an extra constant argument passed:
void f(arg1, arg2, arg3) {
return g(arg1, arg2, arg3, 0);
}
This kind of thing can be generated when at the source code you have default arguments which some call sites omit, and the default value is a constant integer like 0 above.
In other words, consider "size" as the worst case code size contribution to call sites, rather than the size of the function being inlined.
(We could even consider the number of extra locals required when the function locals are not used in order, but I'm not bright enough to immediately see what kind of algorithm would be needed here.. and we're already probably getting very little benefit with all of this after the other two PRs linked above)