binaryen
binaryen copied to clipboard
List of missing peephole optimizations
Is it possible doing some trivial logical simplifications for now? Or better wait souper with z3 solver for that? like simplify from:
(func $test (export "test") (type $t0) (param $p0 i32) (result i32)
get_local $p0
i32.const 10
i32.gt_s
get_local $p0
i32.const 10
i32.eq
i32.or)
to:
(func $test (export "test") (type $t0) (param $p0 i32) (result i32)
get_local $p0
i32.const 9
i32.gt_s)
or:
(func $test (export "test") (type $t0) (param $p0 i32) (result i32)
get_local $p0
i32.const 10
i32.ge_s)
DONE: from:
get_global $i
i32.const 0
i32.le_s
i32.eqz
if
to:
get_global $i
i32.const 0
i32.gt_s
if
Good idea, we should do these!
It's not hard to add them. The correct pass is OptimizeInstructions (src/passes/OptimizeInstructions.cpp). Do you want to do it?
Oh, great! I think better if you start some basics and may be after I add additional methods/rules for that pass =)
@MaxGraey Sounds good, I can get the stuff set up so that it's just a matter of adding the specific rules into that pattern. Take a look at https://github.com/WebAssembly/binaryen/commit/d3e5d4416f1bcb8fe3998db39351be766ba250aa - I added the first rule into combineOr there, and a test in $combine-or. The rest should be straightforward, let me know if you need any help.
(I realized I need to land a few other things before that commit, so I can't land it right now, but posting this now so you can get started if you want.)
Great! Are you plan merge this to master? Or I can just fork oi2 and continue work in separate branch?
btw I use one trick with range checking if all arguments is unsigned. Like:
(x >= C0) & (x <= C1)
where x, C0 and C1 are unsigned ints
(unsigned)x - C0 <= C1 - C0
Also integer sequence of logical expressions like:
(x == C) | (x == C + 1) | ... | (x == N)
could transform to:
(unsigned)(x - C) <= N - C
is it make sense implement this rule as well, wdyt?
@MaxGraey
Yes, for now fork oi2 and work on that, since I might not be able to land it today.
That unsigned trick looks valid to me. Maybe leave it for a separate PR afterwards though, so there isn't too much in one?
Also figure outing some useful rules for integer (sing/unsign) comparators:
x > x - 1
could always transform to:
x != MIN_VALUE
and
x + 1 > x
could always transform to:
x != MAX_VALUE
Also wdyt using for x != 0:
get_local $0
i32.eqz
i32.eqz
instead
get_local $0
i32.const 0
i32.ne
This reducing code by 1 byte, but generate worse machine code so this probably useful only for shrink code >= 2. Of course this not very likely code because if/else/select don't require this operation explicitly
Comment before last looks good to me.
Last comment also looks good - as you said in if or select etc. inputs we can flip the arms and avoid != 0 anyhow, but there are cases when it happens elsewhere, so i think it's a good rule to add.
@MaxGraey how are you getting along with these optimizations? Anything I can help with?
@kripken Sorry, currently I busy on work but I hope I will return to this on weekend
Just note. Found another possible improvments like transform this:
i32.const 1
i32.const 0
get_local $p0
select
or this:
get_local $p0
if $I0 (result i32)
i32.const 1
else
i32.const 0
end
to this (depending on $p0 type)
get_local $p0
i32.const 0
i32.ne
UPDATE
Implemented in #2869
More possible optimizations:
- from:
(func $t1 (export "t1") (type $t0) (param $p0 i32) (result i32)
get_local $p0
i32.const 0
i32.lt_s)
to:
(func $t1 (export "t1") (type $t0) (param $p0 i32) (result i32)
get_local $p0
i32.const 31
i32.shr_s)
- from:
(func $t2 (export "t2") (type $t0) (param $p0 i32) (result i32)
get_local $p0
i32.const -1
i32.ne)
to (but not always possible, probably valid only if result used as i1 in next steps):
(func $t2 (export "t2") (type $t0) (param $p0 i32) (result i32)
get_local $p0
i32.const -1
i32.xor)
one more peephole optimization:
~(1 << n) -> rotl(-2, n)
which looks like in wat as:
i32.const 1
get_local $p0
i32.shl
i32.const -1
i32.xor
to:
i32.const -2
get_local $p0
i32.rotl
Done!
(x >> C) != 0 =========> (unsigned)x > ((1 << C) - 1)
((unsigned)x >> C) > 0 ==> (unsigned)x > ((1 << C) - 1)
((signed)x >> C) > 0 ====> (signed)x > ((1 << C) - 1)
(x >> C) == 0 =========> (unsigned)x < ((1 << C) - 1)
(x >> C) < 0 ==========> (unsigned)x < ((1 << C) - 1)
((signed)x >> C) < 0 ====> (unsigned)x >> 31
~~((unsigned)x >> C) < 0 ==> 0~~
Such distributive optimizations for integers also missing currently:
x * y + x * z
x * y - x * z
to
x * (y + z)
x * (y - z)
Upd PR: #3132
~~(signed)x % С_pot == 0 ==> (x & (С_pot - 1)) == 0~~
~~(signed)x % С_pot != 0 ==> (x & (С_pot - 1)) != 0~~
x % C == C ==> 0
x % C != C ==> 1, if C != 0
(unsigned)x % C >= C,
(unsigned)x % C > C ==> 0, if C != 0
(signed)x % C >= C,
(signed)x % C > C ==> C < 0, if C != 0
(unsigned)x % C <= C,
(unsigned)x % C < C ==> 1, if C != 0
(signed)x % C <= C,
(signed)x % C < C ==> (C ^ -1) < 0, if C != 0
Pretty common patterns. But what is surprising is neither LLVM nor GCC can't optimize this:
(x > y) - (x < y) < 0 -> x < y
(x > y) - (x < y) <= 0 -> x <= y
(x > y) - (x < y) == 0 -> x == y
(x > y) - (x < y) > 0 -> x > y
(x > y) - (x < y) >= 0 -> x >= y
Need to think could we generalize this more
// add-sub integer patters
(x + y) - x -> y
(x - y) - x -> 0 - y
(x + y) - y -> x
(x - y) - y -> x - (y << 1)
x - (x + y) -> 0 - y
~~x - (x - y) -> y~~ (#3014)
y - (x + y) -> 0 - x
y - (x - y) -> (y << 1) - x
(x + y) + x -> (x << 1) + y
(x - y) + x -> (x << 1) - y
(x + y) + y -> (y << 1) + x
(x - y) + y -> x
however probably better add general Reassociate sub pass which also handle others associative ops like |, &, ^
// div-mul integer patters
((signed)x / y) * y -> x - ((signed)x % y)
((unsigned)x / y) * y -> x - ((unsigned)x % y), if pure(x) && pure(y) && y != C
~~-(x + C) -> -C - x~~
~~-(C - x) -> x - C~~
-1 - x -> x ^ -1
bool(x) ? -1 : 0 -> 0 - x
bool(x) ? 0 : -1 -> 0 - (x ^ 1)
where bool(x) is getBits(x) == 1
float point simplifications:
x > 0.0 ? 1.0 : -1.0 -> copysign(1.0, +x)
x >= 0.0 ? 1.0 : -1.0 -> copysign(1.0, +x)
x < 0.0 ? 1.0 : -1.0 -> copysign(1.0, -x)
x <= 0.0 ? 1.0 : -1.0 -> copysign(1.0, -x)
x > 0.0 ? -1.0 : 1.0 -> copysign(1.0, -x)
x >= 0.0 ? -1.0 : 1.0 -> copysign(1.0, -x)
x < 0.0 ? -1.0 : 1.0 -> copysign(1.0, +x)
x <= 0.0 ? -1.0 : 1.0 -> copysign(1.0, +x)
copysign(x, +C) -> abs(x)
copysign(x, -C) -> -abs(x)
x * copysign(1.0, x) -> abs(x)
x * copysign(1.0, -x) -> -abs(x)
~~abs(x) * abs(x) -> x * x~~ (#3013)
copysign(x, y) == 0 -> x == 0
copysign(copysign(x, y), z) -> copysign(x, z)
copysign(x, y) * copysign(x, y) -> x * x
~~sqrt(x) < C -> x >= 0 && x < C * C~~
x / C_pot -> x * (1 / C_pot) (#3018)
where x -> float point, C_pot power of two float point constant
~~x * 2.0 -> x + x~~ (#3016) done!
It seems this transform may reduce up to 7 bytes due to wasm doesn't use any variant encoding for floats cc @kripken
~~i32(i64(x)) -> x, where x::i32~~ done
~~i32(u64(x)) -> x, where x::i32~~ done
same as:
i64.extend_s/i32 ;; or i64.extend_u/i32
i32.wrap/i64
~~i64(u32(x)) -> x & 0xFFFFFFFF, where x::i64~~ not optimal
~~or~~
~~i32.wrap/i64
i64.extend_u/i32 ;; but not for extend_s/i32~~
x ? 0 : C_pot -> (!x) << log2(C_pot)
x ? C_pot : 0 -> (!!x) << log2(C_pot)
x * (y ? C1 : C2) -> y ? C1 * x : C2 * x, if C1, C2 <- -1 | 0 | 1
examples:
x * (y ? -1 : 1) -> y ? -x : x
x * (y ? 1 : 0) -> y ? x : 0
x * (y ? -1 : 0) -> y ? -x : 0
~x | x -> -1
~~~x ^ x -> -1~~
~x + x -> -1
~x + 1 -> 0 - x
~~-x - 1 -> ~x (x ^ -1)~~
-1 - x -> ~x (x ^ -1)
~~x < 0 ? -1 : 1 -> x >> 31 | 1~~ done!
x / C -> 0 if maxBits(x) < maxBits(C)
like:
(x & 3) / 4 -> 0
i32(f64.max(f64(x), f64(y)) -> select(x, y, x > y), if type(x & y) == i32 | u32
i32(f64.min(f64(x), f64(y)) -> select(x, y, x < y), if type(x & y) == i32 | u32
i32(f64.abs(f64(x))) -> select(-x, x, x < 0), if type(x) == i32
DONE
~~i32(f64(x)) -> x, if type(x) == i32 | u32~~
~~u32(f64(x)) -> x, if type(x) == i32 | u32~~
~~floor(f64(x)) -> f64(x), if type(x) == i32 | u32~~
~~ceil(f64(x)) -> f64(x), if type(x) == i32 | u32~~
~~trunc(f64(x)) -> f64(x), if type(x) == i32 | u32~~
~~nearest(f64(x)) -> f64(x), if type(x) == i32 | u32~~
-(i32(x) / C) -> i32(x) / -C, if x != u32, C != min_s
this can't be optimized nor div_s either div_u: ~~-x / C -> x / -C~~