# Non-nullable locals, option 1a
Now that "1a" is the final implementation of non-nullable locals in the spec, we should implement it. That approach does have the downside of extra work for tools like Binaryen, though, so it's not trivial to do. The problem is that a set allows gets only until the end of the current block, which means many natural code transforms "break" 1a, such as:
```wat
(local.set $x ..)
(block
  (local.get $x)
)
(local.get $x)

=>

(block
  (local.tee $x ..) ;; set+get => tee here
)
(local.get $x)
```

and:

```wat
(local.set $x ..)
(local.get $x)

=>

(block
  (local.set $x ..)
  ..new code.. ;; new code added here, wrapped in a block
)
(local.get $x)
```
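To make the rule concrete, the check can be sketched in a few lines. This is a toy Python sketch of the 1a idea only, not Binaryen's implementation; the `Block`, `Set`, and `Get` names are invented for illustration.

```python
# Toy sketch of the 1a rule: a set makes later gets of that local valid
# only until the end of the current block. Illustration-only pseudocode,
# not Binaryen's implementation; all names here are invented.

class Set:
    def __init__(self, index):
        self.index = index

class Get:
    def __init__(self, index):
        self.index = index

class Block:
    def __init__(self, *children):
        self.children = children

def validates_1a(body, non_nullable):
    """Check that every Get of a non-nullable local is preceded by a Set
    in the same block or an enclosing one."""
    def check(expr, in_scope):
        if isinstance(expr, Set):
            in_scope.add(expr.index)
            return True
        if isinstance(expr, Get):
            return expr.index not in non_nullable or expr.index in in_scope
        if isinstance(expr, Block):
            inner = set(in_scope)  # sets inside do not escape the block
            return all(check(c, inner) for c in expr.children)
        return True
    return check(Block(*body), set())
```

On this toy IR, the first form above (`[Set, Block(Get), Get]`) validates, while the transformed form (`[Block(Set, Get), Get]`) does not, since the set no longer reaches the final get.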
However we handle this, we need to add fixup logic for 1a, as well as specific workarounds in relevant passes (we don't want to add a local that will end up fixed up later anyway, if that makes things worse - something some passes can anticipate). Aside from that, some possible approaches:
### No 1a validation in Binaryen IR
In this approach Binaryen IR allows more things than normal wasm; in particular, non-nullable locals can be used arbitrarily, and not just in the 1a form. To emit valid wasm binaries:
- Fix up 1a during binary writing. That is, when we find a local that does not validate, make it nullable and add `ref.as_non_null` on its gets (to keep the types of the gets identical).
  - Extra `ref.as_non_null` may be removed, but by this point it is too late to run the pass that does so.
- Fix up 1a as part of the pass pipeline, probably pretty late, with the opt pass that removes `ref.as_non_null` after it.
  - Just running an opt pass or two may then not validate - the user must remember to run the fixup pass.
- Do both of the above: fix up late in the pass pipeline, and also in binary writing. The binary-writing fixups would hopefully never happen if the normal pass pipeline is run.
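As a rough illustration of what the binary-writing fixup would do, here is a toy Python sketch on an invented representation - not Binaryen's code; `fixup_non_nullable` and its data shapes are assumptions for illustration only.

```python
# Toy sketch of the fixup: a non-nullable local that fails the 1a check
# is made nullable, and each of its gets is wrapped in ref.as_non_null
# so the types that consumers see stay identical. Illustration-only;
# all names and data shapes here are invented, this is not Binaryen.

def fixup_non_nullable(local_types, gets, failing):
    """local_types: index -> (heap_type, nullable); gets: local indices
    read, in order; failing: indices that failed 1a validation."""
    fixed_types = {}
    for idx, (ty, nullable) in local_types.items():
        # Declare failing locals as nullable so the function validates.
        fixed_types[idx] = (ty, nullable or idx in failing)
    fixed_gets = []
    for idx in gets:
        if idx in failing:
            # Keep the observed type non-nullable for surrounding code.
            fixed_gets.append(("ref.as_non_null", ("local.get", idx)))
        else:
            fixed_gets.append(("local.get", idx))
    return fixed_types, fixed_gets
```

The extra `ref.as_non_null` wrappers produced this way are exactly what the removal pass mentioned above would try to clean up.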
A downside is that we never validate that we are emitting valid wasm (again, since we don't validate 1a). Also, intermediate states during the pass pipeline - before we get to the fixups - may not validate as 1a, which can be annoying during debugging.
We'd also need to decide what to do in the text format. Perhaps we wouldn't do fixups there, which would keep it 1:1 with Binaryen IR for testing.
### 1a validation in Binaryen IR
Here our validator adds 1a checks. Each pass must therefore emit valid 1a. Some options for that:
- Fix up after each pass we run, automatically, in the pass runner.
- Passes that might break validation must run a fixup.
  - This adds a few lines to most of our passes.
The extra work makes compilation something like 7% slower overall. It's not a huge cost, since the fixup is basically 1a validation, which is simple and fast, and the data is already in cache since we just worked on it. Still, it's a noticeable slowdown. (This applies only to GC builds, though: when GC is off we can avoid any extra work.)
A downside here is that enforcing 1a limits what we can do in intermediate steps. In theory, allowing more freedom there might help even if we end up undoing it later (other beneficial effects might remain). On J2Wasm and Dart at least this effect seems to be negligible, though it does happen.
Thoughts?
I would prefer not to validate - it gives optimizations more flexibility and less to think about and does not have a significant performance cost.
For the fixup pass, I would prefer to run it as a normal pass that is inserted into the default pass pipeline so we can run OptimizeInstructions as a follow-up. We should also run it as an unconditional preliminary stage of module writing to cover situations where the default pass pipeline is not used. We currently assume that every pass that modifies Binaryen IR must necessarily invalidate Stack IR, but we can add a `preservesStackIR` method and have the fixup pass work on both Stack IR and Binaryen IR to avoid losing Stack IR optimizations.
I think we should have our text format match our binary output to the extent possible, but I guess we already print `tuple` instructions without lowering them away, so I could be convinced otherwise on this.
Running on both normal IR and Stack IR is a good idea. In Stack IR, though, it would need to insert new instructions (the casts), which is awkward there, but doable. And I guess we could start by just invalidating Stack IR, which might be ok as a small temporary regression.
The other downsides of that approach worry me more, though: not actually validating 1a on our output (only VMs would catch any bugs of ours), and intermediate states that do not validate as 1a (annoying when debugging intermediate passes, since we can't run those on a VM - mostly a burden for us and not our users, but still annoying).
> not actually validating 1a on our output
In debug mode perhaps we can add some extra validation as part of module writing.
> intermediate states do not validate on 1a
When the intermediate results are dumped during debugging, I guess they would have the same fixups applied?
> When the intermediate results are dumped during debugging, I guess they would have the same fixups applied?
If we do that then it would affect reproducibility. That is, if we run passes A, B, C, and we save the module to a wasm file right after B, then running C on that file would not necessarily produce results identical to the original run of all the passes in sequence. This can be quite annoying in practice.
If we don't apply fixups to text files then that would be an escape hatch for that situation, which is already useful today for stacky code.
An interesting issue the fuzzer noticed here: the SSA pass will create new locals and also rename existing ones. Even if it fixes up non-nullability of new locals, the renaming can break 1a validation, because as an optimization it doesn't bother renaming in unreachable code. The renaming is not observable there to execution, but it is to 1a validation. Example:
```wat
(local.set $x (..))
(unreachable)
(local.get $x)

==> ;; ssa

(local.set $x.renamed (..)) ;; rename in reachable code
(unreachable)
(local.get $x) ;; do not rename in unreachable code
```
That `local.get` may have no `set` for it any more.
This isn't hard to work around (the normal workaround can be done), but it does show an interesting corner case.
Another interesting situation:
```wat
(unreachable)
(local.get $nn) ;; a get of a non-nullable local, in unreachable code

=> ;; some pass that decides to reorder those two

(local.get $nn)
(unreachable)
```
Reordering a `local.get` is always valid, so this can happen in various ways. If the get has no set before it at all, then it is reading a null from a non-nullable local. That hits an assertion in the interpreter, which would not have been hit before that pass ran.
The problem is that such a `local.get` of a non-nullable value has a "side effect" of not being valid to execute. If we fix things up after each pass then this situation never occurs - the local would become nullable before we execute the code - but if we leave fixups for late in the pipeline or for binary writing then this situation can happen.
The fix for this particular example is probably to trap in the interpreter, which means basically implementing option 6 from the spec discussion. As reordering traps is valid in Binaryen IR, that would make this valid once more. Reordering with other things might still be a problem, but only in code that began broken (it marks a local as non-nullable but it can read a null).
Trapping isn't a full solution either, I realize. E.g. RemoveUnusedBrs wants to do this:
```wat
(if
  (condition)
  (call $noticeable)
  (local.get $nn)
)

=>

(select
  (call $noticeable)
  (local.get $nn)
  (condition)
)
```
That would execute the get unconditionally, which would then start to trap. If the condition was guarding against exactly that then we are in trouble.
I'm not sure we have a good way out of this. We could just make EffectAnalyzer mark every `local.get` of a non-nullable local as potentially trapping, but that would have a significant cost - it would prevent the example just shown, and the transform is valid if the local is in 1a form, which is the case 90%+ of the time. So we'd want to check which locals are in 1a form and which are not, and while that check is linear, we don't really want to do it on each EffectAnalyzer computation - it's a function-wide operation, while sometimes we just look at a single expression.
The most reasonable path might be to perform the 1a fixup after each pass ("1a validation in Binaryen IR" in the first post). That's cheaper than doing it in each EffectAnalyzer computation. As mentioned above it adds 7% overall, but maybe that's the cost we need to pay here?
(A "heavier" IR, more like those in optimizing VMs, would just constantly track the 1a-edness of each `local.get`, and update it in each transform. That might be faster overall, but it would be a huge change that adds a lot of complexity.)
Another downside I noticed while investigating the options here using fuzzing: Fixing things up during binary writing makes fuzzing very slow. If there is a testcase that hits a bug, and fixing up the non-nullable locals would avoid the bug, then each iteration in the fuzzer will avoid the bug when it is written out. We'd maybe want to add an option to avoid fixups during binary writing, which adds more complexity.
> The fix for this particular example is probably to trap in the interpreter, which means basically implementing option 6 from the spec discussion.
I'm not sure trapping makes sense. Working backwards: the semantics of the program after fixups is that the local.get succeeds and the fixups should semantically be no-ops (they just map the internal language with more relaxed validation to the real language), so the semantics before fixups should also be that the local.get succeeds.
> Trapping isn't a full solution either, I realize. E.g. RemoveUnusedBrs wants to do this...
If we don't trap, then we don't run into that problem and the normal fixups can make the code valid as necessary.
Well, we could just hack the assertion that checks that an instruction emits a type that is a subtype of its declared type (just for non-nullable `local.get`) - that would avoid the immediate assert - but the null could still flow outward to a use:
```wat
(unreachable)
(struct.get $A 0
  (local.get $nn)
)

=> ;; some pass that decides to reorder those two

(struct.get $A 0
  (local.get $nn)
)
(unreachable)
```
Now the `struct.get` must trap. In that case, we might as well make the `local.get` trap, which avoids hacking that assertion. It also seems like it would be more consistent and predictable.
(But this seems smaller than the problem in my post after it.)
The interpreter should still have `local.get` trap where the fixup would make it trap, namely when the local hasn't been initialized at all in the current execution. Is that the same as what you were thinking?
But I don't think there should be any special handling in the effects analyzer since we already prevent gets from moving back before their corresponding sets.
> The interpreter should still have local.get trap where the fixup would make it trap
After the fixups, yes. But before the fixups we run into the problem I mentioned. So this is an issue for only doing the fixups late in the pipeline + in binary writing (and it is not a problem for doing fixups after each pass).
I guess my high-level point is that the interpreter should provide the semantics before and after the fixups. That's independent of when and how often fixups are applied.
I'm not sure what you mean by "provide the semantics"? Are you saying the semantics should be different before and after the fixups?
After the fixups I see no problem. But before them, I don't see a good option either way:
- If we don't trap, we'd be punching a hole in type safety in the interpreter. For example, a `(ref.cast (local.get))` that has a non-nullable type would now be able to return a null. We wouldn't just need to disable the assertion on `local.get` but on everything the null can reach: the interpreter would need to always consider it ok to see a null flowing out of a non-nullable expression.
- If we do trap, then we really need to make EffectAnalyzer aware of that, with the downside of worse code unless we actually compute 1a there.
> I'm not sure what you mean by "provide the semantics"? Are you saying the semantics should be different before and after the fixups?
Sorry, I meant to write "provide the same semantics" but missed the key word 😅
Concretely, when executing a local.get of a non-nullable local, the interpreter should:
- trap if the local has not been initialized
- return the value of the local if it has been initialized (and the value is not null because we do not allow writing nulls to non-nullable locals)
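Those two rules can be sketched as toy interpreter behavior - illustration-only Python, not Binaryen's interpreter; `Frame` and `Trap` are invented names.

```python
# Toy sketch of the proposed semantics: local.get of an uninitialized
# non-nullable local traps; otherwise it returns the stored value.
# Illustration-only; Frame and Trap are invented, this is not Binaryen.

class Trap(Exception):
    pass

class Frame:
    def __init__(self, non_nullable):
        self.non_nullable = set(non_nullable)  # indices of non-nullable locals
        self.values = {}  # locals written so far in this execution

    def local_set(self, index, value):
        if index in self.non_nullable and value is None:
            raise Trap("wrote null to a non-nullable local")
        self.values[index] = value

    def local_get(self, index):
        if index not in self.values:
            if index in self.non_nullable:
                # Uninitialized non-nullable local: trap instead of
                # producing a null of a non-nullable type.
                raise Trap("uninitialized non-nullable local")
            return None  # nullable locals default to null
        return self.values[index]
```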
We don't need to make EffectAnalyzer aware of the traps, since no optimization would move a get before the first set in reachable code anyway. So there should be no effect on optimization quality.
> trap if the local has not been initialized [but] We don't need to make EffectAnalyzer aware of the traps
I don't think this can work as it hits the problems I mentioned before. Over the weekend I realized I may not have been explaining it well, so let me try again.
```
var x = non null Foo;   // declare non-nullable local
x = (return, getFoo()); // x is set to a sequence of a return + getting a non-null value
x                       // get x

=> // 1. simplify the set: "getFoo()" and "x =" do not execute, so just return

var x = non null Foo;
return;
x // this get no longer has a set

=> // 2. reorder the return with the get. the get has no effects, so it's ok

var x = non null Foo;
x // non-nullable local getting a null
return;

=> // 3. fix up non-nullable locals

var x = null Foo; // now nullable
if (!x) trap;     // this traps
return;
```
This program now traps, but it shouldn't. (This isn't hypothetical - credit for finding this goes to the fuzzer.)
This happens because we didn't tell EffectAnalyzer about an effect. One solution is for a `local.get` of a non-nullable local to be marked as trapping. There might be other ways to work around this:
- If it only happens in unreachable code then running DCE after every pass should work. (But that would be even more costly than fixing up 1a after each pass.)
- Perhaps running Vacuum before fixing up non-nullable locals would work. That would remove the `local.get` in the example above (since it has no effects). But I'm not sure that's enough, and it seems fragile.
- Avoid reordering code when one item is unreachable (suggested by @tlively offline). I think that might work, but I'm not sure what the downsides would be. It also assumes the problem is limited to unreachable code, which I am not 100% sure of.
The other option is to fix up 1a after each pass. That avoids the above problem because between steps 1 and 2 we'd make the local nullable and mark the get as possibly trapping, after which it is marked properly and won't be moved into reachable code. In theory a pass that does all of steps 1, 2, and 3 at once could still hit such a problem, so future passes would need to watch out for that in principle, but writing small, modular passes as we do tends to avoid that complexity (a large composite pass always has the burden of not emitting something invalid at the end).
I would say we should just go with the fixup after every pass option for now, but a 7% optimization time hit seems pretty bad. It might be worth investigating the no-reordering-with-unreachable approach. Maybe we should land the expensive, simple fix now but keep working on improving it?
I've meanwhile done some optimizations on the fixup-after-each-pass approach, and it's less than half of the original 7%. I think I can get it even lower.
I believe I've gotten this down to 2-4% overhead (2% on Dart, 4% on J2Wasm). However, that is still somewhat annoying... Before I open a PR I think I'll tinker with the parallel approach of doing more DCE. I suspect that very few passes may actually end up creating DCE opportunities, so manually annotating them may lead to good results.