Partially out-of-bound writes
The semantics of partially OOB writes were clarified in #1025. This has proved problematic for runtimes running on Arm, where the architecture provides no such guarantee. I think all browsers now implement the spec properly, but I'm not sure whether any of the non-browser, wasm-specific runtimes do.
I can see that having deterministic memory contents is a nice property, but I'm not sure what the real advantages are; could someone enlighten me?
And do you think there's any room to relax the spec in this regard, possibly even through an extension like relaxed-simd?
As one point, deterministic behaviour helps to generally ensure that different implementation decisions in different engines don't lead to observably divergent behaviours (even on a single underlying system). Presumably this would be relevant here when comparing engines implementing load/store with explicit bounds check vs guard pages.
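To make the divergence concrete, here is a toy Python model (not any engine's actual code) of a 4-byte store straddling the end of memory: with an explicit up-front bounds check nothing is written, while a guard-page fault taken on hardware that commits byte by byte can leave the in-bounds prefix modified.

```python
# Toy model (not any engine's real code) of a 4-byte store that straddles
# the end of an 8-byte memory, contrasting the two bounds-check strategies.

MEM_SIZE = 8

def store_explicit_check(mem, addr, data):
    # Explicit up-front bounds check: trap before touching memory at all.
    if addr + len(data) > len(mem):
        raise MemoryError("out of bounds")
    mem[addr:addr + len(data)] = data

def store_guard_page(mem, addr, data):
    # Guard-page style on hardware that commits byte by byte: the fault is
    # only taken when an OOB byte is reached, so the in-bounds prefix may
    # already have been modified by then.
    for i, b in enumerate(data):
        if addr + i >= len(mem):
            raise MemoryError("out of bounds")
        mem[addr + i] = b

m1, m2 = bytearray(MEM_SIZE), bytearray(MEM_SIZE)
for mem, store in ((m1, store_explicit_check), (m2, store_guard_page)):
    try:
        store(mem, 6, b"\xff\xff\xff\xff")  # bytes 8 and 9 are OOB
    except MemoryError:
        pass

assert m1 == bytearray(MEM_SIZE)         # untouched
assert m2 == bytearray(6) + b"\xff\xff"  # in-bounds prefix written
```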
How do the browsers using guard pages currently achieve conformant behaviour on Arm?
I'm not sure for Firefox: here's a lengthy thread about the issues they've found, and it's not just Arm-specific.
For V8, though I don't think I've actually enabled it yet for Chromium, we can rely on trap handlers for loads but have to perform manual bounds checks for stores.
> deterministic behaviour helps to generally ensure that different implementation decisions in different engines don't lead to observably divergent behaviours
Sure, but I'm still struggling to understand in what situations it is useful, I don't understand how it helps once we trap. Can it help in some debug situations?
> Sure, but I'm still struggling to understand in what situations it is useful, I don't understand how it helps once we trap. Can it help in some debug situations?
Sorry, I rushed to write the previous comment and didn't work through the reasoning properly. On the Web, it's possible to catch Wasm traps in JS and continue execution (and observe the state of the memory). We have a lot of previous experience with tiny edge-cases of the semantics being relied on by some Web program, so there would be some inertia against introducing further non-determinism here. That being said, if browsers currently aren't implementing (or even can't implement) this conformantly, as the Firefox issue above suggests, that would be an argument that we could relax things, since existing Web programs couldn't be relying on conformant behaviour.
Okay, thanks.
From bug trackers, though, it seems this only became 'a problem' once the necessary tests were added to the wasm spec test suite. And I'd just like to reiterate that most (all?) wasm runtimes only pass this test on Arm hardware because of microarchitectural details (luck!), so it's not just Web programs that can't rely on conformant behaviour... it's basically everything. It seems it is both difficult to implement two policies, and it also means giving up ~10-20% execution performance when using explicit checks.
> It seems it is both difficult to implement two policies, and it also means giving up ~10-20% execution performance when using explicit checks.
Yeah, if being truly conformant requires engines to totally give up on the guard page strategy, I would (personally) count that as "can't implement conformantly" and (personally) argue that the semantics should be relaxed.
@eqrion am I understanding correctly that today in "release" Firefox, it's possible to craft a program that (on some platform) can observably write bytes to memory in the case of a partially OOB write?
Even if we were to relax this, we would need to fix a behaviour in deterministic mode, which some platforms have a hard reliance on. Is the only solution for them to not use guard pages, or not use Arm?
I don't have an answer for how else to get around it... Maybe the penalty of explicit checks is worth the absolute determinism? Having to exclude an underlying hardware implementation doesn't sound very wasmy. Would you mind sending me some link(s) to the aforementioned platforms? I would really like to know about some concrete use-cases.
EDIT: On a single aarch64 Linux system, I found the execution time overhead of explicit checks, for stores only, to be ~5% across an arbitrary set of benchmarks.
One particularly sensitive class of examples is blockchain platforms running Wasm, because they rely on consensus between replicated distributed computations, possibly across heterogeneous hardware. That is only meaningful if all executions are guaranteed to be fully deterministic, including in cases where traps occur. There are a bunch of these platforms, such as Dfinity, Parity, etc.
Thank you! I feared blockchain might be the answer :)
So, sorry to get a little side-tracked, but I think I'm getting lost among the jargon... I see you are/were affiliated with Dfinity, so maybe you know for them, does their platform run on aarch64? And do you know whether most of these platforms are using wasmtime/cranelift? (IIRC, cranelift does not produce conforming code.)
I can see that Parity have their own interpreter, do you know if rolling-your-own interpreters/compilers is a popular choice?
Does the blockchain use case require consensus on failed executions? I'd have expected that a result is thrown out if it ends in a trap, and that that is effectively the way to get to deterministic outcomes for that case.
@sparker-arm Can you clarify which arm architectures might be affected? Does this apply to modern processors and arm64, or is this limited to certain classes of arm processors?
> Can you clarify which arm architectures might be affected
To be pedantic, no Arm architecture provides the guaranteed semantics; it is implementation defined. The problems generally do not arise during testing because it's either running in an emulator or on a microarchitecture that does provide the semantics. For Arm-designed cores, it can be loosely described as: the big cores match wasm, but the small cores do not. I don't have an exhaustive list. Apple silicon doesn't suffer from this inconsistency.
@sparker-arm, afaik, Dfinity currently only runs on x64 nodes, but it is not intended to stay like that. For their use case at least, an interpreter would be too slow, because it's a 3rd gen blockchain that is meant to support full-featured on-chain applications.
@tschneidereit, consensus also applies to trapping executions, since they have to materialise as a rejected message at the call site (for distributed message calls), so are an integral and well-defined part of the execution model.
That said, I'm sure there are other use cases for deterministic execution that do not care about the failure case. Reproducible execution/builds could fall into that category (not sure).
> it also means giving up ~10-20% execution performance when using explicit checks.
FWIW, the overheads can be much greater than that, depending on the benchmark. I've measured ~1.55x slowdowns in both Wasmtime and SpiderMonkey on real-world programs (spidermonkey.wasm running a JS markdown renderer).
@sparker-arm Is there any feature flag or other way to detect whether the underlying hardware does a partial write or not (other than, say, the runtime performing one and catching the signal at startup)? Is it at least guaranteed that the hardware behaves the same way consistently? (Would big.LITTLE throw a wrench into that perhaps?)
I'm also curious about other architectures: does anyone know offhand if e.g. RISC-V or other ISAs guarantee anything about partially-trapping writes?
@sparker-arm one other thought: is it sufficient, architecturally, to do a load (and throw away the result) to a stored-to address before every store? We would then take the fault on the load instead, without side-effects. Or does one need a fence to ensure the store doesn't reorder in that case?
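A hypothetical sketch of the intended effect, using a page-granular toy memory in Python rather than real machine code: probe loads touch every byte the store would write, so a partially OOB access faults before any byte is stored. Whether real hardware forbids the store's bytes from becoming visible in that window is exactly the open architectural question.

```python
# Toy model of the "probe load before store" idea. Unmapped pages stand in
# for guard pages; the probe loads fault first, leaving memory unmodified.

PAGE = 4  # toy page size

class Memory:
    def __init__(self, mapped_pages):
        self.pages = {p: bytearray(PAGE) for p in range(mapped_pages)}

    def _locate(self, addr):
        page, off = divmod(addr, PAGE)
        if page not in self.pages:
            raise MemoryError(f"fault at address {addr}")  # guard page hit
        return self.pages[page], off

    def load_byte(self, addr):
        page, off = self._locate(addr)
        return page[off]

    def store(self, addr, data):
        # Probe: load (and discard) every byte of the destination first.
        for i in range(len(data)):
            self.load_byte(addr + i)
        # Only now commit the (possibly byte-wise) store.
        for i, b in enumerate(data):
            page, off = self._locate(addr + i)
            page[off] = b

mem = Memory(mapped_pages=2)  # addresses 0..7 are mapped
try:
    mem.store(6, b"\xff\xff\xff\xff")  # partially OOB
except MemoryError:
    pass

# The probe faulted before any byte was stored.
assert bytes(mem.pages[1]) == b"\x00\x00\x00\x00"
```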
> Would big.LITTLE throw a wrench into that perhaps?
Yup! As my alter ego 'grubbymits' accidentally mentioned, I believe almost all configurations of big and small cores would be incompatible. I have a feeling there may be a single big Armv8 core that behaves differently from the rest.
> @sparker-arm one other thought: is it sufficient, architecturally, to do a load (and throw away the result) to a stored-to address before every store? We would then take the fault on the load instead, without side-effects. Or does one need a fence to ensure the store doesn't reorder in that case?
I don't know, I need to speak to one of our memory model experts. I suspect that it may not work without atomics.
Thinking about the load-before-store idea a bit more: a page fault is a precise exception, in the sense that all side-effects of earlier instructions should occur, and none of the later instructions' should, correct? (The concern here is just about precision within the sub-operations of a single instruction's write.) In other words, if I have a load that faults on page A, and then later in the program a store to address B, the architecture should unambiguously state that the store to B does not occur (or may occur speculatively, but we're only concerned with final architectural state here)? Otherwise, you couldn't restart after a page fault, because A and B might alias.
Even if this does work, I don't really think we should be coming up with hacks to get WebAssembly to run well on a very prominent architecture. It seems clear that the design decision was made without having enough information, and it appears that most runtimes continue to ignore the spec, so maybe the spec should change...
From the crypto use-case, it seems they all provide an SDK, so they can fully control the target wasm feature set and disable anything that could introduce non-determinism. If this is the main case for fully deterministic memory, I wonder if we could have it as an opt-in feature?
@sparker-arm, deterministic mode (introduced with the Relaxed SIMD proposal, which is similarly messy) is what I was getting at. But we'll still need to specify some concrete behaviour for that. I am trying to understand if this mode would even be implementable on Arm without severe performance hits, and what the cheapest semantics would be for that.
Well, is there any way that we can implement alignment guarantees in WebAssembly? (I'm not sure I understand the utility of alignment hints.)
@cfallin
> Is there any feature flag or other way to detect whether the underlying hardware does a partial write or not (other than, say, the runtime performing one and catching the signal at startup)? Is it at least guaranteed that the hardware behaves the same way consistently?
Sadly, the answer to all your questions is no. And yes, architecturally speaking there is no requirement that a particular implementation/microarchitecture/CPU behaves consistently, and that is even if we ignore the existence of technologies such as big.LITTLE (e.g. a single-core system).
> Even if this does work, I don't really think we should be coming up with hacks to get WebAssembly to run well on a very prominent architecture. It seems clear that the design decision was made without having enough information, and it appears that most runtimes continue to ignore the spec, so maybe the spec should change...
It'd still be useful from my perspective at least to have a definitive answer on the load-before-store idea: if the spec doesn't change, then this is the only reasonable option I see to keep competitive Wasm performance on aarch64, unless I'm missing some other option!
Also, to be precise, on AArch64 the behaviour is not implementation-defined in the sense that an implementation is supposed to choose a specific behaviour and to stick to it - architecturally the contents of the in-bound portion of the memory affected by a store become unknown (so in theory they could become zero, for instance, even if the value to be stored is non-zero).
> consensus also applies to trapping executions, since they have to materialise as a rejected message at the call site
@rossberg I'm still trying to understand the mechanics of consensus: do the linear memories of the instances have to be compared?
@sparker-arm, sort of. In the abstract, each node computes some form of cryptographic hash of the state changes it has performed, including the memory contents. Virtual memory techniques to identify dirtied pages at the end of execution make that practical and fairly efficient even for large memories, because the number of touched pages is bounded by the gas limit.
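A rough sketch of that scheme (hypothetical code; not Dfinity's actual implementation): each node hashes the indices and contents of only the pages its execution dirtied, and consensus compares the resulting digests.

```python
import hashlib

PAGE_SIZE = 65536  # Wasm page size

def state_hash(memory, dirty_pages):
    # Hash only the pages dirtied during execution, together with their
    # indices, rather than the entire linear memory.
    h = hashlib.sha256()
    for p in sorted(dirty_pages):
        h.update(p.to_bytes(8, "little"))
        h.update(memory[p * PAGE_SIZE:(p + 1) * PAGE_SIZE])
    return h.hexdigest()

node_a = bytearray(2 * PAGE_SIZE)
node_b = bytearray(2 * PAGE_SIZE)
node_a[5] = node_b[5] = 42  # identical executions dirty page 0 identically
assert state_hash(node_a, {0}) == state_hash(node_b, {0})

node_b[5] = 43  # any divergence, e.g. from a partial write, breaks consensus
assert state_hash(node_a, {0}) != state_hash(node_b, {0})
```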
That said, we shouldn't narrow the consideration to the obscure needs of blockchains. Portable, deterministic execution and the absence of undefined behaviour (even where non-determinism is allowed) have been among the original design goals and a big selling point of Wasm. We somehow need to maintain it. Demoting determinism to an opt-in mode already is a compromise that not everybody was happy with.
Okay, so I'm now trying to reason what is different about this case of non-determinism compared to the existing standard, and incoming extensions. Given that wasm can produce non-deterministic NaN values, this case doesn't feel wildly different to me.
I see that NaN values are normalised with instrumentation in the Dfinity SDK/runtime? Could partial writes be handled in the same way? Maybe even by breaking up stores into byte stores? It just seems there's existing precedent for working around the limitations in runtimes that need it.
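For concreteness, a sketch of what such instrumentation might look like (my illustration, not anything an SDK actually emits): an i32 store split into four bounds-checked, little-endian byte stores, so a partially OOB store deterministically writes exactly the in-bounds prefix, at the cost of several stores per access.

```python
def store_i32_bytewise(mem, addr, value):
    # Instrumented i32 store: one checked byte store per byte, little-endian,
    # trapping at the first OOB byte.
    for i in range(4):
        if addr + i >= len(mem):
            raise MemoryError("out of bounds")
        mem[addr + i] = (value >> (8 * i)) & 0xFF

mem = bytearray(8)
try:
    store_i32_bytewise(mem, 6, 0x04030201)  # straddles the end of memory
except MemoryError:
    pass

# Deterministic outcome: the two low-order bytes landed, nothing else moved.
assert bytes(mem) == b"\x00" * 6 + b"\x01\x02"
```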
It's in fact Wasmtime that does the NaN normalisation, it has a configuration option for that. That roughly corresponds to the idea of having a deterministic mode, which this engine (and possibly others) supports.
It would be okay-ish to only have deterministic behaviour of OOB writes in deterministic mode. But even then we need to resolve two questions:

- In "performance" mode, the behaviour must still be well-defined. In the worst case, this definition could amount to complete non-determinism (could write any value whatsoever), but preferably, we should narrow it down more than that.

- For deterministic mode we must specify a fixed behaviour. This behaviour should be implementable without excessive overhead and without burdening engines significantly. Breaking up all stores into individual bytes does not qualify, AFAIAC.
> In "performance" mode, the behaviour must still be well-defined. In the worst case, this definition could amount to complete non-determinism (could write any value whatsoever), but preferably, we should narrow it down more than that.
IMO we should solve this the same way we planned to solve bulk memory + mem.protect a while ago. Independently nondeterministically choose whether or not each in-bound byte is written. I don't think we should write a spec that admits the Aarch64 "thin air" behaviour.
EDIT: actually, are there any implications here for the "last" write of a bulk memory op?
EDIT2: first instinct: no, due to the eager bounds check on bulk ops
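That proposal could be modelled roughly as follows (my sketch, not spec text): each in-bounds byte of a partially OOB store is independently either written or left holding its old value, and nothing else is permitted, which rules out the AArch64 "thin air" outcome.

```python
import random

def store_relaxed(mem, addr, data, rng=random):
    # Fully in-bounds stores behave normally.
    if addr + len(data) <= len(mem):
        mem[addr:addr + len(data)] = data
        return
    # Partially OOB: independently, nondeterministically choose whether to
    # write each in-bounds byte, then trap at the first OOB byte.
    for i in range(len(data)):
        if addr + i >= len(mem):
            raise MemoryError("out of bounds")
        if rng.random() < 0.5:
            mem[addr + i] = data[i]

for seed in range(8):
    mem = bytearray(8)
    try:
        store_relaxed(mem, 6, b"\xaa\xbb\xcc\xdd", random.Random(seed))
    except MemoryError:
        pass
    # Each in-bounds byte is either untouched or the stored value...
    assert mem[6] in (0x00, 0xAA) and mem[7] in (0x00, 0xBB)
    # ...and nothing outside the access is modified.
    assert all(b == 0 for b in mem[:6])
```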