Multibyte Access to Array i8
Summary
Add instructions to read multiple bytes at a time from (array i8).
Motivation
A number of languages currently use (array i8) as a backing store for custom data types and/or byte-buffer-style objects (e.g., custom structs, Dart typed arrays, JVM byte arrays). Reading and writing these data types requires sequences of single-byte operations that are highly inefficient, hindering performance.
More discussion on this issue in the GC repo.
Proposal
Add the following new array instructions. All of the instructions will be similar to the load/store memory instructions and take an immediate memarg {offset u32, align u32}. The load_X instructions will only be valid when expand($t) = array (mut i8) and the store_X instructions when expand($t) = array (var i8). All instructions will trap on out-of-bounds offsets (including partially out-of-bounds accesses).
array.load_i16_{u,s}
array.load_i32
array.load_i64
array.load_f32
array.load_f64
array.load_v128
array.store_i16
array.store_i32
array.store_i64
array.store_f32
array.store_f64
array.store_v128
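For concreteness, here is a small executable sketch of the intended load/store semantics (my own model, not normative): the array is treated as a byte buffer, values use the same little-endian representation as linear memories, and any access that extends past the end of the array traps, including partially out-of-bounds ones.

```python
import struct

class Trap(Exception):
    """Models a wasm trap (out-of-bounds array access)."""

def array_load_i32(arr: bytearray, index: int) -> int:
    # Traps if any of the 4 accessed bytes is out of bounds,
    # i.e. partially out-of-bounds accesses trap too.
    if index < 0 or index + 4 > len(arr):
        raise Trap("out of bounds array access")
    return struct.unpack_from("<i", arr, index)[0]  # little-endian, as for memories

def array_store_i32(arr: bytearray, index: int, value: int) -> None:
    if index < 0 or index + 4 > len(arr):
        raise Trap("out of bounds array access")
    struct.pack_into("<i", arr, index, value)
```

The other widths (i16, i64, f32, f64, v128) would follow the same pattern with their respective sizes.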
Alternatives
In the original discussion a few alternatives were discussed. A summary of the different options:
- Use current load/store instructions, but use some of the memarg bits to signal load/stores from arrays.
- Reinterpret cast on array types.
- Pinning part of the array to a memory.
- Use the slice proposal.
The above options do have some advantages such as not introducing more instructions or reusing other proposals. However, they are all more complicated than adding relatively straightforward array instructions that would achieve the desired results.
Open Questions
- Do we need other simd load variants e.g. get_8x8_s, …?
- Would it be preferable to have get_X and set_X instructions that take indexes instead of memargs as an operand?
cc @mkustermann, we were talking about Dart wanting something like this recently.
Thanks for proposing this. I think overall this makes sense to me.
All of the instructions will be similar to the load/store memory instructions and take an immediate memarg {offset u32 , align u32 }.
I don't know if we want to take a memarg here. I don't think we'd be able to do anything useful with the offset or align arguments with how wasm GC arrays are implemented. memarg also now has a memory index from multi memory that we'd need to ignore.
Use the slice proposal (https://github.com/WebAssembly/design/issues/1555).
I see this proposal as orthogonal to adding slices to wasm, and would support both. My mental model for a wasm array is something that stores the actual elements inline like a C array, while a slice has an inherent indirection, pointing into memory that is owned by some other object (host or wasm). So instructions that perform multibyte access would be useful with either object.
I'd want to add this to the open questions:
Do we need atomic variants of these operations as part of shared-everything-threads?
I would hope the answer is no? If the answer is yes, we'll need to bring up alignment requirements for the atomic variants of the operations.
And I assume the i32 index from the value stack is in terms of the i8 elements of the array and unrelated to the load representation, so even with an offset that is a multiple of sizeof(load type), the access could be unaligned?
I would hope the answer is no? If the answer is yes, we'll need to bring up alignment requirements for the atomic variants of the operations.
This proposal + shared-everything should have atomic versions of these ops (although this proposal may move faster, in which case the onus would be on shared-everything). I don't think this is too difficult - we should use the same alignment requirements that already exist for atomic ops on memories.
This would mean that realistic implementations would need to ensure that shared arrays are word-aligned, but I don't think anyone would have a problem with this!
And I assume the i32 index from the value stack is in terms of the i8 elements of the array and unrelated to the load representation, so even with an offset that is a multiple of sizeof(load type), the access could be unaligned?
That would be most consistent with the way the existing load/store instructions work for memories. It's still ok (and maybe useful) for non-atomic accesses to be unaligned.
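Concretely (a sketch of my reading, using Python's struct with the little-endian byte order that linear memories use): if the i32 operand indexes i8 elements, then a 4-byte load at index 1 is simply an unaligned access, and for non-atomic operations it still succeeds.

```python
import struct

# Hypothetical byte contents of an (array i8); index 1 is not 4-byte aligned.
buf = bytearray(b"\x00\xaa\xbb\xcc\xdd\x00")

# A hypothetical array.load_i32 at index 1 reads the 4 bytes starting at
# byte offset 1, regardless of the 4-byte width of the access type --
# i.e. an unaligned (but permitted, non-atomic) read.
unaligned = struct.unpack_from("<I", buf, 1)[0]
assert unaligned == 0xDDCCBBAA
```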
A couple of questions/thoughts about the OP:
Add the following new array instructions. All of the instructions will be similar to the load/store memory instructions and take an immediate memarg {offset u32 , align u32 }. The load_X instructions will only be valid when
expand($t) = array (mut i8) and the store_X instructions when expand($t) = array (var i8). All instructions will trap with out of bound offsets (including partially out of bounds).
Should this be expand($t) = array i8 for load instructions and expand($t) = array mut i8 for store instructions?
Reinterpret cast on array types.
There are a few nice consequences of this alternative approach:
- fewer new instructions
- atomics ops from shared-everything are automatically inherited
- combinations like i32 accessed as i64 or v128 are also possible
Have we considered this alternative seriously at any point? I can imagine this working uncontroversially for the i{8,16,32,64} types, but would there be any complications for floats or SIMD?
Reinterpret cast on array types.
There are a few nice consequences of this alternative approach:
There are some significant downsides to this: Right now there are very clear rules for aliasing. A store on an i32 array can never alias with a load on an i64 array, etc. Allowing reinterpret casts on them would break this, and optimizing compilers would need more "smartness" in tracking the potential aliasing; in many cases this would lead to regressions in how many loads can be eliminated.
@conrad-watt My primary concern around reinterpreting one POD array as another (e.g. i32 as i8) is how to emit efficient bounds checks. For SM, our arrays all have a length field which is the 'number of elements' (not bytes). If you were to be able to reinterpret all the different kinds of POD arrays between each other we would no longer statically know what the element size originally was and it wouldn't work. I think we'd need to always store the byte length there instead, and then convert the input index into a byte index for all bounds checks.
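The bounds-check concern can be made concrete with a small sketch (my own illustration, not SpiderMonkey's actual code): with the length field storing an element count, checks on (array i8) stay a single comparison because the element size is statically 1, but a reinterpreted array would force the engine to recover the byte length dynamically from an element size that is no longer statically known at the access site.

```python
def bounds_check_i8_array(length_in_elems: int, index: int, access_bytes: int) -> bool:
    # For (array i8) the element size is statically 1, so the stored
    # element count *is* the byte length: one comparison suffices.
    return 0 <= index and index + access_bytes <= length_in_elems

def bounds_check_reinterpreted(length_in_elems: int, orig_elem_size: int,
                               index: int, access_bytes: int) -> bool:
    # If arrays could be reinterpreted across element types, the stored
    # length would be in units of the *original* element size, so the
    # engine must first recover the byte length dynamically.
    byte_length = length_in_elems * orig_elem_size
    return 0 <= index and index + access_bytes <= byte_length
```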
It's also unclear how reinterpret casts could be made to interact as you would expect with normal casts. The underlying heap value can only have one RTT, so the fact that the array has multiple static types would not be able to be reflected in the runtime type, at least not without some extra complexity I haven't thought about.
I am worried that adding bespoke instructions on Wasm GC arrays will constantly fall short of what is possible on Wasm linear memories. There are already many, many load+store instructions on Wasm linear memories, which were carefully designed to expose all of the hardware performance possible. With atomics, we also have read-modify write and sequentially consistent atomics. Adding new instructions to Wasm GC that only handle a few cases, especially when they are tied to just array i8, seems like an ad hoc solution that will require constant updates and feature creep.
I think the best way to solve this problem is to find a connection (through the above-mentioned slices, or another solution) to temporary memories with a custom page size. We should focus on how to solve that problem so that the existing instructions can all work, because ultimately, the engine will generate effectively the same machine code, with just different bounds checking.
There are certainly several wrinkles to work out around that (e.g. how wait/notify would work, how expensive a slice operation is, how long slices live, whether they are first class, etc), but I think this temporary solution would result in a cascade of issues we'll not be able to back out of.
I agree that it seems inevitable that we will want near feature parity[^1] between memories and arrays. I also agree that we should think about our options to simplify that connection as much as possible. But I think "just create an array version of every applicable memory operation" is actually an extremely simple solution that will be hard to beat and easy to scale, even if it's sad to use so many opcodes.
[^1]: I do not anticipate ever needing (or wanting to provide) wait and notify on arrays. waitqueueref should be better for all producers that can use WasmGC.
I agree with @titzer that finding a way to represent this zoo in terms of existing memory instructions is more economical and at least an attractive option to consider. I used to think this wouldn't fly because of their entirely different implementation, but custom page sizes indeed could be a usable hook.
Note however that arrays are referenced dynamically instead of having static indices. Such a solution would hence effectively require introducing versions of load/store instructions for first-class memories, likely by extending their flag field further. Not necessarily a problem, but a complication in its own right. But it may be justified by being more forward-looking and scaling better.
Such a solution would hence effectively require introducing versions of load/store instructions for first-class memories, likely by extending their flag field further.
Wouldn't that still mean that instead of duplicating most of the memory operations for wasm-gc arrays, we'd duplicate them for first-class memories and then potentially share the opcodes with arrays? That's still the same number of new instructions?
I don't agree that memories with custom page sizes are anywhere close to arrays:
- custom page size memories have a size and a max size, they can grow, while a wasm-gc array is fixed size (helpful for bounds checks).
- In web engines a wasm memory is backed by an ArrayBuffer, which is not only much more expensive than a gc array, it is also much more complicated. (E.g. when there are two accesses to the same memory location with a call with unknown side effects in between, the second memory access could fail because the ArrayBuffer could be detached by then.) So from an implementation perspective, there wouldn't really be any benefit in aligning gc-array and memory operations, as they are completely distinct in what the engine does.
- Wasm is still expanding the capabilities of memories. We have memory64 now, for which there isn't an equivalent for gc-arrays, and memory-control and other proposals will just make arrays and memories even more distinct in terms of capabilities than they already are, so putting them closer together doesn't really sound like something we'd want?
@Liedtke, it wouldn't require new opcodes, since the existing loads/stores already have an extension point with their flag field. We used that for multi-memories already. We would only need new opcodes for memory.* instructions on memory references, if we even want to support those right now.
I don't disagree that there are various open questions around such a design, but it would be worth at least thinking through. A few thoughts on the points you brought up:
Re 1: An array would only be viewable as a memory with dynamic type whose lower and upper bound are the same. Such memories are not growable either. And this is not necessarily visible from the static type of a memory either, which may under-approximate its actual dynamic type. Arrays don't add anything new in that regard.
Re 2: The hope would be that memories with custom page size 1 are potentially more lightweight. We could even annotate array view memory types further if needs be. I don't think detaching an ArrayBuffer associated with a memory is possible or observable from within Wasm, that would violate its memory model.
Re 3: I would actually take possible future features as an argument in favour of the unification, since we might well want all the load/store functionality, such as 64-bit indexing, on arrays in the future. I assume that memory control will probably require new memory type attributes anyway, since its availability affects basic invariants of a memory. Given that, I'd expect that we would still be able to treat arrays as an instance of a particular memory type.
Another option that I've kicked around is to have non-first class memories whose underlying storage can be dynamically rebound. A (non-shared) memory could be declared to be "rebindable" and a dynamic instruction could rebind it to refer to a slice of memory, e.g.
`memory.bind[mem_index] : [(ref null (array i8)) i32 i32] -> []`
That instruction would bind a slice of the input array (start and length) to the memory index mem_index, which could then be used like other memories. It'd be non-shared to avoid race conditions between binding and accessing across two threads.
This has downsides, but big advantages in that I/O APIs could also work with GC'd arrays then as well.
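For illustration, a rough Python model of the rebinding workflow described above (my own sketch; the class and method names are hypothetical, and the real semantics would be defined by the instruction, not a host object):

```python
class RebindableMemory:
    """Models a non-shared memory index whose backing store can be
    dynamically rebound to a slice of a GC byte array."""

    def __init__(self) -> None:
        self._array = bytearray()
        self._start = 0
        self._length = 0

    def bind(self, array: bytearray, start: int, length: int) -> None:
        # Models memory.bind[mem_index]: [(ref null (array i8)) i32 i32] -> []
        # Binding traps if the requested slice is out of bounds.
        if start < 0 or length < 0 or start + length > len(array):
            raise Exception("trap: slice out of bounds")
        self._array, self._start, self._length = array, start, length

    def load_i8(self, addr: int) -> int:
        # Existing memory instructions would then operate on the bound
        # slice, with bounds checks against the slice length.
        if addr < 0 or addr >= self._length:
            raise Exception("trap: out of bounds memory access")
        return self._array[self._start + addr]
```

No temporary objects are created by a bind in this model, which mirrors the "non-trashy" property argued for below.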
@titzer, hm, I'm less sure about this option. That does not fit the (presumably common) use case well where there is more than a small, statically fixed number of array objects. And it seems like an odd mechanism to add on top of refs and tables, which we use in other comparable cases. Introducing proper memrefs is the more flexible and systematic approach (which could equally be used by I/O APIs). Eventually we'll want them anyway.
@rossberg I agree that a first-class slice of an array and memory would be a nice feature, and I could see a path to allowing existing load/store instructions supporting them with a flag, but there are other memory instructions that don't have flags, e.g. all of the bulk memory operations. The idea with the bindable memories is that programs would use them to temporarily bind a slice to a memory index, do some operations on the temporary binding, and then release or rebind them to something else. It's a workaround to fit arrays (and memory subranges) into the existing set of instructions (and incidentally, I/O APIs) that are all centered around memory indices. They'd also be non-trashy, in that binding a memory wouldn't create any temporary objects. I'm not sure how to do that with memrefs; they'd be at least a triple of (array, start, length), which is a bit too wide for a fat pointer.
@titzer:
The idea with the bindable memories is that programs would use them to temporarily bind a slice to a memory index, do some operations on the temporary binding, and then release or rebind them to something else.
One of the problems I have with that is that it would be providing a local temporary binding through global state. That approach is brittle with reentrance, and basically incompatible with threads, coroutines, and the likes.
@rossberg I was considering a number of more local alternatives such as allowing binding a memory local to a function or local to a block, but that shares downsides (for interpreters and other parts of an engine) similar to the lexical rethrow mechanism, such as having state local to a function. I think rebinding is only brittle at the application level, not at the Wasm level. There are plenty of other ways of using Wasm that are brittle at the application level.
I see rebinding more as a way to retrofit Wasm GC arrays into an already large and complex ecosystem and tool space that has grown accustomed to linear memory, rather than building a new mechanism that will further bifurcate the GC and non-GC worlds.
@titzer:
I think rebinding is only brittle at the application level, not at the Wasm level.
Hm, can you elaborate what you mean by that? I'd believe that using a fixed set of global state for a dynamic set of local bindings is problematic no matter what, since it does not compose or scale and moreover, would usually have to be accompanied by global locks to not blow up in the presence of threads. How would you imagine producers dealing with that?
I think it'd be fine if we end up with two copies of all of the memory instructions, as long as it solves all the use cases and we don't end up needing three copies. One for first class 'sliceref'/'memoryref'/'whatever', and the other for our current second class memories. It won't make implementations any more complicated, and just take up opcode space (which we have plenty of in prefixes).
I'd prefer we not overload the flags of load/store instructions so that they start referencing something first class and instead just choose new opcodes. These code paths are likely to be very different (memories are very special), and so I think a separate opcode is warranted.
For those who want to follow along... I've added an overview in the multibyte access repo. The current proposal is to reuse the existing load/store memory instructions. I've also started prototyping it in binaryen.
Please file issues for any concerns or thoughts on how you want the proposal to progress. I'd also be curious to hear what people think we should do for the text format syntax.