Flat data representation proposal: Enables zero-copy shared memory, zero-allocation return types, and binary serialization
This all started with defining zero-copy shared memory over a WIT interface (the channel is a WIT resource; inspired by iceoryx2):
let channel = Channel_u32::new("topic");
loop {
    let message = channel.allocate().await; // WASI 0.3
    message.set(42);
    message.send();
}
and on the receiver side
let subscription = Subscription::new("topic");
loop {
    dbg!(subscription.read().await);
}
with a WIT definition similar to
resource object {
    set: func(value: u32);
    send: static func(message: object);
}

resource channel {
    allocate: func() -> future<object>;
}

resource subscription {
    read: func() -> future<u32>;
}
This is all fine unless you try to place a list<string> inside the shared memory. That put me on a journey which culminated in this discussion issue, … after I figured out a way to express this in WIT (inspired by FlatBuffers and Cap'n Proto).
Flat marker
Adding a flat<T[, P]> marker, e.g. flat<list<string>, u16>, to arguments or results will change the data representation to a flat binary encoding: all pointers in list and string become the second type and are relative to the current position. The same type is used for length encoding. The default pointer type P could be s32.
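To illustrate, here is one possible byte layout (the exact offsets and the little-endian choice are my assumptions, not a fixed spec) of flat<list<string>, u16> holding the value ["hi"]:

const FLAT_HI: [u8; 10] = [
    4, 0,       // list header: relative pointer (+4) to the element headers
    1, 0,       // list header: length 1
    4, 0,       // string 0: relative pointer (+4) to its byte payload
    2, 0,       // string 0: length 2
    b'h', b'i', // string 0: payload
];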
Passing an argument will follow the normal ownership rules, so imported functions only pass a view while exported functions pass ownership of the buffer. The flat type is represented by a classical (pointer, length) pair. See https://bytecodealliance.zulipchat.com/#narrow/stream/438936-SIG-Embedded/topic/Sept.2017th.202024.20Meeting/near/470965874 for data encoding examples.
Returning a flat data type would change the signature to take a caller-provided (uninitialized) buffer as the last argument (also a (pointer, length) pair). The call returns the used length (0 indicates an error/buffer overflow). This makes the call well-defined with respect to (partial) ownership transfer.
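As an illustration, an imported f: func() -> flat<list<string>, u16> might lower to something like this on the Rust guest side (the symbol name and the exact types are assumptions):

extern "C" {
    // The last argument is the caller-provided, uninitialized result buffer.
    // The return value is the number of bytes used; 0 signals an error or
    // a buffer overflow.
    fn f(result_ptr: *mut u8, result_capacity: usize) -> usize;
}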
Similar to async with WASI 0.3 and future<T>, this could become a general option applicable to all functions, making #385 unnecessary, because this approach is more flexible and more storage-efficient.
Buffer objects
Obtaining these buffers from the IPC component requires two new WIT return types: buffer-mut<T> and buffer-view<T> (read-only). Both would encode as (pointer, length) and require a drop method to indicate that the buffer/view is no longer in use.
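A purely illustrative Rust-side binding (the struct layout and the drop hook are my assumptions about what generated bindings could look like):

struct BufferView<T> {
    ptr: *const T, // the (pointer, length) encoding mentioned above
    len: usize,
}

impl<T> BufferView<T> {
    fn as_slice(&self) -> &[T] {
        // sound as long as the host keeps the mapping alive until drop
        unsafe { std::slice::from_raw_parts(self.ptr, self.len) }
    }
}

impl<T> Drop for BufferView<T> {
    fn drop(&mut self) {
        // generated code would call the imported drop function here,
        // telling the host that the view is no longer in use
    }
}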
Side benefits
This data representation can also be used as a disk or network encoding of data expressed in WIT (make sure to version your WIT description).
API considerations
True zero-copy construction of these flat data types requires knowing the size of a list in advance and passing it to the constructor, so that objects can be placed linearly in the buffer; relative pointers could be unsigned to simplify the encoding logic.
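A minimal builder sketch of what such a constructor could look like (all names and the u16 header layout are assumptions, matching the layout example above):

struct FlatListBuilder<'a> {
    buf: &'a mut [u8], // e.g. backed by shared memory
    header: usize,     // offset of the next unpatched element header
    tail: usize,       // offset where the next payload gets appended
}

impl<'a> FlatListBuilder<'a> {
    // The element count must be known up front so that all (rel-ptr, len)
    // headers can be reserved before the variable-size payloads follow.
    fn new(buf: &'a mut [u8], len: u16) -> Self {
        buf[0..2].copy_from_slice(&4u16.to_le_bytes()); // rel-ptr to elements
        buf[2..4].copy_from_slice(&len.to_le_bytes());  // list length
        FlatListBuilder { buf, header: 4, tail: 4 + usize::from(len) * 4 }
    }

    // Append one string payload linearly and patch its reserved header.
    fn push_str(&mut self, s: &str) {
        let rel = (self.tail - self.header) as u16; // relative to the header
        self.buf[self.header..self.header + 2].copy_from_slice(&rel.to_le_bytes());
        self.buf[self.header + 2..self.header + 4]
            .copy_from_slice(&(s.len() as u16).to_le_bytes());
        self.buf[self.tail..self.tail + s.len()].copy_from_slice(s.as_bytes());
        self.header += 4;
        self.tail += s.len();
    }
}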
See the links in https://bytecodealliance.zulipchat.com/#narrow/stream/438936-SIG-Embedded/topic/Sept.2017th.202024.20Meeting/near/470497166 for API examples in Rust and C++.
PS: I initially represented read-only flat types by address only (as the length can be calculated from the data), but this feels counterproductive from a verification and storage perspective.
Of course, the lowering of flat POD types would be identical to that of normal POD types (I consider (resource) handles POD here), so the modifier only applies (recursively) to string and list representations.
Update: (Resource) handles don't serialize well across systems, so this needs more thought on when to forbid them.
Having a "flat" binary representation of compound values could make a lot of sense and I've tried to imagine different ABI variations too (esp. in the context of streams, which help address the issue of not knowing how much buffer space to allocate since you can always just fill up one buffer, say "not done", and return for the next buffer). However, I've generally thought of this in terms of Canonical ABI options, since it's a low-level representation choice; is there a specific benefit to escalating this detail into the WIT-level type, where it applies to all languages and memory types (e.g., wasm-gc...)?
Second, while I can see potential efficiency benefits to a flat binary representation, I don't see how this achieves "zero copy shared memory" -- it seems like the basic requirement to copy between separate components' separate linear memories remains?
Lastly, I wasn't able to follow the "Buffer objects" section and how it relates to the flat type or how buffer-mut<T>/buffer-view<T> compare to, e.g., the readable-buffer<T> and writable-buffer<T> of #369.
I started with a WIT marker because I assumed that the same interface might mix flat and normal ABI calls, but I am no longer sure about this, especially since flat types offer some unique benefits but are source-incompatible with the normal Vec and String types (in Rust; similarly for C++).
Zero copy comes into view if you construct the lowered elements in place in shared memory (you use a buffer located in shared memory to construct everything) and use them on the receiver side without lifting. Of course, for wasm you need either multi-memory (shared pages) or mmap support to enable two components to access the same physical memory. Host (mmap) support could enable spatial freedom from interference: either exactly one component can write to a memory region, or multiple components can read from it, but never both at the same time. The host would handle the transitions between these states (similar to what iceoryx does).
This assumes that you reached a state where the copying of information between components is more costly than remapping virtual memory. This is typical for large AI tensors and camera images.
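As a sketch of the access states the host would enforce (the enum and all names are my assumptions, modeled on the iceoryx-style behavior above):

type ComponentId = u32; // stand-in for however the host identifies components

// Exclusive-write XOR shared-read: the host transitions a region between
// these two states so that writers and readers never overlap in time.
enum RegionState {
    ExclusiveWrite { writer: ComponentId },
    SharedRead { readers: Vec<ComponentId> },
}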
The flat buffer types are handles to the shared memory managed by the host logic*: one read-only shareable type, one exclusively writable type. The difference from a non-flat read/write buffer is that the flat buffer also contains all the second- and third-level allocations, so a list<list<string>> object becomes a single contiguous memory object within a single allocation.
*) Or local buffers pre-allocated and then passed to functions to place the result into.
The difference from a non-flat read/write buffer is that the flat buffer also contains all the second- and third-level allocations, so a list<list<string>> object becomes a single contiguous memory object within a single allocation.
Ah I see, that's an interesting point. I suppose we have the option to say that a readable-buffer<T>/writable-buffer<T> could use a different, flat ABI for the T. That being said, in some cases, the indirection is actually what you want (considering that in many cases 99% of the bytes are in the "leaves" of a compound value and being able to just point to the pre-existing allocations avoids what would otherwise be an extra copy into the flat buffer). But perhaps there could be a flat canonopt that lets you opt into this flat ABI for buffers?
Of course for wasm you need either multi-memory (shared pages) ...
Many folks have suggested using multi-memory as a solution to avoiding copies over the years, but we keep finding that, in practice, "regular" C/C++/Rust code can only access the default memory so if you use a shared non-default memory to pass values, you'll end up with 2 copies (source → shared → destination). I keep asking someone to show me real code that would achieve zero-copy in practice using multi-memory (b/c hypothetically it's possible), but I haven't seen it yet.
... or mmap support to enable two components to access the same physical memory. [...]. This assumes that you reached a state where the copying of information between components is more costly than remapping virtual memory. This is typical for large AI tensors and camera images.
One way to amortize the cost of establishing a shared mapping is to create a long-lived connection between two components which they can use to repeatedly pass chunks of memory. My intuition is that streams might be the right abstraction here (for repeatedly passing a large (flat) element). So perhaps the flat option mentioned above could also apply to streams (which lines up with the idea that streams are just a sequence of buffers).
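A minimal self-contained sketch of that amortization (every name here is a stand-in assumption, not a real WASI API): the mapping is established once and then reused for each element passed over the long-lived connection.

struct SharedMapping {
    buf: Vec<u8>, // stands in for a host-managed, memory-mapped region
}

impl SharedMapping {
    fn establish(len: usize) -> Self {
        // one-time cost: a real host would mmap shared pages here
        SharedMapping { buf: vec![0; len] }
    }

    fn publish(&mut self, payload: &[u8]) {
        // per-element cost: construct the flat value in place; the host
        // would then flip the region to read-only and notify readers
        self.buf[..payload.len()].copy_from_slice(payload);
    }
}

fn main() {
    let mut mapping = SharedMapping::establish(4096); // amortized once
    for i in 0..3u8 {
        mapping.publish(&[i; 16]); // each stream element reuses the mapping
    }
}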
🤔 I feel that a proof-of-concept implementation might be a good idea to see how shared memory and flat types could work together to achieve zero copy. I will give it a try (most likely Rust- and wasmtime-based).
I feel that 'multi-memory' is more convenient for communication between the host and Wasm. I mean, if we give a Wasm module an additional imported memory that is provided by the host, the host can store data in that specific area, and Wasm can access it directly without needing to copy it from the host's memory to Wasm's linear memory.
@lum1n0us Do you know a good way to model access to a non-default (non-zero) memory from a clang-compiled language, e.g. C or Rust? Load and store intrinsics could be a solution, but that feels clumsy, and such data cannot be passed via a pointer/reference argument to subroutines; segmented memory means that every load/store pays a significant penalty for encoding memory index and offset separately. I think mmap as an extension of memory-control is the most reasonable strategy I can come up with.
@cpetig Yes, I think that is the fundamental challenge we're working with here. And to summarize previous discussions: if the solution is to copy from the second-memory into the default-memory, I think we end up with something net worse, both in terms of performance (2 copies instead of one) and portability (since the entire contents of this non-default linear memory are now the host/guest interface, observable at all times at any address -- very likely to expose subtle impl differences that break programs in practice at scale over time).
I just created a working proof-of-concept crate for flat data parsing and creation (https://github.com/cpetig/flat-types-rust); the API already looks usable but will need a lot of extensions to provide a nice DX. I kept the enum, struct, and tuple APIs out of scope for now; a derive macro will likely provide these in a "somewhat" elegant way (set_X, get_X functions).
I will continue my work on the shm wasm interface.
I started a first prototype of shared-memory zero copy at https://github.com/cpetig/wasm-shm-test/blob/main/wit/shm.wit#L12 but haven't completed it yet.
Progress update: I have a working example of zero-copy single-publisher/multiple-subscriber communication with the AUTOSAR Adaptive API: asynchronous WASI 0.3 streams, compiled to native, combining Rust and C++ in a single executable (with multiple shared-object modules), and carrying complex data types (list/vector and string) in the exchanged data.
So with some more design work I ended up at a prototype which could soon work with wasmtime: https://github.com/cpetig/wasm-shm-test/blob/main/symmetric/test/publisher/src/lib.rs (publisher source code). I will continue to extend this towards a working prototype.
🥳 Update: Starting from my AUTOSAR experiments I now have a WebAssembly prototype for zero-copy publisher/subscriber. It uses WASI 0.3 streams to broadcast host-side shared-memory buffers (a resource) from one publisher to two subscribers. The components pre-allocate linear memory (size calculated by the host to enable page-aligned padding) for memory-mapping the data into their linear address space.
The buffer API could be implemented with copying without the guests noticing, and several efficient zero-copy mappings to MPU-only embedded targets are possible. The host can optimize the implementation to avoid attachment at run time without affecting the ownership logic; this of course assumes a well-behaved publisher, while subscribers remain untrusted. Adding a subscriber only requires creating another stream<memory-block> on the sender side (and writing four bytes to it per publication), so the overhead per subscriber is minimal.
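A sketch of this fan-out (the types are stand-in assumptions for the prototype's stream<memory-block> plumbing, not its actual API):

type BufferHandle = u32; // the "four bytes" written per publication

struct Publisher {
    // one stand-in stream per subscriber; really stream<memory-block>
    subscriber_streams: Vec<Vec<BufferHandle>>,
}

impl Publisher {
    // Adding a subscriber only creates another stream on the sender side.
    fn add_subscriber(&mut self) {
        self.subscriber_streams.push(Vec::new());
    }

    // Publishing writes one 4-byte handle to each subscriber's stream.
    fn publish(&mut self, handle: BufferHandle) {
        for stream in &mut self.subscriber_streams {
            stream.push(handle);
        }
    }
}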
In point-to-point communication the subscriber receives ownership of resources sent by the publisher. For shared buffers the writer passes ownership to the buffer; subscribers won't touch it (they only get read-only borrow access), and the writer destructs the previous objects when the buffer is re-attached for overwriting.
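To make the ownership rules concrete, a hypothetical model (not the prototype's real API):

struct SharedBuffer {
    contents: Vec<String>, // stands in for the flat objects owned by the buffer
}

impl SharedBuffer {
    // Subscribers only ever get read-only borrow access.
    fn read(&self) -> &[String] {
        &self.contents
    }

    // The writer re-attaches the buffer for overwriting; the previous
    // objects are destructed here, never by a subscriber.
    fn attach_for_write(&mut self) -> &mut Vec<String> {
        self.contents.clear();
        &mut self.contents
    }
}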
Note: I abuse dummy resource handles to represent linear addresses, as these scale to 64 bits on native.