
[RFC] Assign part of wasm address space to a shared heap to share memory among wasm modules with zero-copying

Open wenyongh opened this issue 2 weeks ago • 11 comments

About the requirement

Many scenarios require sharing a memory buffer between two wasm modules without copying data (zero-copying), and developers have asked about this. But since the wasm spec assumes that a wasm app can only access data inside its own linear memory(ies), this is difficult to achieve: normally we have to copy data from the caller app's linear memory to the callee app's linear memory before calling the callee's function. People may use other methods, like multi-memory, GC references, or core module dynamic linking, but these have limitations, such as toolchain support, the user experience of writing the wasm application, the dependence on advanced wasm features, the footprint and so on. Here we propose a solution: assign part of the wasm address space to a shared heap to share memory among wasm modules with zero-copying.

Overview of the solution

As we know, there is an address mapping/conversion between the wasm address space of a linear memory and the host address space. For example, in wasm32 the linear memory's address space runs from 0 to linear_mem_size-1, with a maximum range of [0, 4GB-1], and the runtime allocates a corresponding host address range for it, let's say from linear_mem_base_addr to linear_mem_base_addr+linear_mem_size-1. The mapping is simple and linear: [0, linear_mem_size-1] in the wasm world <=> [linear_mem_base_addr, linear_mem_base_addr+linear_mem_size-1] in the host world. But since in most cases the max linear memory size is far smaller than 4GB, we can use the higher region of the wasm address space and map it to another runtime-managed heap to share memory among wasm modules (and also host native).

The idea is mainly to let the runtime create a shared heap for all wasm modules (and host native): all of them can allocate memory from the shared heap and pass the allocated buffer to other wasm modules and host native to access. The allocated buffer is mapped into the higher region of the wasm address space. In wasm32 the address space (often called the offset) of a wasm app runs from 0 to 4GB-1; these are relative addresses, not native absolute addresses. Suppose the wasm app's linear memory doesn't use all of that space (it uses 0 to linear_mem_size-1, and normally linear_mem_size is far smaller than 4GB). Then the runtime can use the higher region for the shared heap and map the shared heap's native address space into that region, for example from 4GB - shared_heap_size to 4GB - 1. The runtime then does a hack when executing the wasm load/store opcodes: if the offset to access is in the higher region (from 4GB - shared_heap_size to 4GB - 1), the runtime converts the offset into a native address in the shared heap; otherwise it converts the offset into a native address in the wasm app's private linear memory. Since the higher region of the wasm address space is the same for all wasm modules and the runtime accesses it in the same way for each of them, a wasm module can pass a buffer in that region to another wasm module, sharing the data with zero-copying.
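
The two-region conversion described above can be sketched in C as follows. This is only an illustration under stated assumptions: the SharedHeap and ModuleInst structs and the app_offset_to_native function are hypothetical names made up for this example, not WAMR's real data structures or APIs, and returning NULL stands in for the out-of-bounds trap a real runtime would raise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WASM32_ADDR_SPACE (UINT64_C(1) << 32) /* 4GB */

/* Hypothetical stand-ins for runtime state; not WAMR's real structs. */
typedef struct {
    uint8_t *base; /* native base address of the shared heap */
    uint32_t size; /* shared_heap_size, in bytes */
} SharedHeap;

typedef struct {
    uint8_t *linear_mem_base;      /* linear_mem_base_addr */
    uint64_t linear_mem_size;      /* size of the private linear memory */
    const SharedHeap *shared_heap; /* the same heap for every instance */
} ModuleInst;

/* Translate a wasm32 offset into a native pointer for an access of
 * `bytes` bytes. Returns NULL when the access is out of bounds. */
static uint8_t *
app_offset_to_native(const ModuleInst *inst, uint32_t offset, uint32_t bytes)
{
    assert(inst != NULL && inst->shared_heap != NULL);

    const SharedHeap *heap = inst->shared_heap;
    uint64_t end = (uint64_t)offset + bytes; /* 64-bit, so no wrap-around */
    uint64_t shared_start = WASM32_ADDR_SPACE - heap->size;

    if (heap->size > 0 && offset >= shared_start) {
        /* Higher region: [4GB - shared_heap_size, 4GB - 1] */
        if (end > WASM32_ADDR_SPACE)
            return NULL;
        return heap->base + (offset - shared_start);
    }

    /* Lower region: the module's private linear memory */
    if (end > inst->linear_mem_size)
        return NULL;
    return inst->linear_mem_base + offset;
}
```

With two instances that point at the same SharedHeap, an offset in the higher region resolves to the same native address in both, while an offset in the lower region resolves into each instance's own linear memory.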

The runtime also provides APIs to allocate/free memory from the shared heap. For example, a wasm app can import functions like (env, shared_malloc) and (env, shared_free) and call them; these import functions are implemented by the runtime. For host native, the runtime may provide APIs like wasm_runtime_shared_malloc and wasm_runtime_shared_free. The shared heap size can be specified by the developer during runtime initialization.
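
To illustrate what an allocation from the shared heap would hand back to each side, here is a toy sketch. Everything in it is an assumption for illustration: ToySharedHeap, toy_shared_malloc and native_to_app_offset are hypothetical names, the bump allocator is a stand-in for whatever allocator the runtime actually uses, and freeing is left out. The point is only the arithmetic: host native gets a native pointer, and a wasm app gets the same buffer as an offset in the higher region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WASM32_ADDR_SPACE (UINT64_C(1) << 32) /* 4GB */

/* Hypothetical toy heap: a bump allocator that never reuses memory. */
typedef struct {
    uint8_t *base; /* native base address of the shared heap */
    uint32_t size; /* shared_heap_size, chosen at runtime initialization */
    uint32_t used; /* bytes handed out so far */
} ToySharedHeap;

/* Host-side view: returns a native pointer, or NULL when exhausted. */
static void *
toy_shared_malloc(ToySharedHeap *heap, uint32_t size)
{
    assert(heap != NULL && heap->used <= heap->size);

    /* Round up to 8 bytes in 64-bit arithmetic to avoid overflow */
    uint64_t aligned = ((uint64_t)size + 7) & ~UINT64_C(7);
    if (size == 0 || aligned > (uint64_t)(heap->size - heap->used))
        return NULL;

    void *ptr = heap->base + heap->used;
    heap->used += (uint32_t)aligned;
    return ptr;
}

/* Wasm-app view: the same buffer expressed as an offset in
 * [4GB - shared_heap_size, 4GB - 1] of the wasm32 address space. */
static uint32_t
native_to_app_offset(const ToySharedHeap *heap, const void *ptr)
{
    uint64_t shared_start = WASM32_ADDR_SPACE - heap->size;
    uint64_t delta = (uint64_t)((const uint8_t *)ptr - heap->base);
    return (uint32_t)(shared_start + delta);
}
```

For a 4096-byte heap, the first allocation sits at native address base and at wasm offset 4GB - 4096; every module that receives that offset sees the same bytes.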

From the wasm app's point of view, it now has two separate address regions. This is not standard behavior under the wasm spec, but it doesn't break the wasm sandbox, since memory access boundary checks can be applied to both regions. There is a performance penalty because additional boundary checks must be added for the higher region, but I think it should be relatively small and acceptable compared to copying the buffer.

Eventually, when a wasm app wants to share a buffer with another wasm app, the code may look like:

    buffer = shared_malloc(buffer_size);
    write data to buffer;
    call func of other app with buffer as argument
    ...
    shared_free(buffer);


Main changes

  • Add WAMR_BUILD_SHARED_HEAP cmake variable and WASM_ENABLE_SHARED_HEAP macro
  • Add wamrc --enable-shared-heap flag
  • Add a global shared heap, and put it in the default memory (use same definition of default memory in multi-memory feature #3381):
    • If the wasm app has one or more defined memories, then the first defined memory is the default memory
    • If the wasm app has no defined memory, then the first imported memory is the default memory
  • Support setting shared heap’s size and shared heap’s allocation options in RuntimeInitArgs for wasm_runtime_full_init
  • Add shared heap info in WASMModuleInstanceExtra and AOTModuleInstanceExtra
    • heap_handle, base_addr, size and so on
    • For AOTModuleInstanceExtra, put the info after DefPointer(const uint32 *, stack_sizes), and the layout is pre-known by aot compiler during compilation time
  • The code that may be impacted by address conversion/validation:
    • aot code boundary check
    • interpreter/fast-jit boundary check
    • runtime APIs of address conversion and validation
      • wasm_runtime_validate_app_addr, wasm_runtime_addr_app_to_native, etc
    • wasm_runtime_invoke_native, wasm_runtime_invoke_native_raw
    • libc-builtin and libc-wasi may do extra checks
    • more?
  • Add check in load module: the default memory’s max memory size should be no larger than 4G-shared_heap_size, otherwise, truncate it to 4G-shared_heap_size
  • Add API wasm_runtime_shared_malloc/wasm_runtime_shared_free
  • Wasm apps use the shared_malloc/shared_free APIs to allocate/free memory from/to the shared heap
  • How to do boundary check:
    • if (offset >= 4G-shared_heap_size && offset < 4G) then access the shared heap, else keep the original behavior (for both hw bound check and non hw bound check)
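
The load-time check on the default memory's max size could look roughly like the sketch below. It is an assumption-laden illustration, not the actual loader code: clamp_max_pages is a hypothetical helper, and expressing the limit in 64KiB wasm pages (rounding down when shared_heap_size is not page-aligned) is my reading of "no larger than 4G-shared_heap_size".

```c
#include <assert.h>
#include <stdint.h>

#define WASM32_ADDR_SPACE (UINT64_C(1) << 32) /* 4G */
#define WASM_PAGE_SIZE UINT64_C(65536)        /* 64KiB wasm page */

/* Return the max page count the default memory may keep so that
 * max_pages * 64KiB <= 4G - shared_heap_size. */
static uint32_t
clamp_max_pages(uint32_t declared_max_pages, uint32_t shared_heap_size)
{
    uint64_t limit_bytes = WASM32_ADDR_SPACE - shared_heap_size;
    uint64_t limit_pages = limit_bytes / WASM_PAGE_SIZE; /* round down */

    assert(limit_pages <= 65536);
    if (declared_max_pages > limit_pages)
        return (uint32_t)limit_pages; /* truncate to fit below the heap */
    return declared_max_pages;
}
```

With this clamp in place the linear memory can never grow into [4G-shared_heap_size, 4G), so the range test in the boundary check cleanly separates the two regions.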

Others

  • It will impact performance since extra boundary check code is added, but it should still compare well against copying data to the callee's linear memory.

wenyongh, Jun 19 '24 01:06