
Make TensorFlow Reuse Buffers in WebAssembly

Open Michael-F-Bryan opened this issue 2 years ago • 3 comments

Quoting directly from https://github.com/hotg-ai/rune/issues/242#issuecomment-906305787


At the moment all data lives inside WebAssembly linear memory and the only copies that happen are when we do inference.

I think we should try to keep that property because trying to have some buffers living inside WebAssembly and others living in the host's memory sounds like a bad idea (hard to manage lifetimes, WebAssembly can't directly access host memory, etc.).

As I understand it, the inference copies are mainly because we haven't told TensorFlow Lite to read/write directly from WebAssembly memory instead of allocating its own buffers, and not due to a limitation in the Rune-Runtime interface. So an easy way to avoid copying into/out of TensorFlow is to just make it use the buffers we provide.

Michael-F-Bryan avatar Sep 06 '21 19:09 Michael-F-Bryan

Okay, so this is slightly more complicated than I thought it would be.

There are two types of zero-copy code paths:

  1. TfLiteDelegate-specific buffer handles:
/*
Set the delegate buffer handle to a tensor.

It can be called in the following cases:

    * Set the buffer handle to a tensor that's not being written by a delegate.
    For example, feeding an OpenGL texture as the input of the inference graph.
    * Set the buffer handle to a tensor that uses the same delegate. 
    For example, set an OpenGL texture as the output of inference, 
    while the node which produces output is an OpenGL delegate node.

WARNING: This is an experimental API and subject to change.
*/
TfLiteStatus SetBufferHandle(int tensor_index, TfLiteBufferHandle buffer_handle, TfLiteDelegate *delegate)
  2. Just set the allocation type for various input/output tensors - this is probably the path for us to take:
  // Assigns (or reassigns) a custom memory allocation for the given tensor.
  // `flags` is a bitmask, see TfLiteCustomAllocationFlags.
  // The runtime does NOT take ownership of the underlying memory.
  //
  // NOTE: User needs to call AllocateTensors() after this. In case of input
  // resizing, buffers will be checked for required data size during
  // AllocateTensors().
  //
  // Parameters should satisfy the following conditions:
  // 1. tensor->allocation_type == kTfLiteArenaRw or kTfLiteArenaRwPersistent
  //    In general, this is true for I/O tensors & variable tensors.
  // 2. allocation->data has the appropriate permissions for runtime access
  //    (Read-only for inputs, Read-Write for others), and outlives Interpreter.
  // 3. allocation->bytes >= tensor->bytes.
  //    This condition is checked again if any tensors are resized.
  // 4. allocation->data should be aligned to kDefaultTensorAlignment
  //    defined in lite/util.h. (Currently 64 bytes)
  //    This check is skipped if kTfLiteCustomAllocationFlagsSkipAlignCheck is
  //    set through `flags`.
  //
  // WARNING: This is an experimental interface that is subject to change.
  TfLiteStatus SetCustomAllocationForTensor(
      int tensor_index, const TfLiteCustomAllocation& allocation,
      int64_t flags = kTfLiteCustomAllocationFlagsNone);

While 1) provides the best hardware acceleration, I think we have to go with 2), simply because we don't currently support accelerated proc blocks that could make use of such a handle.

This decision essentially limits our product scope. People who want complete hardware acceleration (e.g. real-time video processing from multiple or high-resolution cameras) will probably have to implement their own custom solution instead of using Rune.

As for 2): we shouldn't pass it dummy tensors anymore. We should allocate the tensors we want to use for the rest of the inference up front. I will expose a size_t tensorBufferAlignment() const function so the buffers can be allocated aligned to that value.
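For illustration, a minimal sketch of what the Rune-side allocation could look like, assuming the alignment value comes from the proposed tensorBufferAlignment() accessor and that the buffer has to outlive the interpreter it is registered with (names here are illustrative, not the real runtime code):

use std::alloc::{alloc_zeroed, dealloc, Layout};

/// A byte buffer that satisfies the conditions listed above: aligned to the
/// value TensorFlow Lite expects (currently 64 bytes) and kept alive for as
/// long as the interpreter that uses it.
struct AlignedTensorBuffer {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedTensorBuffer {
    /// `alignment` would come from the proposed tensorBufferAlignment()
    /// accessor; `len` must be at least the tensor's byte size.
    fn new(len: usize, alignment: usize) -> Self {
        assert!(len > 0, "tensor buffers must not be empty");
        let layout = Layout::from_size_align(len, alignment)
            .expect("invalid size/alignment");
        // SAFETY: `layout` has a non-zero size (asserted above).
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "tensor buffer allocation failed");
        AlignedTensorBuffer { ptr, layout }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: `ptr` points to `layout.size()` bytes owned by `self`.
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedTensorBuffer {
    fn drop(&mut self) {
        // SAFETY: `ptr` was allocated with exactly this layout.
        unsafe { dealloc(self.ptr, self.layout) }
    }
}

The slice returned by as_mut_slice() is what would then be handed to TensorFlow Lite through SetCustomAllocationForTensor() on the C++ side.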

We probably have to change the function signature for infer() too then?

saidinesh5 avatar Sep 07 '21 07:09 saidinesh5

As we've discussed on Slack, it sounds like the way TensorFlow Lite wants you to provide buffers goes directly against how WebAssembly likes its linear memory to be used.

In Rust terms, our rune_model_infer() host function is given access to a byte buffer that can only be used for the duration of the function call (e.g. because a later allocation in WebAssembly code may cause linear memory to grow/reallocate).

fn rune_model_infer(input_tensor_in_wasm_linear_memory: &[u8], ...) {
  ...
}
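For context, this is roughly what that constraint looks like in a wasmtime-style embedding (a sketch; the state type and names are illustrative, not the actual Rune runtime code):

use wasmtime::{Caller, Extern};

// The slice obtained here borrows from `caller`, so it is only valid for the
// duration of this host call: a later allocation in the guest can grow (and
// therefore move) linear memory.
fn rune_model_infer(mut caller: Caller<'_, ()>, ptr: u32, len: u32) {
    let memory = match caller.get_export("memory") {
        Some(Extern::Memory(memory)) => memory,
        _ => panic!("the guest must export its linear memory"),
    };

    let data = memory.data(&caller);
    let input: &[u8] = &data[ptr as usize..ptr as usize + len as usize];

    // `input` has to be consumed (or copied) before this function returns;
    // stashing the reference in a long-lived inference context is exactly
    // what the borrow checker rules out below.
    let _ = input;
}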

However, in order to do zero-copy, the inference context wants to hold onto the same buffer for its entire lifetime.

struct InferenceContext<'buffer> {
  input_tensor: &'buffer [u8],
}

This poses a bit of an issue: we'd really like to reuse the same InferenceContext across multiple rune_model_infer() calls, since that avoids the expensive process of initializing an inference context and setting up hardware, but the borrow checker would reject that ("Error: the 'buffer lifetime cannot outlive the lifetime '_ as defined on the method body").
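For contrast, a reusable context that stays within safe Rust ends up owning its input storage and copying into it on every call, which is exactly the per-inference copy this issue is trying to remove (a sketch of the trade-off, not actual Rune code):

struct OwnedInferenceContext {
    // Owned storage can safely live across calls, but it forces a copy from
    // WebAssembly linear memory on every inference.
    input_tensor: Vec<u8>,
}

impl OwnedInferenceContext {
    fn infer(&mut self, input_from_wasm_linear_memory: &[u8]) {
        // The per-call copy we would like to get rid of.
        self.input_tensor.clear();
        self.input_tensor
            .extend_from_slice(input_from_wasm_linear_memory);

        // ... run the actual inference against self.input_tensor ...
    }
}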

You also aren't really allowed to update the tensor buffers immediately before each infer() call because the pointer may have already been captured by a delegate and there is no way to let the delegate know about the change.

Michael-F-Bryan avatar Sep 07 '21 09:09 Michael-F-Bryan

@saidinesh5 would this be covered by https://github.com/tensorflow/tensorflow/issues/46766?

In particular this bit:

The tensor input and output C APIs should have options to use input tensors in-place, from binary blobs, from either CPU/system memory or device (e.g. GPU) memory. This "zero-copy" API would remove any copy overhead and any overhead related to protobuf creation.

Michael-F-Bryan avatar Sep 07 '21 10:09 Michael-F-Bryan