WASM - Faster Native Uint8List
I'm working on some improvements to the Flutter engine. The browser has an API that takes a clamped array and creates image data from the array:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Uint8ClampedArray
https://developer.mozilla.org/en-US/docs/Web/API/ImageData
This pull request uses this API to reduce heap usage in CanvasKit and SkWasm:
https://github.com/flutter/engine/pull/54486
While this works pretty well in Dart2JS because the clamped array can come straight from .toJS, in WASM there is a lot of copying, which makes trying to load 4k image data unreasonably slow. It's possible to create a native array first, and then use toDart to get an array that doesn't need to be copied, but that would require knowledge of the platform you are running on in order to create the appropriate type of array.
Ideally this could be done with zero copies all of the time, but I think there would need to be something like a Uint8List.native / Uint8ClampedList.native constructor so there is some way to ensure that we don't get an array that is going to be copied back and forth every time we use .toJS.
While ideally there would be a way to avoid the copies entirely, even copying and initializing very large lists seems unreasonably slow with .toJS; there must be a lot of room for improvement.
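Concretely, the two allocation paths described above look roughly like this (a sketch using dart:js_interop; the exact cost model is an assumption based on the behavior described in this thread):

```dart
import 'dart:js_interop';
import 'dart:typed_data';

void main() {
  // Slow path on dart2wasm: the list is a WasmGC array, so .toJS must copy.
  final wasmBacked = Uint8List(4000 * 4000 * 4);
  final slow = wasmBacked.toJS; // copies on dart2wasm, wraps in place on dart2js

  // Faster path: allocate on the JS side first, then view it from Dart.
  final jsBacked = JSUint8Array.withLength(4000 * 4000 * 4).toDart;
  final fast = jsBacked.toJS; // should not need to copy on either compiler
  print('${slow.toDart.length} ${fast.toDart.length}');
}
```

The problem described above is that platform-independent code has no way to know it should take the second path, since choosing it requires knowing which compiler you are running under.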
Summary: The issue is that using Uint8List.toJS in WASM to create a clamped array for image data results in excessive copying, leading to slow performance when loading large images. The user proposes adding a native constructor for Uint8List or Uint8ClampedList to avoid unnecessary copying and improve performance.
While ideally there would be a way to avoid the copies entirely, even copying and initializing very large lists seems unreasonably slow with .toJS; there must be a lot of room for improvement.
Mixed-mode code is currently slow due to crossing the WasmGC<->JS boundary (e.g. a loop over WasmGC arrays that loads each byte and stores it into a JS typed data array). There are upcoming changes in V8/Chrome that will improve this.
This pull request uses this API to reduce heap usage in CanvasKit and SkWasm:
https://github.com/flutter/engine/pull/54486
This is modifying flutter engine code which knows it runs in the browser. It also knows it wants JS typed arrays.
Why not let the code allocate JS typed arrays and pass them around?
Then there should be no copies in <jsArray>.toDart and no copies when going to JS again via <jsArray>.toDart.toJS.
@mkustermann the issue is that the API itself exposes a Uint8List as the parameter, and there isn't a simple way to request the right type of Uint8List be created from the callers perspective.
@override
void decodeImageFromPixels(Uint8List pixels, int width, int height,
    ui.PixelFormat format, ui.ImageDecoderCallback callback,
    {int? rowBytes,
    int? targetWidth,
    int? targetHeight,
    bool allowUpscaling = true}) {
https://api.flutter.dev/flutter/dart-ui/decodeImageFromPixels.html
I did suggest that changing the API so that it was itself responsible for allocating an ImageData structure (or something similar) would be a better way to ensure the right kind of list is created, but that would be a breaking API change. It would be more desirable, I think, if we could guarantee the fast path, or at least make it easier for users to go down the right path. Right now every caller would have to use some kind of conditional import just to create the right kind of list to pass in. It would be nice if the API could warn the user when they pass in the wrong kind of list and activate a slow path, and if the user had a way to request that the right kind be created without conditional imports and an understanding of the internals of js_interop.
@mkustermann the issue is that the API itself exposes a Uint8List as the parameter, and there isn't a simple way to request the right type of Uint8List be created from the callers perspective.
Yes, it seems that this is an unfortunate API choice. It may be that
- on the VM it would prefer to have malloc()ed data and wants to require a Pointer<Uint8>
- on the web it would prefer to have JS typed arrays and therefore require a JSUint8Array
=> We have those types and flutter APIs could require them.
To facilitate platform-independent APIs one could make a
// Which platform-independent code can pass around (e.g. in package:typed_data or dart:typed_data)
abstract class NativeBuffer {
  Uint8List get bytesView;
}
// Code that creates web buffers or consumes them uses (e.g. in package:web or dart:js_interop)
class JSNativeBuffer extends NativeBuffer {
  final JSUint8Array jsArray;
  JSNativeBuffer(this.jsArray);
  factory JSNativeBuffer.create(int length) =>
      JSNativeBuffer(JSUint8Array.withLength(length));
  Uint8List get bytesView => jsArray.toDart;
}
// Code that creates malloc'ed buffers or consumes them uses (e.g. in package:ffi or dart:ffi)
class PointerNativeBuffer extends NativeBuffer {
  final Pointer<Uint8> pointer;
  final int length;
  PointerNativeBuffer(this.pointer, this.length);
  factory PointerNativeBuffer.create(int length) =>
      PointerNativeBuffer(malloc<Uint8>(length), length); // <-- TODO: attach finalizer
  Uint8List get bytesView => pointer.asTypedList(length);
}
=> It would make a lot of code platform-independent if it only passes NativeBuffers around.
=> Code that creates them would still be platform-specific (and may do so via conditional imports)
=> Code that consumes them would still be platform-specific (and may do so via conditional imports)
=> Code can test for specific subtypes, e.g. pixels is JSNativeBuffer (**)
Though for the particular API in question here, decodeImageFromPixels: it would probably have a parameter of type NativeBuffer and therefore wouldn't be guaranteed to always get a JSNativeBuffer on the web (e.g. if we added support for FFI on the web, users could pass both PointerNativeBuffer and JSNativeBuffer). But it could document that on web it has to be one and on the VM another, and assert() this in the web-specific and VM-specific flutter engines.
(**) We currently don't have APIs that allow one to test whether a Uint8List is backed by a JS typed array or an FFI Pointer.
I'm not sure if we should have a NativeBuffer.create(int length) that magically constructs JS typed data on the web and malloc'ed typed data on the VM - e.g. due to the above issue that the web may actually support FFI at some point, at which point this becomes ambiguous.
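To illustrate how the proposed NativeBuffer abstraction might be consumed, here is a sketch of platform-independent code using the hypothetical classes from the snippet above (none of these types exist today):

```dart
import 'dart:typed_data';

// Platform-independent code only sees the abstract NativeBuffer from the
// sketch above; it never needs conditional imports or js_interop knowledge.
void fillOpaqueRed(NativeBuffer buffer) {
  final bytes = buffer.bytesView;
  for (var i = 0; i + 3 < bytes.length; i += 4) {
    bytes[i] = 0xFF;     // R
    bytes[i + 3] = 0xFF; // A
  }
}

// Only the call sites that create buffers are platform-specific
// (selected via conditional imports):
//   web: fillOpaqueRed(JSNativeBuffer.create(width * height * 4));
//   VM:  fillOpaqueRed(PointerNativeBuffer.create(width * height * 4));
```

The design choice here is that allocation stays platform-specific while all the code in between stays portable, which matches the => points above.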
/cc @eyebrowsoffire is that something flutter team would want?
/cc @lrhn
Just to side-track ... If you want performance, never have a function both do computation and allocate a buffer for the result. Don't have allocation as a side-effect.
Instead pass the buffer in from the caller's side, or maybe a buffer factory function (which may be backed by a buffer pool, to not always do allocation). Always know who currently owns a buffer, don't let a function reuse a buffer that it was given earlier, always give it the buffer it needs for the current computation.
If the computation has a size that cannot be predicted ahead of doing it, provide a way to ask for chunks of results.
The C stdlib read function is a fine example: Takes a buffer (pointer + length) and returns how much of it was filled. Can be called again to get more data. Caller owns the buffer before calling and after return.
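The read-style contract described above could look like this in Dart (a sketch of the pattern, not an existing API; readInto and its signature are illustrative):

```dart
import 'dart:math' show min;
import 'dart:typed_data';

/// Copies up to `buffer.length` bytes from `source`, starting at `offset`,
/// into the caller-provided `buffer`, and returns how many bytes were
/// written. Mirrors C's read(fd, buf, len): the caller owns `buffer` before
/// and after the call and may reuse it, so no allocation happens per call.
int readInto(Uint8List buffer, Uint8List source, int offset) {
  final n = min(buffer.length, source.length - offset);
  if (n <= 0) return 0;
  buffer.setRange(0, n, source, offset);
  return n;
}

void main() {
  final source = Uint8List.fromList(List.generate(10, (i) => i));
  final buffer = Uint8List(4); // allocated once, reused for every chunk
  var offset = 0;
  var n = 0;
  while ((n = readInto(buffer, source, offset)) > 0) {
    offset += n; // process buffer[0..n) here
  }
}
```

Calling readInto in a loop with the same buffer retrieves the data in chunks without ever allocating inside the function, which is the ownership discipline the comment above argues for.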
Obviously we didn't follow that advice when writing the platform libraries. Something like utf8.encode allocates. Even when used for chunked conversion, it allocates new buffers all the time, with no way to say that a buffer is free to be reused, and no way to abstract over the allocation (you can't pass in a buffer pool or something). Sorry.
Other than that, is the issue here just that Wasm is inefficient at interop'ing between Dart typed-data and JS typed-data arrays? (Compiling to JS and using a Uint8List directly sounds fine, it's just a JS Uint8Array. Compiling to native and using a Uint8List directly is also fine. It's Wasm (on web) that has both available and needs to convert between them? Does it?)
Could there be a way on Wasm to ask for Uint8List to be backed by one or the other? (Probably hard, since that API would make no sense on any other platform. Could be in a dart:wasm library, which provides Wasm specific helpers for interoperating between Wasm native code and JS code, and which is not the same as js_interop?)
Introducing a NativeBuffer class which is not a ByteBuffer (except that it is exactly the same for anything but Wasm) seems like a weird abstraction that you won't know to use unless you have already hit the problem on Wasm.
Just like Int64, you wouldn't ever start using it unless you had already hit a problem with numbers on the web. (And you don't actually want it for much of anything except storing 64-bit IDs. The number operations are spurious.)
If we introduce a NativeBuffer, when should you use it?
Should you always use it instead of a ByteBuffer?
If so, why isn't it the implementation of ByteBuffer to begin with?
If not, what is the migration path if I start out using normal typed data and decide that I want native typed data instead?
As written above, if I have a NativeBuffer and do .bytesView.buffer, I have a ByteBuffer that should be backed by the same native bytes. Then we're back at using normal typed-data types with that external buffer, and I'd expect those operations to be as efficient as possible. That'll be your job, @mkustermann :wink:.
At that point, why does the NativeBuffer type exist at all? Could we just have external factory ByteBuffer.native(int size); which provides a ByteBuffer backed by "native" data that is guaranteed to be as efficient as possible as a typed-data ByteBuffer and efficiently communicated to "native" code through js_interop/ffi as appropriate?
@mkustermann but can NativeBuffer be used in dart2js? Part of the complication here is that dart2js and dart2wasm have a massive performance gap, with the WASM version just being insanely slow.
Uint8List original = Uint8List(4000 * 4000 * 4);
final copy = original.toJS;
The second line here takes over 900ms on my nice MacBook Pro when executing with flutter run --wasm, but nanoseconds with dart2js. It also takes nanoseconds if the Uint8List is created by calling toDart on a JS array rather than by using any of the constructors on Uint8List. Two things bother me here:
- Doesn't 900ms to memcpy typed array data seem like an insanely long time? A zero-copy path would definitely be desirable, but it seems like the copy shouldn't take 900ms either way.
- If we can get the fast path when the list is created appropriately, could there just be a way to create it appropriately in a platform-neutral way, and to assert that it was created as such? On non-WASM platforms it could just return the normal list; on WASM it could return one backed by native memory.
My concern here is that it's really easy to build footguns with large Uint8List arrays that tank WASM performance and make WASM way slower than the JS code. Is it desirable that these performance problems are so easy to create? The docs do talk about the dual nature of the way toJS/toDart work here, but it seems like it's asking for trouble, and there are going to be a lot of downstream performance problems any time someone needs to deal with large amounts of typed data in a WASM build. Most people aren't testing their packages with dart2wasm at the moment, and will be a bit shocked to find such massive performance gaps when trying to interface with js_interop without an easy way to write code that performs equally well in JS and WASM builds.
Is there a reason Uint8List can't perform equally well and avoid the copies by default or by request?
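The measurement above can be reproduced with a sketch like this (timings vary by machine; compare a flutter run --wasm build against a dart2js build):

```dart
import 'dart:js_interop';
import 'dart:typed_data';

void main() {
  final original = Uint8List(4000 * 4000 * 4);
  final sw = Stopwatch()..start();
  // Copies the whole 64 MB on dart2wasm; wraps the same buffer on dart2js.
  final copy = original.toJS;
  sw.stop();
  print('toJS of ${copy.toDart.length} bytes: ${sw.elapsedMilliseconds} ms');
}
```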
This issue is an ongoing one that has been discussed at length, and we still don't really have a solution. The core issue is that the Dart classes ByteData, ByteBuffer, and all the subclasses of TypedDataList (including Uint8List, for example) are basically designed around two assumptions:
- Random access to its elements is very fast
- It can be passed to host APIs
With the VM and/or AOT compilation, these feel like very safe assumptions. These objects are backed by a contiguous block of memory, so random access is very fast, and host APIs can simply take a pointer to that block of memory (or some subview of it). However, for Wasm/WasmGC, there is no single construct these can lower to that fulfills both assumptions. In JS backends, these are always backed by their JS counterparts (DataView, ArrayBuffer, or the TypedArray classes). In Wasm though, random access to these is extremely slow, since it requires calling out to JS from Wasm for every single access, and those calls cannot be inlined or optimized by the JS VM. On the other hand, we can use WasmGC arrays, which are actually extremely fast, but do not interop with host APIs at all. (There is also a secondary issue of WasmGC arrays being non-reinterpretable, so you can't allocate a u8 WasmGC array and reinterpret it as u32, which the ByteBuffer APIs allow you to do.)
As a result, we end up with some serious semantic mismatches between these Dart classes and our options for what to lower them to. In the fullness of time, I think we will need a more comprehensive solution in the Wasm spec. We currently don't even have an efficient way to copy data between WasmGC arrays and JS ArrayBuffer/TypedData/DataView objects.
@eyebrowsoffire if there is no efficient way to copy the WASM array, that does present a bit of a problem with jank. If we can't get out a pixel buffer in less than 900ms from toJS, then toJS isn't really ever a good option inside WASM for large amounts of data. Even on a small image we might drop frames just from the toJS call for the slow path. The API has an async callback, so the work could be chunked to prevent jank if it was the slow type of Uint8List incoming, but that would be super slow in the case that the source of the buffer was actually a JS API and we had the type of list in our hand that allowed us to avoid the copy.
Either way, to automatically ensure there's no jank, or to warn users when what they are doing will cause jank, I think we'd need a way to check what kind of list we have. Maybe something like a Uint8List.toJSWillCopy extension method in js_interop? That punts a bit on the fast copy path until we have one, but at least it allows the engine to ensure we don't hang the browser process on a toJS call.
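The suggestion is roughly this shape (toJSWillCopy is hypothetical; no such member exists in js_interop today, and the implementation strategy is an assumption):

```dart
import 'dart:typed_data';

// Hypothetical extension: a real implementation would check whether the list
// is backed by a JS typed array (fast, non-copying .toJS) or by a WasmGC
// array (copying .toJS). On non-wasm platforms it would always return false.
extension Uint8ListToJSInfo on Uint8List {
  external bool get toJSWillCopy;
}

// Hypothetical engine-side use: fall back to a chunked path instead of
// blocking the browser process on one huge copy.
//   if (pixels.toJSWillCopy) { /* chunked slow path */ }
//   else { /* zero-copy fast path */ }
```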
With the VM and/or AOT compilation, these feel like very safe assumptions.
This is not quite correct: the VM decides where to allocate the buffers for typed data from a new Uint8List(...) call. Those cannot always be passed to a C function that wants a uint8_t* pointer. We have some special cases that make this somewhat work: a) make an FFI leaf call, or b) call special C code, pass the object as a VM handle, and make the C code use Dart VM embedder API calls to acquire a range of bytes. BUT: both of these should only be used if the call is very fast, and neither can be used if one triggers object allocations, invokes Dart code, etc.
=> So from a principled standpoint: This issue exists on the VM just as it exists on JS. => The VM would benefit as well from letting users choose a "host-compatible" array implementation that can be passed with zero-cost to C.
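On the VM side, a host-compatible allocation of the kind described above is already expressible via dart:ffi and package:ffi (a sketch; the caller owns and must free the memory):

```dart
import 'dart:ffi';
import 'dart:typed_data';

import 'package:ffi/ffi.dart';

void main() {
  const length = 16;
  // C-heap allocation: this Pointer<Uint8> can be handed to C at zero cost.
  final Pointer<Uint8> ptr = malloc<Uint8>(length);
  // Zero-copy Uint8List view over the same bytes, usable from Dart.
  final Uint8List view = ptr.asTypedList(length);
  view[0] = 42; // visible to C code reading ptr[0]
  // The caller owns the buffer and must free it (or attach a finalizer).
  malloc.free(ptr);
}
```

This is the VM analogue of allocating a JSUint8Array first on the web: in both cases the buffer is created in "host" territory and Dart gets a view, rather than the other way around.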
In Wasm though, random access to these is extremely slow, since it requires calling out to JS from Wasm for every single access, and those calls cannot be inlined or optimized by the JS VM
Those accesses can be inlined and optimized by the JS VM, it just hasn't been done yet (or rather, the optimizations - at least in V8 - have been implemented, but they haven't been enabled by default yet). JS/WasmGC engines will eventually make accesses to JS typed data as well as JS strings fast. At that point we may decide to make dart2wasm use JS typed data just like dart2js (but I'm not fully convinced we should do that).
It can be passed to host APIs
I think it's more fair to say that JavaScript has designed its typed data APIs to be interoperable with the host: they have moved the backing store to malloc'ed memory, given them the ability to transfer/detach the backing buffers, and so on - they made these trade-offs at the cost of slower allocation and slower byte access. (Dart2js just happens to use them because it cannot implement typed data efficiently in any other way.)
I wouldn't say the typed data APIs have been designed with the purpose of passing them cost-free to host APIs: For example using async I/O via dart:io will also copy the buffers across the boundary in some situations - precisely because the typed data objects do not generally have the capability of being usable from host API side at zero-cost.
These two use cases are different and have different trade-offs:
- use typed data within dart (fast to allocate, fast access, need to copy across boundary) (VM & dart2wasm's choice)
- use typed data by talking to host environment / other languages (slow to allocate, slightly slower to access, zero cost across boundary) (dart2js choice) - of which there could be multiple (e.g. JS browser APIs or C code running in linear memory wasm)
Ideally we'd have different types to make it very clear on the type system level that one requires one or the other. But at a bare minimum it should be possible for developers to choose the right one to allocate and then get the right behavior. Possibly also the ability to test for them.
How useful would it be for flutter to be able to test whether a Uint8List is backed by a JS typed array or not?
If we can't get out a pixel buffer in less than 900ms from toJS, then toJS isn't really ever a good option inside WASM for large amounts of data.
It may not be clear from the name, but without a guarantee (e.g. from knowing how the list was allocated) one should assume that .toJS is a copying operation (not just for typed data, but also for Dart lists and other data structures).
We're going to have a look at exactly what code gets generated for this copying and see if we can make it faster. But generally: a) can image data always be allocated with host interop capability (i.e. in JS land)? b) do we really need to copy these buffers around (seems not very performant, even if the memory copy is fast)?
How useful would it be for flutter to be able to test whether a Uint8List is backed by a JS typed array or not?
I think in this case it would be very useful, as it would open up either the option of asserting when the slow path is being hit or having a dual path in the decoder to prevent jank.
a) can image data always be allocated with host interop capability (i.e. in JS land)?
In a lot of cases the image data does come from JS land to begin with, for instance a PDF viewer would likely use pdf.js or pdfium as the source for the pixel data in which case it would likely receive a pixel buffer from js_interop as a js array. There are definitely also cases where dart itself could be used to construct raw pixel buffers, such as a dart native image editor. If a user could request the array be allocated with host interop capability, then they could avoid the slow path sending it to the renderer.
b) do we really need to copy these buffers around (seems not very performant, even if the memory copy is fast)?
In the engine code path here it definitely doesn't need to be copied. The PR that spawned this issue is specifically aiming to improve heap usage by eliminating copies altogether. With dart2js it works very nicely, but the dart2wasm implementation hit some performance snags due to the inability to ensure a zero-copy path.
I think in this case it would be very useful, as it would open up either the option of asserting when the slow path is being hit or having a dual path in the decoder to prevent jank.
Flutter engine does have this ability actually. It should be allowed to import private dart:_* libraries and can test via
// typed_array_tests_wasm.dart
import 'dart:_js_types' show JSUInt8ArrayImpl;
bool isJSTypedArray(Uint8List bytes) => bytes is JSUInt8ArrayImpl;
// typed_array_tests_nonwasm.dart
// On dart2js every Uint8List is already backed by a JS typed array.
bool isJSTypedArray(Uint8List bytes) => true;
and use conditional imports to select wasm vs non-wasm:
import 'typed_array_tests_wasm.dart' if (dart.library.html) 'typed_array_tests_nonwasm.dart';
decodeImageFromPixels(Uint8List pixels, ...) {
  if (!isJSTypedArray(pixels)) {
    throw 'Require Uint8List be backed by JS typed array. Please use ... to allocate those typed data arrays';
  }
  ...
}
If a user could request the array be allocated with host interop capability, then they could avoid the slow path sending it to the renderer.
Best would be if the flutter API required typed data with host interop capability, and users of the API had to ensure that's what they pass (which they can, either by allocating in JS via JSUint8Array.withLength(...).toDart, or by explicitly copying, e.g. .toJS.toDart, with the knowledge that the .toJS may be slow).
Uint8List original = Uint8List(4000 * 4000 * 4); final copy = original.toJS;
The second line here takes over 900ms on my nice MacBook Pro when executing with flutter run --wasm,
Are these buffer sizes typical? 4000 * 4000 * 4 is 64 MB! That's a lot of memory to be copying around per frame - even with memcpy that's several milliseconds.
I've recently added benchmarks for the speed of strings/typed-data crossing the boundary (in dart-lang/sdk@4908814f3a9375a02f10a5d268afff879c3d16e3) and made some optimizations (amongst them dart-lang/sdk@09a7e4f52b568f977f1509eb3c02de18caf2571e), which got us to about 40 MB/s on my older machine.
Though this is still far (maybe 10-20x) from memcopy. We'll have a closer look at what machine code it currently generates and see what can be done.
Flutter engine does have this ability actually. It should be allowed to import private dart:_* libraries and can test via
This is great. I'll give this a shot. I think maybe it would be nice if libraries outside the engine could also do this.
Best would be if the flutter API required typed data with host interop capability, and users of the API had to ensure that's what they pass (which they can, either by allocating in JS via JSUint8Array.withLength(...).toDart, or by explicitly copying, e.g. .toJS.toDart, with the knowledge that the .toJS may be slow).
I think this is definitely a reasonable starting point.
Are these buffer sizes typical? 4000 * 4000 * 4 is 64 MB! That's a lot of memory to be copying around per frame - even with memcpy that's several milliseconds.
When dealing with raw image data, it's pretty easy to get large arrays like this. Browser game engines often run into similar performance challenges when transferring texture data. An iPhone captures photos at 3024 x 4032, which is not too far off if you put that into a raw pixel buffer. In cases like PDF rendering, which is what uncovered some of the memory management problems that led to the PR, you'll end up with far more than 4000 x 4000 x 4 worth of pixel data that needs to be bound to textures. A PDF viewer can easily generate enough pixel data to run Skia's WASM module out of memory (2 GB). These libraries can definitely be made more efficient, but 64 MB of pixel data is pretty small compared to the total amount of image data that will be passed over the course of viewing a multi-page PDF.
That said, it's typically not 4K per frame. Loading an iPhone photo into a texture is a one-time op; with a PDF it happens when pages come into view, and there's a balance between rendering time and resolution.
One point of discussion that should probably be considered here: better than using data buffers at all is actually to instantiate images directly from their URL using this dart:ui_web API: https://api.flutter.dev/flutter/dart-ui_web/createImageCodecFromUrl.html
Unfortunately, the framework doesn't take advantage of this API and we've discussed changing that. This is much more efficient in general, since we can rely on the browser to do the smart thing in terms of chunking the download, using hardware image decoding, etc. In some cases, the browser can avoid the raw pixels being on the CPU side altogether.
@eyebrowsoffire I think some of that work did recently land:
https://github.com/flutter/engine/commit/6312dfc492cd6e0eb0768b164912d97451029c3a
Though that only works if it's a browser-supported image type. In the PDF case, or for something like HEIC (the iPhone codec), you'd still likely be calling into some other library to get the raw pixel data first and then heading through this raw pixel buffer API to bind it to a texture.
The issue is that the image widgets in the framework do not use the URL-based codec API. So even if that path is optimized, the common use case doesn't use that path.
And you're right that it doesn't universally solve this issue. But it should cover a number of common cases for users at least.
The second line here takes over 900ms on my nice MacBook Pro when executing with flutter run --wasm
I've looked more into the code V8 generates and optimized this in a way that allows V8 to do a better job.
@jezell this operation should now be around 10x faster with the newest flutter/flutter. It's still not at memcpy() speed - but we'll continue to look into it and push on the things we'd need to get there.