streams icon indicating copy to clipboard operation
streams copied to clipboard

Could a byte stream read directly into and out of a SAB (w/o transfer)?

Open lukewagner opened this issue 8 years ago • 13 comments
trafficstars

It would be really great if wasm could read from a stream directly into its memory. The backing ArrayBuffer of a WebAssembly.Memory cannot be detached (nor would you want to since it holds the general program state needed for execution in the meantime) so what would be great is if the caller of a read or write could pass a typed array view of a buffer that is not detached. For a regular ArrayBuffer, this would mean some additional copying to avoid races. However, if the buffer is a SharedArrayBuffer, then perhaps it would be fine to expose the races (and no worse than anything else with SAB).

As an additional benefit, I think the implementation of "read into a view" from a file-backed stream could use mmap to make a copy-on-write mapping of the file directly into the target memory.

(cc @flagxor)

lukewagner avatar Jul 24 '17 21:07 lukewagner

There was some discussion of SAB with streams previously here: https://github.com/whatwg/streams/issues/495#issuecomment-284083329

It seems likely that we will add a way to use a SAB without transferring. Unfortunately it doesn't look like we will get the nice memory protection semantics that would let us avoid data races. I'm not concerned about exposing data races to wasm, since C++ programmers already live in a world of races. I feel it's unfortunate to expose it to Javascript.

ricea avatar Jul 25 '17 11:07 ricea

Ah, I see, so should I keep this issue open for this specific request or is it a dup of #495?

lukewagner avatar Jul 25 '17 16:07 lukewagner

I think this is useful as a separate issue. I've marked it with Milestone V2 to reflect that it's not something we'll be changing immediately.

ricea avatar Jul 26 '17 06:07 ricea

I think we should change this sooner rather than later. In Blink we're behind on implementing byte streams, but others have done so, and should be able to get these benefits---especially if we have interested consumers in the wasm space. Plus, I believe right now the spec is incoherent, as it tries to transfer SharedArrayBuffers, which causes a spec-level assert failure.

I do think it's sad that this exposes data races, but I also agree that SharedArrayBuffers in general broke that seal, and we should integrate with them rather than wishing they didn't do that.

I'll try to work on this when I get done with TC39 and some other work items...

domenic avatar Jul 26 '17 18:07 domenic

If in future we'd like TextEncoderStream's readable side to be a byte stream then we have a conflict here. It's undesirable for user code to be able to view the encode in progress, because it may be using vector ops or other things that are browser-specific. There are probably going to be other cases of platform streams where we'd like to prevent intermediate values from being visible.

I think all solutions involve adding an extra copy when the output is a SharedArrayBuffer. The question is whether this copy becomes part of the ReadableStream machinery that is specified by a flag, or whether the underlying source has to detect that controller.byobRequest.view is an SAB and do the copy itself.

ricea avatar Apr 23 '18 03:04 ricea

Talking to @vasilvv today in the context of WebTransport using whatwg streams and how that kind of use case could perform optimally when used with wasm: in addition to the advantages identified above, the new API could potentially use a callback design (instead of promises) where the SAB-view and callback were registered once and then reused throughout the duration of the stream, so that there was, by design, without engine heroics, zero garbage generated per packet/chunk.

lukewagner avatar Jun 28 '19 19:06 lukewagner

I don't think we're likely to add callback-based APIs. Eliminating the promise from the external API wouldn't eliminate the promise returned from pull(), nor the other objects that are created internally. The objective of "zero garbage" isn't achievable without engine heroics anyway, so I don't see a compelling reason to add APIs that don't align with the rest of the standard.

ricea avatar Jul 02 '19 05:07 ricea

From what I've heard from @vasilvv, there's a lot of overhead with the current approach, enough to be problematic for APIs considering using them, so perhaps more cross-cutting changes could be justified with the goal of not creating garbage on each chunk, given that performance is the point of the lower-level API.

lukewagner avatar Jul 02 '19 15:07 lukewagner

I think the root cause here is not that promises are slow per se, but rather that there are too many of them. If we switch to callbacks, we will probably just run into the problem of the callback being called too often.

I am actually not sure I understand the proposal here. I can see two things we can do with SABs:

  1. Let the application provide the browser with its SAB, and let the browser write into said SAB. This is roughly like BYOB in terms of performance, but can be used without detaching the buffer.
  2. Let the browser return views into some shared buffer as the output of the stream. This would enable potential zero-copy access to stream data, since the views could point to whatever internal structure the browser uses to store that data in first place.

The description in the issue seems to indicate that (1) is proposed, but the mmap idea seems to imply (2), since I don't think that would work with (1).

vasilvv avatar Jul 02 '19 17:07 vasilvv

If we switch to callbacks, we will probably just run into the problem of the callback being called too often.

FWIW, I did some prototyping to try to benchmark this before streams were standardized:

https://blog.wanderview.com/blog/2015/06/19/evaluating-streams-api-async-read/

I'm sure there are a lot of problems with what I did, but I think the conclusion was that promises were no worse than callbacks in the end. So I suspect you are correct better chunking is the answer.

wanderview avatar Jul 08 '19 15:07 wanderview

I am actually not sure I understand the proposal here. I can see two things we can do with SABs:

I'm not sure those two are in conflict. (1) is what the proposal in this thread is, in that it has observable effects on the developer experience of using streams. (Object identity gets preserved, which is easier to work with than having to constantly switch what ArrayBuffer you're operating on, even if the backing memory stays the same.) Whereas (2) seems like something an implementation could do behind the scenes with no observable effects, as long as it has the ability to create JavaScript SharedArrayBuffer objects that are backed by existing chunks of memory.

Indeed, I think (2) is maybe doable even with the current spec, with just normal ArrayBuffers?

The only tricky thing I can imagine is that the contents of the AB/SAB need to only visibly change to JavaScript at well-defined points (essentially, after read(view) calls). That still seems doable with a sufficiently-advanced ArrayBuffer <-> backing memory mapping, but we're getting in to territory I'm not well versed in. Maybe (2) is about loosening that invariant?

domenic avatar Jul 08 '19 16:07 domenic

I've been thinking about this a bit more, and I have an idea. I propose adding an extra option to getReader(), unsafe. This would only be used with mode: 'byob'. Setting unsafe: true would disable the buffer detach behavior, regardless of whether it is shared or not.

The reason for disabling detaching the buffer even when it is not shared is for the convenience of wasm programs which would like to be able to use part of their memory as a buffer without losing access to the rest of it.

controller.byobRequest would gain an extra attribute, also called unsafe, which reflects which kind of reader this buffer was supplied by. This can be used by the underlying source to add an extra memory copy for safety if it does something more complex than a simple copy into the destination.

JavaScript underlying sources mostly won't need to pay attention to the unsafe attribute because we assume they trust the other JavaScript or wasm running on the same page. But it may be useful in environments where there's only "partial" trust, for example when authoring a third-party library.

For underlying sources implemented as part of the platform it is far more serious. When implemented in C++, we don't even know beyond doubt that we can safely memcpy() into a hostile SharedArrayBuffer, as the language provides no guarantees. We just have to assume it is safe because otherwise we wouldn't be able to do anything at all. What I'm thinking of doing in Chromium is making the C++ API do a copy by default when unsafe is true for the target buffer, and then require the underlying source implementation to explicitly opt-out if all it's doing is a memcpy() anyway.

If we do this we should also rethink the enqueue algorithm called by other standards to be safe-by-default.

ricea avatar Mar 27 '23 07:03 ricea

Hmm. I think the idea of SharedArrayBuffer was to use the typesystem to encode this sort of guarantee. I.e., if you have a normal ArrayBuffer, you know you're safe; if you have a SharedArrayBuffer, you're unsafe.

What motivates the desire to let the safety vary independently of the type, for this particular web platform feature? Note that wasm memory can be made shraed using shared: true in the WebAssembly.Memory constructor, and I suspect most wasm thread-using programs already turn that on.

domenic avatar Mar 27 '23 08:03 domenic