
Expected to be used for large sizes?

sunfishcode opened this issue 8 years ago • 14 comments

The tracking issue for this feature says

We expect that WebAssembly producers will use these operations when the region size is known to be large, and will use loads/stores otherwise.

I don't see this mentioned in the Overview.md. Is this still an expectation?

sunfishcode avatar Sep 08 '17 22:09 sunfishcode

My thought is that even if we tell folks to use it for large regions, they'll use it for small ones too, so we'll have to handle that anyway. I think @lukewagner originally suggested that the size have page units to prevent that. Is it worth it though? What's the cost to the VM to have to handle small regions?

binji avatar Sep 08 '17 23:09 binji

The benefit I see for clamping to page sizes is that we remove any expectation that the wasm engine might optimize move_memory/set_memory by doing either of:

  • using constant-propagation to see if the size is constant and, if so, inlining something fast
  • some sort of IC to make tiny cases super-fast (i.e., not calling out to libc)

which lets engines compile move_memory to a call to libc memmove and be done with it.

lukewagner avatar Sep 08 '17 23:09 lukewagner

Wouldn't that remove the binary size saving?

jfbastien avatar Sep 08 '17 23:09 jfbastien

That's an interesting point, but I wasn't aware that this feature was expected to reduce binary sizes by any significant amount in any case. It would certainly change the nature of the feature (and what engines needed to do) if move_memory was used aggressively for this purpose.

lukewagner avatar Sep 11 '17 18:09 lukewagner

I think clamping to page sizes would cripple this feature and result in a proliferation of user code that tries to divide original requests into a page-multiple-sized chunk followed by cleanup code. That's a classic abstraction inversion.

titzer avatar Sep 25 '17 12:09 titzer
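To make the "abstraction inversion" concrete, here is a minimal Python sketch of the wrapper every toolchain would have to emit if sizes were clamped to page units; `bulk_copy_pages` is a hypothetical stand-in for a page-quantum bulk operation, not a real instruction:

```python
WASM_PAGE = 64 * 1024  # 64 KiB wasm page

def bulk_copy_pages(dst, src, nbytes):
    """Hypothetical page-quantum bulk op: nbytes must be a page multiple."""
    assert nbytes % WASM_PAGE == 0
    dst[:nbytes] = src[:nbytes]

def user_memcpy(dst, src, n):
    """What user code would have to do for an arbitrary n-byte copy:
    a page-multiple chunk via the bulk op, then byte-wise cleanup."""
    bulk = (n // WASM_PAGE) * WASM_PAGE
    if bulk:
        bulk_copy_pages(dst, src, bulk)
    for i in range(bulk, n):  # cleanup loop: up to WASM_PAGE - 1 bytes
        dst[i] = src[i]
```

Note that the cleanup loop runs for every copy smaller than a page, which is exactly the case the feature was meant to make fast.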

Why would there not be a single implementation of memcpy in libc? In general, we haven't used "toolchains will have to implement" as an argument to include things in wasm (e.g., trig).

lukewagner avatar Sep 25 '17 16:09 lukewagner

Just coming back to this...

It seems like the wasm page size is a bit too large a granularity -- the microbenchmark shows benefits for sizes < 64K.

In general, we haven't used "toolchains will have to implement" as an argument to include things in wasm (e.g., trig).

True, though we also seem to have assumed a mostly symbiotic relationship with producers, where they'll produce good code so the VM doesn't have to perform complex optimizations. I think it's reasonable to assume the same here -- if we give guidelines for the producer (TBD 😉) then can the VM assume that it isn't going to have to optimize a constant 4 byte memcpy that should have just been a load/store pair?

binji avatar Oct 26 '17 23:10 binji
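The producer-side guideline binji alludes to could look something like the following sketch; the threshold and the IR tuples are invented for illustration, not taken from any toolchain:

```python
SMALL_COPY_LIMIT = 16  # hypothetical toolchain threshold, in bytes

def lower_memcpy(size):
    """Lower a memcpy in a hypothetical producer IR.
    `size` is an int for a known-constant size, or None if unknown."""
    if size is not None and size <= SMALL_COPY_LIMIT and size % 4 == 0:
        # e.g. a constant 4-byte copy becomes a single i32 load/store pair
        return [("i32.load/store", off) for off in range(0, size, 4)]
    # everything else stays a bulk instruction for the VM to handle
    return [("memory.copy", size)]
```

Under this division of labor, the VM never sees a constant 4-byte `memory.copy` from a well-behaved producer, so it need not optimize that case.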

if we give guidelines for the producer (TBD 😉) then can the VM assume that it isn't going to have to optimize a constant 4 byte memcpy that should have just been a load/store pair?

I'd hope so.

jfbastien avatar Oct 27 '17 04:10 jfbastien

I think clamping to page sizes would cripple this feature and result in a proliferation of user code that tries to divide original requests into a page-multiple-sized chunk followed by cleanup code.

I agree. I think it would be safer to leave it at the byte granularity, assume that these operations will get used at both big and small sizes, and leave it to implementations to decide how (if at all) they want to optimise the small-size cases.

julian-seward1 avatar Mar 21 '18 19:03 julian-seward1

But in practice, if wasm engines don't all reliably optimize the small-constant-size case (which, from what I understand, is very commonly used), there will be a significant perf cliff, which will force the toolchain (to provide reliable perf to its users) to do the lowering to loads anyway. With page-size quanta, the responsibility for who does what is clear.

I don't see how this cripples the feature since this is an advanced optimization emitted only in special cases by compilers, not something anyone writes by hand in the source language.

lukewagner avatar Mar 21 '18 19:03 lukewagner

Well, it will force producers to emit sequences that mix calls and inline code, which will be verbose and also inherently not optimised for more than one target processor (how do you make the unroll vs vectorise vs unroll-and-vectorise vs call-out tradeoffs if you don't know what you're running on?). I'm also not convinced that the small-constant-size case is uncommon: I frequently see many small memcpy/memmove calls when profiling natively compiled Rust.

I do understand what you're getting at, though. Would it be feasible and/or helpful to add to the spec an advisory section that states a minimum set of copy/fill cases that an optimising Wasm implementation can reasonably be expected to do well and inline? That is to say, add some kind of quasi performance guarantee to the contract?

julian-seward1 avatar Mar 21 '18 20:03 julian-seward1

Yeah, I suppose a non-normative note that states the contract, even if informally, could effectively make it the browser's "fault" if they didn't optimize appropriately, so producers could feel confident in always emitting mem.copy/mem.set.

Also, thinking more about what a producer would need to do to optimally use a page-quanta mem.copy/mem.set, it does seem suboptimal. In particular, if we use the existing wasm 64kb page size, then that means up to (128kb - 2) bytes of suboptimal copying (possibly significantly suboptimal if the producer doesn't do the extra work to use 64-bit copies (and, later, 128-bit)). If we use a non-wasm-page-size (<64kb) quanta, it'll feel rather arbitrary and probably look increasingly silly as CPUs evolve. Also, a fully-optimized memcpy wasm impl might cost a few hundred bytes which adds to the fixed runtime overhead which we'd generally like to avoid for webby use cases.

I'm fine with byte-granularity, then.

lukewagner avatar Mar 21 '18 22:03 lukewagner

@lukewagner would you rather also have an alignment hint, so you can do fancy stuff on top?

jfbastien avatar Mar 21 '18 23:03 jfbastien

If you're talking about "page-aligned" hint, I don't think it would help (the case browsers would have to specially-optimize is when the size was small and constant; for all others we'd just call out to the libc memmove).

Or perhaps you mean 1/2/4/8/16-byte alignment? When coupled with a constant size, such that the engine is inlining a straight sequence of load/stores, I guess I could see this being useful for the same reason that the alignment hint is present on scalar loads/stores, but that is a separate point from the one I made/rescinded above.

lukewagner avatar Mar 22 '18 00:03 lukewagner
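To illustrate the second reading (a 1/2/4/8/16-byte alignment hint coupled with a constant size), here is a sketch of how an engine might pick access widths for an inlined load/store sequence; the function and its policy are illustrative only, not from any spec or engine:

```python
def inline_copy_plan(size, align):
    """Given a constant copy size and a power-of-two alignment hint,
    return (offset, width) pairs for an inlined load/store sequence,
    using the widest access the hint and the remainder allow."""
    plan, off = [], 0
    while off < size:
        width = min(align, size - off)
        while width & (width - 1):  # round down to a power of two
            width &= width - 1
        plan.append((off, width))
        off += width
    return plan
```

With no hint the engine would have to assume 1-byte accesses (or emit alignment checks), which is the same reason the hint exists on scalar loads/stores.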