
Replacing `memset` and `memcpy` calls with `memory.fill` and `memory.copy`?

Open torokati44 opened this issue 3 years ago • 13 comments

If I were to try working on a wasm-opt pass doing what's written in the title, would it be accepted? Gated behind the bulk-memory extension/feature of course.

The rationale I have for this is that when compiling Rust, even with the bulk-memory target feature (codegen option) enabled, a lot of naive (slow) memcpy and memset calls are left in the result, because they are called from std/core library functions, such as __rust_alloc_zeroed and __rust_realloc. These can be avoided by passing the build-std option to Cargo, but that is still an unstable, nightly-only feature.

Of course, this sets some assumptions about what a function that happens to be named "memset" or "memcpy" is supposed to be doing, but in a lot of compiler toolchains, these are pretty much already handled as intrinsics anyway.

One alternative would be to detect some forms of memory copying or filling loops in the control flow graph instead, and only replace those with intrinsics.
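To illustrate, these are roughly the loop shapes (written here as Rust analogues, with hypothetical names) that such a CFG-matching pass would need to recognize in the compiled Wasm — a byte-wise fill and a byte-wise copy:

```rust
// Hypothetical Rust analogues of the loop shapes a CFG-matching pass
// would look for; the actual matching would happen on Wasm IR, not Rust.

/// Byte-wise fill loop: the shape that `memory.fill` could replace.
fn naive_fill(dst: &mut [u8], value: u8) {
    for b in dst.iter_mut() {
        *b = value; // one store per byte, per iteration
    }
}

/// Byte-wise copy loop: the shape that `memory.copy` could replace.
fn naive_copy(dst: &mut [u8], src: &[u8]) {
    for i in 0..src.len() {
        dst[i] = src[i]; // one load + one store per byte
    }
}

fn main() {
    let mut buf = vec![0u8; 4];
    naive_fill(&mut buf, 7);
    assert_eq!(buf, vec![7u8; 4]);
    let src = vec![1u8, 2, 3, 4];
    naive_copy(&mut buf, &src);
    assert_eq!(buf, src);
    println!("ok");
}
```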

What do you think?

torokati44 avatar Dec 20 '21 11:12 torokati44

This is an interesting idea!

The rationale I have for this is that when compiling Rust, even with the bulk-memory target feature (codegen option) enabled, a lot of naive (slow) memcpy and memset calls are left in the result...

I would definitely want to see some measurements showing that using memory.copy and memory.fill would be faster than these Wasm implementations. I'm not sure that the difference would be that large, but it would be interesting to find out.

Of course, this sets some assumptions about what a function that happens to be named "memset" or "memcpy" is supposed to be doing, but in a lot of compiler toolchains, these are pretty much already handled as intrinsics anyway.

Since Binaryen generally avoids making assumptions about function naming conventions like these and in general might not get to see the function names, I don't think we would want to turn on a pass like this by default, e.g. I wouldn't want to include it in the passes run by -O2. It still might be interesting to add to Binaryen as a pass that users (or toolchains) can explicitly opt into, though.

One alternative would be to detect some forms of memory copying or filling loops in the control flow graph instead, and only replace those with intrinsics.

This would be really interesting as a code size optimization even if it didn't turn out to have much performance benefit.

tlively avatar Dec 20 '21 15:12 tlively

I would definitely want to see some measurements showing that using memory.copy and memory.fill would be faster than these Wasm implementations. I'm not sure that the difference would be that large, but it would be interesting to find out.

Oh, they can barely even be compared ... given large enough chunks of memory to work with, at least. Here is a profile, a few seconds long, of a locally deployed https://ruffle.rs/demo running https://z0r.de/L/z0r-de_7617.swf, without bulk-memory: https://share.firefox.dev/3qbU7Ss

And here is the same thing with bulk-memory: https://share.firefox.dev/3EfzycL

The removed memset calls alone accounted for almost 10% of the original total runtime, with another 3.3% remaining.

You can see how much faster vp_idct is, for example. The memset all but disappeared from it. And it's not just inlined or something, it is actually much quicker. While __rust_alloc_zeroed unfortunately stays the same, as mentioned in the original description.

Binaryen generally avoids making assumptions about function naming conventions like these and in general might not get to see the function names

Alright then, CFG matching and surgery it should be. Finding the sweet spot in terms of how long the copied/set chunk of memory has to be for the intrinsic to be faster will be an interesting experiment (my prediction is that it will start to get faster really quickly)...

torokati44 avatar Dec 20 '21 22:12 torokati44

I'd also like to echo that I'm seeing memset and memcpy show up as performance culprits in my Rust Wasm applications (though to a lesser extent).


So it would seem that if wasm-opt gained the proposed functionality, it could have a wide benefit.

nickbabcock avatar Dec 24 '21 15:12 nickbabcock

I have started working on something... https://github.com/torokati44/binaryen/commits/bulkmemory-intrinsics Please excuse the quality, it's just an experimental working draft for myself, and it only does anything with memset. Also, for the moment, it makes things about 3x-4x slower, because it doesn't yet recognize and replace the loop itself, only (part of) its body, which most likely works against the JIT, in addition to the overhead of the intrinsic.

torokati44 avatar Dec 24 '21 16:12 torokati44

Although, the same issue could also be raised for the WASM Rust WG, because the included memset and memcpy implementations (with the bulk-memory codegen option not enabled) process one byte at a time (partially unrolled to batches of 8), with i32 intermediates. It could easily copy/set 4 or even 8 bytes at a time. Not to mention v128 at a time if the SIMD extension is enabled.

torokati44 avatar Dec 24 '21 16:12 torokati44

@nickbabcock The current Rust 1.58.0 beta is performing somewhat better already in this regard, and it should come out as stable in just under two weeks. Maybe you too could give that a try.

torokati44 avatar Dec 31 '21 01:12 torokati44

Thanks for the heads up. Spent a few minutes profiling the 1.58 beta. There does seem to be a small improvement (low single digit percentage). My workload can be characterized by small allocations, so the word-wise copies introduced in 1.58 shouldn't have too big of an impact (and idk, maybe memory.fill won't either).

nickbabcock avatar Jan 01 '22 13:01 nickbabcock

I see. You can try that too with RUSTFLAGS="-C target-feature=+bulk-memory" (or similar), but anything under 10-100 bytes likely won't benefit from it due to the overhead it has. And the built module won't work in Safari 14.x.

torokati44 avatar Jan 01 '22 14:01 torokati44

Nice idea!

What size memcpy/memset are measured here? It definitely makes sense that big ones would get faster with this, but I worry about small ones. The VM may do a call instead of emitting code inline which can be a lot slower potentially. I vaguely remember measurements about that from a few years ago that were not good, but perhaps things have improved.

If we measure that and it's an issue then we could perhaps avoid doing this when the size is constant and small enough.

kripken avatar Jan 05 '22 00:01 kripken

What size memcpy/memset are measured here?

In the linked/attached profiles? I don't know, really. And since the memsets are mostly compiler-generated, I also don't know of any straightforward ways to tell.

Anyway, a fellow developer ran this benchmark (chart not included here). Ignore the wasm-opt in all the legend labels, it didn't make much of a difference as of right now. The line styles correspond to Rust stable (1.57), nightly (1.59), and, I think, also nightly but with bulk-memory enabled.

This is how he summarized it: "main point being, indeed under 8 bytes a memcpy can be 2-3x cheaper than memory.copy"

I'd say 8 bytes is a really low crossover point; I would have guessed/expected something closer to, say, 100.

This was the code he ran:

use wasm_bindgen::prelude::*; // needed for the #[wasm_bindgen] attribute

#[inline(never)]
pub fn bench(dst: &mut Vec<u8>, src: &Vec<u8>) {
    if dst.len() == src.len() {
        // Compiles down to a memcpy (or memory.copy with bulk-memory).
        dst[..].copy_from_slice(src);
    }
}

#[wasm_bindgen]
pub fn work(n: usize, runs: usize) {
    let mut a = vec![0; n];
    let b = vec![0; n];
    for _ in 0..runs {
        bench(&mut a, &b);
    }
}

Maybe, if the size is not a known constant, the optimization could even add a branch, to do it one way or the other based on the number of bytes to be processed...
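In Rust terms, such a runtime dispatch could look something like this sketch (the threshold of 16 is a placeholder, not a measured value, and all names here are hypothetical):

```rust
/// Placeholder crossover size; the real value would come from benchmarks
/// like the one above.
const BULK_THRESHOLD: usize = 16;

/// Hypothetical dispatch the pass could emit when the length is not a
/// compile-time constant: tiny copies take the byte loop, larger ones
/// take the path standing in for `memory.copy`.
fn dispatch_copy(dst: &mut [u8], src: &[u8]) {
    assert!(dst.len() >= src.len()); // assumed precondition
    if src.len() < BULK_THRESHOLD {
        // cheap byte loop for small sizes
        for i in 0..src.len() {
            dst[i] = src[i];
        }
    } else {
        // stands in for the bulk `memory.copy`-backed path
        dst[..src.len()].copy_from_slice(src);
    }
}

fn main() {
    let mut small_dst = vec![0u8; 4];
    dispatch_copy(&mut small_dst, &[1, 2, 3, 4]); // byte-loop branch
    assert_eq!(small_dst, vec![1, 2, 3, 4]);

    let src = vec![9u8; 64];
    let mut big_dst = vec![0u8; 64];
    dispatch_copy(&mut big_dst, &src); // bulk branch
    assert_eq!(big_dst, src);
    println!("ok");
}
```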

torokati44 avatar Jan 05 '22 22:01 torokati44

Thanks @torokati44 ! 8 is indeed fairly small.

There is some data on typical values of memcpy/memset in the wild, like this paper (raw data). The common values are usually very small. So 8 is in the relevant range. (Those links are for LLVM-libc work, it may make sense to learn from them regarding possible solutions at the toolchain level.)

Although, the same issue could also be raised for the WASM Rust WG, because the included memset and memcpy implementations (with the bulk-memory codegen option not enabled) process one byte at a time (partially unrolled to batches of 8), with i32 intermediates. It could easily copy/set 4 or even 8 bytes at a time. Not to mention v128 at a time if the SIMD extension is enabled.

I would recommend doing that. Copying 4 bytes at a time is what emscripten does atm, for example.

And yes, SIMD can do even better. Potentially it could compete with the browser's internal copying.

Maybe the optimization could, if the size is not a known constant, even add a branch as well, to do it one way or another, based on the number of bytes to be done...

Also reasonable. For comparison, in emscripten if the size is large enough we call out to JS and use typed array .set() which is very fast. I assume that uses the same memcpy that bulk memory would but has better backwards compatibility.

kripken avatar Jan 05 '22 23:01 kripken

Just a heads-up: With https://github.com/ruffle-rs/ruffle/pull/5834 just merged, and Rust 1.58 (with its improved built-in memory loops) also being right around the corner, this got a lot less important for me personally, so I don't think I'll pursue this any further in the near future. While it does sound really interesting, I only have so much free time, and so many things I want to do. If anyone wants to pick it up, feel free to look at the branch linked above in my fork; or take some ideas from here, which looks like it does a really similar thing, albeit for a different IR: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp

torokati44 avatar Jan 13 '22 08:01 torokati44