
Allow reserving space in various Vec and HashMaps for performance

Open Ekleog opened this issue 1 year ago • 19 comments

Exposing these interfaces gives a significant performance boost in some use cases, per the numbers listed on https://github.com/near/wasmer/pull/142: as much as a 20-25% performance gain over the whole compiler on some benchmarks (which is what made me notice such an optimization would be useful), without even fine-tuning the parameters.

The thing I'm least certain about is exposing the relocs and relocs_mut functions, but doing so seems to make sense to me given that labels and labels_mut are already exposed.

What do you think about this? :)

Ekleog avatar Sep 29 '22 14:09 Ekleog

Oh, and I forgot: technically, near/wasmer#142 only uses VecAssembler::reserve_ops, LabelRegistry::reserve_dynamic and RelocRegistry::reserve_dynamic, because those were the only ones that appeared in profiles. I'm exposing the other methods here too, because I figure we may as well expose them in case someone wants to try them out, but I'm not sure the benefit would be as high.

Ekleog avatar Sep 29 '22 14:09 Ekleog

Hmm, if you guys are running into this being an issue, then you are probably creating and destroying a significant number of VecAssemblers. I'm guessing one per compiled function? Reserving capacity would likely reduce the time spent in that, but I'd think even more gains would be possible by keeping the VecAssembler around between functions and just clearing it. That would also save on the allocation routines.

Not sure if there's an API for that right now, but it would probably be more intuitive to use than some guess of how much to reserve.

CensoredUsername avatar Sep 30 '22 01:09 CensoredUsername

Hm, so looking at the API,

pub fn finalize(self) -> Result<Vec<u8>, DynasmError>

taking ownership of the VecAssembler is a problem. If it were changed (or an alternative method were added) to take &mut self instead and clear the internal vectors before returning self.ops, that would probably work quite well too. However, that would still be at least slightly inefficient, due to the reallocations that would inevitably happen the first time a larger function is encountered.

On the other hand, a brief look at the implementation of VecAssembler does suggest that implementing a custom storage here is feasible (isn't it?), so any use cases that truly require cutting-edge throughput numbers (and getting those upstream is non-trivial, with implications for API complexity) might consider their own implementation of the DynasmApi trait?
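
To illustrate feasibility, here is a minimal sketch of such a custom storage, assuming DynasmApi's required items are offset(), push(), and align() on top of the Extend<u8> supertraits (check the dynasmrt docs for the authoritative definition); BorrowedAssembler is a made-up name:

```rust
use dynasmrt::{AssemblyOffset, DynasmApi};

// Made-up assembler backed by a caller-owned Vec, so the allocation
// outlives any single function's assembly.
struct BorrowedAssembler<'a> {
    ops: &'a mut Vec<u8>,
    start: usize, // where this function's code begins in `ops`
}

impl Extend<u8> for BorrowedAssembler<'_> {
    fn extend<T: IntoIterator<Item = u8>>(&mut self, iter: T) {
        self.ops.extend(iter);
    }
}

impl<'b> Extend<&'b u8> for BorrowedAssembler<'_> {
    fn extend<T: IntoIterator<Item = &'b u8>>(&mut self, iter: T) {
        self.ops.extend(iter.into_iter().copied());
    }
}

impl DynasmApi for BorrowedAssembler<'_> {
    fn offset(&self) -> AssemblyOffset {
        AssemblyOffset(self.ops.len() - self.start)
    }
    fn push(&mut self, byte: u8) {
        self.ops.push(byte);
    }
    fn align(&mut self, alignment: usize, with: u8) {
        // Pad with `with` bytes until the function-relative offset is aligned.
        while (self.ops.len() - self.start) % alignment != 0 {
            self.ops.push(with);
        }
    }
}
```

Label support would additionally require implementing DynasmLabelApi, which is where the registries come in.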

nagisa avatar Sep 30 '22 11:09 nagisa

As @nagisa said, I don't think there's a way to reuse a backing Vec right now, and that makes sense to me given that VecAssembler semantically owns the Vec and relinquishes it upon .finalize(). An alternative to the .reserve_ops functions would be a new_with_capacity builder, but I think it's less flexible and yet more API surface to expose.

Also, just implementing our own DynasmApi would not really be enough: while it would indeed help for the code Vec, we'd still have trouble with the LabelRegistry and the RelocRegistry holding Vecs that can't be .reserve()d (unless the other parts of this PR land).

Ekleog-NEAR avatar Sep 30 '22 14:09 Ekleog-NEAR

I'm running into a similar issue: I'm compiling many tiny functions, and the overhead of allocation and mprotect is a big deal.

[Screenshot: profiler output showing substantial time spent in allocation and mprotect]

The idea of implementing a custom DynasmApi is interesting, since I'm already not using any labels (due to #69)...

mkeeter avatar Oct 09 '22 17:10 mkeeter

Are you sure you're using VecAssembler? VecAssembler should not result in any mprotect calls (and switching to it could be the first thing to try to improve the performance in your project).

nagisa avatar Oct 09 '22 18:10 nagisa

Good catch, the screenshot was using the stock Assembler; my comment was more of a +1 to the notion of reusing structures in general.

I've now switched to a VecAssembler and am attempting to reuse memmap2 regions manually, which is failing in various cryptic and hilarious ways (which are probably my fault, so I won't clutter the issue with them).

mkeeter avatar Oct 10 '22 14:10 mkeeter

@mkeeter if you're noticing too much mprotect overhead with the stock assemblers, the best things to look at are reducing the number of times you call commit (you don't have to call it after every function; you only need it before you intend to actually run newly jit'ed code), or splitting the assembled code into multiple chunks that don't need to keep constant offsets between them, so you don't end up mprotecting the entire buffer all the time, as that tends to be an O(size_of_buffer) operation due to the required page-table editing. It's probably possible to mprotect only some parts of the internal buffer, but memmap2 doesn't expose that.
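
To illustrate the batching idea (a sketch against the current stock Assembler API; the function body is a stand-in):

```rust
use dynasmrt::{dynasm, DynasmApi};

fn jit_batch() -> Result<(), Box<dyn std::error::Error>> {
    let mut ops = dynasmrt::x64::Assembler::new()?;
    let mut entries = Vec::new();

    for _ in 0..1000 {
        entries.push(ops.offset());
        // Placeholder body; real codegen goes here.
        dynasm!(ops
            ; .arch x64
            ; mov eax, 42
            ; ret
        );
        // Deliberately no ops.commit() here: each commit mprotects the
        // whole buffer, so committing per function scales with buffer size.
    }

    ops.commit()?; // one mprotect for the entire batch

    let reader = ops.reader();
    let buf = reader.lock();
    let f: extern "C" fn() -> i32 = unsafe { std::mem::transmute(buf.ptr(entries[0])) };
    assert_eq!(f(), 42);
    Ok(())
}
```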

@nagisa: I'm planning to do the following:

  • Add an API to LabelRegistry to pre-allocate a certain number of labels in the different stores
  • Add an API to LabelRegistry to logically clear its contents while retaining the backing buffer, akin to Vec::clear()
  • Add an API to VecAssembler akin to Vec::clear(), which clears the internal labels and buffer
  • Add an API to VecAssembler akin to Vec::drain() combined with .finalize(), as well as one akin to mem::take()ing the internal buffer after finalizing. So if you truly want no new allocations you can do your_own_buf.extend(VecAssembler::drain())
  • Change the hasher used in LabelRegistry to one that is faster for small strings, and probably not a cryptographically resilient one. This shouldn't be a problem, as this API is normally only called with compile-time-fixed strings.

These should all result in improved performance and allow eliminating allocations after setup; the intended usage would look roughly like the sketch below.
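
```rust
use dynasmrt::{dynasm, DynasmApi, VecAssembler};
use dynasmrt::x64::X64Relocation;

// Sketch against the planned API: drain() is not released yet, so its
// exact signature is tentative; per the description above it yields the
// assembled bytes while keeping the assembler's internal allocations
// (code buffer, label and reloc registries) for the next function.
// Error handling is elided for the same reason.
fn compile_all(count: usize) -> Vec<u8> {
    let mut asm = VecAssembler::<X64Relocation>::new(0);
    let mut out = Vec::new();
    for _ in 0..count {
        dynasm!(asm // placeholder body; real codegen goes here
            ; .arch x64
            ; mov eax, 42
            ; ret
        );
        out.extend(asm.drain()); // no new allocations in the steady state
    }
    out
}
```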

CensoredUsername avatar Oct 10 '22 19:10 CensoredUsername

These changes are now live on dev. Would anyone be able to provide some benchmarks to judge whether there's still some performance left on the table? (In particular, how much of the assembling time is now spent on label resolution and any remaining allocation.)

CensoredUsername avatar Oct 10 '22 21:10 CensoredUsername

Cool, I'll take a look in the next day or two and post some benchmarks. In case you're curious, my hilarious failures turned out to be dcache and icache getting out of sync 😱 (https://github.com/RazrFalcon/memmap2-rs/issues/62)

mkeeter avatar Oct 11 '22 01:10 mkeeter

@mkeeter ah yeah, that's a pain on aarch64. I'm still working on proper handling of that myself (see the next_major_release branch). It is quite the pain, especially since you have to do it on a cache-line basis, and big.LITTLE chips exist where the cache line size isn't even consistent across cores on the same processor: see https://www.mono-project.com/news/2016/09/12/arm64-icache/
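
For reference, the core of that flush, leaning on compiler-rt's __clear_cache rather than a hand-rolled cache-line loop (a sketch; it assumes the symbol is provided by the toolchain, and the linked article explains why even this can be subtle on big.LITTLE parts):

```rust
use std::os::raw::c_char;

// Sketch: after writing code on aarch64, the dcache must be cleaned and
// the icache invalidated before jumping to it. __clear_cache (from
// libgcc/compiler-rt) does the per-cache-line work for the given range.
extern "C" {
    fn __clear_cache(start: *mut c_char, end: *mut c_char);
}

unsafe fn flush_icache(code: *mut u8, len: usize) {
    __clear_cache(code.cast::<c_char>(), code.add(len).cast::<c_char>());
}
```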

CensoredUsername avatar Oct 11 '22 09:10 CensoredUsername

The changes on dev make a meaningful difference! Here's what I'm seeing on one test case, logging time spent in the JIT:

  • Version 1.2.3: 19.2 ms
  • dev: 13.7 ms
  • dev (with hand-computed jumps): 7.85 ms

In this one test case, the overhead is reduced by about 50%, which is definitely significant.

mkeeter avatar Oct 11 '22 12:10 mkeeter

Interesting. Is this just a drop-in replacement (different hashing algorithm), or does this include reuse of VecAssemblers?

CensoredUsername avatar Oct 11 '22 13:10 CensoredUsername

This is just swapping dynasm = 1.2.3 to dynasm = { git = ..., branch = "dev" } in Cargo.toml, tested on an older branch that was using the stock Assembler.

(Right now, I'm using a VecAssembler, and manually managing memory maps to allow for reuse. I'm planning to switch away from memmap2 in favor of manually using libc::mmap and using the W^X strategy from this article)
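
(For the curious, the core of that plan looks roughly like this, using the libc crate; macOS-specific extras like MAP_JIT and its JIT write-protection dance are omitted, and on aarch64 the icache flush discussed above is still needed afterwards.)

```rust
use std::ptr;

// Rough W^X sketch: map the region read+write, copy the code in, then
// flip it to read+execute so it is never writable and executable at the
// same time. Error handling reduced to asserts for brevity.
unsafe fn make_executable(code: &[u8]) -> *const u8 {
    let page = libc::sysconf(libc::_SC_PAGESIZE) as usize;
    let size = (code.len() + page - 1) & !(page - 1); // page-align (page is a power of two)
    let map = libc::mmap(
        ptr::null_mut(),
        size,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
        -1,
        0,
    );
    assert_ne!(map, libc::MAP_FAILED);
    ptr::copy_nonoverlapping(code.as_ptr(), map.cast::<u8>(), code.len());
    assert_eq!(libc::mprotect(map, size, libc::PROT_READ | libc::PROT_EXEC), 0);
    map as *const u8
}
```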

mkeeter avatar Oct 11 '22 14:10 mkeeter

Neat, so that speedup was just from the improved hasher. The stock Assembler shouldn't have any perf issues due to allocations to begin with, as it was always reusing them.

CensoredUsername avatar Oct 11 '22 16:10 CensoredUsername

On my (real-world-ish) test case, here are the performance results:

  • Before any change: 28.5-29ms
  • By allocating only the label hashmaps that actually deserve it (i.e. using this PR): 22.5-23ms
  • Without any allocation, but with a dynasm upgrade to dev: 27.5-28ms
  • By allocating using new_with_capacity, overallocating the label types we don’t need (as new_with_capacity doesn’t allow hand-picking which label types to preallocate for): 24-24.5ms
  • By allocating only code using new_with_capacity: 25-25.5ms
  • By reusing the VecAssembler using take(): 23-23.5ms
  • By reusing the VecAssembler using drain(): 19.5-20ms
  • By reusing the VecAssembler using .take() plus pre-allocating using a .reserve() taken from this PR: 20-20.5ms

So, my interpretation is:

  • The fnv switch gets us ~3%
  • Preallocating code, labels and relocs of only the types requested gets us ~20%
  • Preallocating just code gets us ~9%
  • Preallocating code and labels of all the types (with the new_with_capacity API) gets us ~13%
  • Reusing the VecAssembler using .take() gets us ~16%
  • Reusing the VecAssembler using .drain() gets us ~28%
  • Reusing the VecAssembler using .take() plus pre-allocating using a .reserve() taken from this PR, which I’d expect to be faster as it removes a copy, actually ends up slower; not sure why exactly. Might be a side-effect from the fact that all my functions are pretty small and all the same size, and so my "random guess" 100k preallocation for each function that I used in order to avoid overfitting my test is way oversized and reusing the same VecAssembler consumes less memory

So overall I’m happy with this commit! Though I would have liked at least a .reserve() function for the code vec, as the compiler can often guess how much assembly it’s going to emit based on the size of its input… it’s probably not a big deal. The only thing that makes me sad is that I’ll have to figure out a way to reuse the VecAssembler within rayon, which’ll be… fun… but that is my problem to handle 😅

Thank you for your changes! Out of curiosity, when do you think they could be released? :)

Ekleog-NEAR avatar Oct 12 '22 13:10 Ekleog-NEAR

The only thing that makes me sad is that I’ll have to figure out a way to reuse the VecAssembler within rayon, which’ll be… fun… but that is my problem to handle 😅

You probably just want one per thread, or per unit of work you're throwing at rayon.
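
For example, with rayon's map_init, which builds one value per worker task and reuses it across the items that worker processes (a sketch: emit is a placeholder for the real codegen, and take()'s exact signature on dev is assumed to be take(&mut self) -> Result<Vec<u8>, DynasmError>):

```rust
use dynasmrt::x64::X64Relocation;
use dynasmrt::VecAssembler;
use rayon::prelude::*;

// Hypothetical input type and codegen for illustration.
struct Function;
fn emit(asm: &mut VecAssembler<X64Relocation>, _func: &Function) { /* dynasm!(asm ...) */ }

fn compile_parallel(functions: &[Function]) -> Vec<Vec<u8>> {
    functions
        .par_iter()
        .map_init(
            // One assembler per worker, reused for every item it handles.
            || VecAssembler::<X64Relocation>::new(0),
            |asm, func| {
                emit(asm, func);
                asm.take().expect("assembly should not fail")
            },
        )
        .collect()
}
```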

Interesting to see the differences between your respective workloads, by the way. @mkeeter is seeing almost all of the overhead in label hashing, while in your case most of the win comes from avoiding allocations. Adding a .reserve() for the backing vec won't be much of a problem, I'll just add that one as well. If you already have a with_capacity it feels quite natural anyway.

The perf difference makes sense though: the normal Assembler already had internally all the niceties that VecAssembler now has as well, so he could only be seeing the remaining perf gains in label hashing. Outside of that, I'd say that @Ekleog-NEAR is doing much more work while assembling, so the percentage perf increase from the label hashing is smaller.

I'll look into providing a small set of non-hashmap-lookup labels akin to lua-dynasm's local labels; then we can see whether the remaining overhead is from the hashing or just from the cost of relocation interpolation compared to compile-time resolution.

As for when it's released: I've been a bit busy recently, but I'm aiming to work through the bug reports on some of my libraries in the next two weeks. Your quick response with the benchmarks is greatly appreciated for that! If I can get it done in time I'd like to do a new major release with the next_major_release changes as well, but those still need some looking at, as I currently don't have a good aarch64 box to test cache-invalidation shenanigans on. If any of you knows a decently cheap VPS for a bit of work on one of those, please tell me.

CensoredUsername avatar Oct 12 '22 15:10 CensoredUsername

Awesome, thank you! As for a cheap aarch64 box, I unfortunately don't know of a provider, but I do have an aarch64 Mac at hand if you want me to test things; just give me the command line to run and I'll try it out and report :)

Ekleog avatar Oct 12 '22 15:10 Ekleog

https://cfarm.tetaneutral.net/ exists exactly for that purpose and has a variety of aarch64 machines. Try there, perhaps?

nagisa avatar Oct 12 '22 15:10 nagisa