dynasm-rs
Allow reserving space in various Vec and HashMaps for performance
Exposing these interfaces gives a significant performance boost in some use cases, as per the numbers listed on https://github.com/near/wasmer/pull/142: as much as a 20-25% performance gain over the whole compiler on some benchmarks (which is what made me notice such an optimization would be useful), without even fine-tuning parameters.
The thing I'm least certain about is exposing the relocs and relocs_mut functions, but they seem to make sense to me given how labels and labels_mut are already exposed.
What do you think about this? :)
Oh, and I forgot: technically, near/wasmer#142 only uses VecAssembler::reserve_ops, LabelRegistry::reserve_dynamic and RelocRegistry::reserve_dynamic, because those were the only ones that appeared in profiles. I'm exposing the other methods here too, figuring we may as well make them available in case someone wants to try them out, but I'm not sure the benefit people would gain from them would be as high.
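For the record, here is roughly what those reservations look like at the call site (method names as in this PR; the sizes are purely illustrative, and labels_mut()/relocs_mut() are assumed to be the accessors for the LabelRegistry and RelocRegistry):

```rust
// Illustrative sizes only; exact signatures may differ from what lands.
let mut ops = dynasmrt::VecAssembler::<dynasmrt::x64::X64Relocation>::new(0);
ops.reserve_ops(16 * 1024);             // expected upper bound on emitted code bytes
ops.labels_mut().reserve_dynamic(128);  // expected number of dynamic labels
ops.relocs_mut().reserve_dynamic(128);  // expected number of dynamic relocations
```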
Hmm, if you guys are running into this being an issue, then you are probably creating and destroying significant amounts of VecAssemblers. I'm guessing one per function created? Reserving capacity would likely reduce time spent in that but I'd think that even more gains would be possible by keeping the VecAssembler around after a function and just clearing it. That would also save on the allocation routines.
Not sure if there's an API for that right now, but this would probably be more intuitive to use than some guess of how much to reserve.
Hm, so looking at the API, pub fn finalize(self) -> Result<Vec<u8>, DynasmError> taking ownership of the VecAssembler is a problem. If it was changed (or an alternative method was added) to take &mut self instead and cleared the internal vectors before returning self.ops, that would probably work quite well too. However, that would still be at least slightly inefficient due to the reallocations that would inevitably happen the first time a larger function is encountered.
On the other hand, a brief look at the implementation of VecAssembler does suggest that implementing a custom storage here is feasible (isn't it?), so any use cases that truly require cutting-edge throughput numbers (and getting those upstream is non-trivial, with implications for API complexity) might consider their own implementation of the DynasmApi trait?
As @nagisa said, I don't think there's a way to reuse a backing Vec right now, and it makes sense to me given VecAssembler semantically has ownership of the Vec and relinquishes it upon .finalize(). An alternative to the .reserve_ops functions would be a new_with_capacity builder, but I think it's less flexible and yet more API surface that'd be exposed.
Also, just implementing our own DynasmApi would not really be enough: while it would indeed help for the code Vec, we'd still have trouble with the LabelRegistry and the RelocRegistry having Vecs that can't be .reserve()d (unless the other parts of this PR land).
I'm running into a similar issue: I'm compiling many tiny functions, and the overhead of allocation and mprotect is a big deal.

The idea of implementing a custom DynasmApi is interesting, since I'm already not using any labels (due to #69)...
Are you sure you're using VecAssembler? VecAssembler should not result in any mprotect calls (and could be the first thing to try to improve the performance in your project).
Good catch, the screenshot was using the stock Assembler; my comment was more of a +1 to the notion of reusing structures in general. I've now switched to a VecAssembler and am attempting to reuse memmap2 regions manually, which is failing in various cryptic and hilarious ways (which are probably my fault, so I won't clutter the issue with them).
@mkeeter if you're noticing too much mprotect overhead with the stock assemblers, the best things to look at are reducing the number of times you call commit (you don't have to call it after every function, you only need it before you intend to actually run newly jit'ed code), or splitting the assembled code into multiple chunks that don't need to keep constant offsets between them, so you don't end up having to mprotect the entire buffer all the time, as that tends to be an O(size_of_buffer) operation due to the required pagetable editing. It's probably possible to mprotect only some parts of the internal buffer, but memmap2 doesn't expose that.
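To illustrate the batching idea, a minimal sketch (the emitted instructions and counts are placeholders, and the exact dynasmrt 1.x signatures may differ slightly):

```rust
use dynasmrt::{dynasm, DynasmApi};

// Emit many small functions, remember their entry offsets, and pay for
// commit (and thus mprotect) only once for the whole batch.
fn assemble_batch(n: usize) -> (dynasmrt::x64::Assembler, Vec<dynasmrt::AssemblyOffset>) {
    let mut ops = dynasmrt::x64::Assembler::new().unwrap();
    let mut entries = Vec::with_capacity(n);
    for _ in 0..n {
        entries.push(ops.offset());
        dynasm!(ops
            ; .arch x64
            ; mov eax, 1   // placeholder body
            ; ret
        );
    }
    ops.commit().unwrap(); // one mprotect for the entire batch
    (ops, entries)
}
```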
@nagisa: I'm planning to do the following:
- Add an API to LabelRegistry to pre-allocate a certain number of labels in the different stores
- Add an API to LabelRegistry to logically clear its contents while retaining the backing buffer, akin to Vec::clear()
- Add an API to VecAssembler akin to Vec::clear(), which clears the internal labels and buffer
- Add an API to VecAssembler akin to Vec::drain() combined with .finalize(), as well as one akin to mem::take()ing the internal buffer after finalizing. So if you truly want no new allocations you can do your_own_buf.extend(VecAssembler::drain())
- Change the hasher used in LabelRegistry to be one faster for small strings, and probably not a cryptographically resilient one. This shouldn't be a problem as normally this API is only called with compile-time fixed strings.
These should all result in improved performance and allow eliminating allocations after setup.
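Once those land, usage could look roughly like this (names taken from the list above; exact signatures and error handling are still to be determined):

```rust
// Hypothetical usage of the planned APIs: one long-lived VecAssembler, with
// each function's bytes drained into a caller-owned buffer so that no
// per-function allocations remain after warm-up.
let mut ops = dynasmrt::VecAssembler::<dynasmrt::x64::X64Relocation>::new(0);
let mut all_code: Vec<u8> = Vec::new();

for _ in 0..100 {
    // ... emit one function into `ops` here ...
    all_code.extend(ops.drain()); // move the bytes out, keep the backing buffers
}
```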
These changes are now live on dev, would anyone be able to provide some benchmarks to judge if there's still some performance left on the table? (Particularly, how much of the assembling time is now spent on label resolution and any remaining allocation.)
Cool, I'll take a look in the next day or two and post some benchmarks. In case you're curious, my hilarious failures turned out to be the dcache and icache getting out of sync 😱 (https://github.com/RazrFalcon/memmap2-rs/issues/62)
@mkeeter ah yeah that's a pain on aarch64. I'm still working on proper handling of that myself (see the next_major_release branch). It is quite the pain, especially since you have to do it on a cache line basis and big.LITTLE chips exist where the cache line size isn't even consistent across cores on the same processor: see https://www.mono-project.com/news/2016/09/12/arm64-icache/
The changes on dev make a meaningful difference! Here's what I'm seeing on one test case, logging time spent in the JIT:
- Version 1.2.3: 19.2 ms
- dev: 13.7 ms
- dev (with hand-computed jumps): 7.85 ms
In this one test case, the overhead is reduced by about 50%, which is definitely significant.
Interesting. Is this just with a drop-in replacement (different hashing algorithm) or is this including reuse of VecAssemblers?
This is just swapping dynasm = "1.2.3" to dynasm = { git = ..., branch = "dev" } in Cargo.toml, tested on an older branch that was using the stock Assembler.
(Right now, I'm using a VecAssembler, and manually managing memory maps to allow for reuse. I'm planning to switch away from memmap2 in favor of manually using libc::mmap and the W^X strategy from this article.)
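For reference, the W^X flow there boils down to mapping pages read+write, copying the generated code in, and only then flipping them to read+execute. A minimal libc sketch, with error handling and the aarch64 cache maintenance discussed above omitted:

```rust
use std::ptr;

// Map one page RW, copy the generated code in, then make it RX; the page is
// never writable and executable at the same time. Page size is assumed to be
// 4 KiB here for simplicity.
unsafe fn make_executable(code: &[u8]) -> *const u8 {
    let size = 4096;
    assert!(code.len() <= size);
    let map = libc::mmap(
        ptr::null_mut(),
        size,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
        -1,
        0,
    );
    assert_ne!(map, libc::MAP_FAILED);
    ptr::copy_nonoverlapping(code.as_ptr(), map as *mut u8, code.len());
    libc::mprotect(map, size, libc::PROT_READ | libc::PROT_EXEC);
    map as *const u8
}
```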
Neat, that speedup was then just the improved hasher. The stock Assembler shouldn't have any perf issues due to allocations to begin with, as it was always reusing them.
On my (real-world-ish) test case, here are the performance results:
- Before any change: 28.5-29 ms
- By allocating only the label hashmaps that actually deserve it (i.e. using this PR): 22.5-23 ms
- Without any allocation, but with a dynasm upgrade to dev: 27.5-28 ms
- By allocating using new_with_capacity, overallocating the label types we don't need (as new_with_capacity doesn't allow hand-picking which label types to preallocate for): 24-24.5 ms
- By allocating only code using new_with_capacity: 25-25.5 ms
- By reusing the VecAssembler using take(): 23-23.5 ms
- By reusing the VecAssembler using drain(): 19.5-20 ms
- By reusing the VecAssembler using .take() plus pre-allocating using a .reserve() taken from this PR: 20-20.5 ms
So, my interpretation is:
- The fnv switch gets us ~3%
- Preallocating code, labels and relocs of only the types requested gets us ~20%
- Preallocating just code gets us ~9%
- Preallocating code and labels of all the types (with the new_with_capacity API) gets us ~13%
- Reusing the VecAssembler using .take() gets us ~16%
- Reusing the VecAssembler using .drain() gets us ~28%
- Reusing the VecAssembler using .take() plus pre-allocating with a .reserve() taken from this PR, which I'd expect to be faster as it removes a copy, actually ends up slower; I'm not sure why exactly. It might be a side effect of the fact that all my functions are pretty small and all the same size, so the "random guess" 100k preallocation for each function that I used in order to avoid overfitting my test is way oversized, and reusing the same VecAssembler consumes less memory.
So overall I'm happy with this commit! Though I would kind of have liked to have a .reserve() function for the code vec at least, as the compiler could often guess how much assembly it's going to emit based on the size of its input… it's probably not a big deal. The only thing that makes me sad is that I'll have to figure out a way to reuse the VecAssembler within rayon, which'll be… fun… but that is my problem to handle 😅
Thank you for your changes! Out of curiosity, when do you think they could be released? :)
The only thing that makes me sad is that I’ll have to figure out a way to reuse the VecAssembler within rayon, which’ll be… fun… but that is my problem to handle 😅
You probably just want one per thread, or task unit you're throwing at rayon.
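For illustration, one way to get that with rayon is map_init, which creates the per-worker state lazily; Function and compile_one here are hypothetical stand-ins for your IR and per-function codegen:

```rust
use dynasmrt::x64::X64Relocation;
use dynasmrt::VecAssembler;
use rayon::prelude::*;

// Hypothetical stand-ins; only the map_init shape matters here.
struct Function;
fn compile_one(_ops: &mut VecAssembler<X64Relocation>, _f: &Function) -> Vec<u8> {
    Vec::new() // placeholder
}

fn compile_all(functions: &[Function]) -> Vec<Vec<u8>> {
    functions
        .par_iter()
        // map_init builds the state once per rayon work split, so each worker
        // reuses one VecAssembler (and its allocations) for all its items.
        .map_init(
            || VecAssembler::<X64Relocation>::new(0),
            |ops, f| compile_one(ops, f),
        )
        .collect()
}
```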
Interesting to see the differences between your respective workloads, by the way. @mkeeter is seeing almost all of the overhead in label hashing, while in your case it's a much smaller share of the total time. Adding a .reserve() for the backing vec won't be much of a problem, I'll just add that one as well. If you already have a with_capacity it feels quite natural anyway.
The perf difference makes sense though, as the normal Assembler already had all the niceties that VecAssembler now has as well internally, so he could only be seeing the remaining perf gains in label hashing. Outside of that, I'd say that @Ekleog-NEAR is doing much more work while assembling, so the percentual perf increase from the label hashing is smaller.
I'll look into providing a small set of non-hashmap lookup labels akin to lua-dynasm's local labels, then we can see if the overhead is still from the remaining hashing or if it is from just the cost of relocation interpolation compared to compile-time resolution.
As for when it's released, I've been a bit busy recently, but I'm aiming to work through the bug reports on some of my libraries over the next two weeks. Your quick response with the benchmarks is greatly appreciated for that! If I can get it done in time I'd like to do a new major release with the next-major-release changes as well, but those still need some looking at, as currently I don't have a good aarch64 box to test cache invalidation shenanigans with. If any of you knows a decently cheap VPS for a bit of work on one of those, please tell me.
Awesome thank you! As for a cheap aarch64 box, I unfortunately don't know of a provider, but I do have an aarch64 mac at hand if you want me to test things, just give me the command line to run and I can try it out and report :)
https://cfarm.tetaneutral.net/ exists exactly for that purpose and has a variety of aarch64 machines. Try there, perhaps?
@CensoredUsername Hey! Just coming to check when you think you'd have a chance to release? (As you mentioned two weeks about a month ago, I'm just wondering whether maybe you forgot 😅)
(Oh, and also just wondering about the reserve_ops introduced in this PR, if you're on the same page as me about it making sense to have it alongside new_with_capacity.)
@Ekleog-NEAR sorry for the bad estimation, I ended up being busier than expected, and only just got back in the country. Should have some time again now to work on it though 👍
No problem, and thank you for all you do on dynasm-rs anyway! Please let me know if there’s anything I can do to help out :)
I also noticed that we might still be dealing with some unnecessary allocator usage right now: the way RelocRegistry works means it will allocate a whole Vec for every local label reference, even if said reference is only used once. I was first planning to swap those for smallvecs to see the result, but I've just realized that there should be a way to remove this double indirection. Right now, local label references are required to be resolved as soon as they are encountered, as the information for that local label is thrown away as soon as a new one is defined. However, if we were to store not just the label's name but also a sort-of version number, for both references and labels, we should be able to remove those allocations at the price of a single extra hashmap lookup at label definition time to figure out the version.
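A conceptual sketch of that versioning idea (not the actual dynasm-rs internals, and only covering forward references) might look like this:

```rust
use std::collections::HashMap;

// Conceptual only: references record (name, targeted version, offset) in one
// flat Vec instead of each label owning its own Vec of unresolved references.
struct LocalLabels {
    versions: HashMap<&'static str, usize>,     // how many times each name has been defined
    pending: Vec<(&'static str, usize, usize)>, // (name, targeted version, reloc offset)
}

impl LocalLabels {
    // A forward reference (`>name`) targets the *next* definition of `name`.
    fn forward_reference(&mut self, name: &'static str, offset: usize) {
        let target = self.versions.get(name).copied().unwrap_or(0) + 1;
        self.pending.push((name, target, offset));
    }

    // Defining a label costs one extra hashmap lookup, but no per-label allocation.
    fn define(&mut self, name: &'static str) -> usize {
        let v = self.versions.entry(name).or_insert(0);
        *v += 1;
        *v
    }
}
```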
That sounds great, fewer allocations is always better! Though since reading http://troubles.md/posts/improving-smallvec/ I’ve been a bit doubtful about SmallVec, and have tended to just go with using Vec with custom allocators when faced with performance issues related to lots of small allocations.
That said, I'm not sure I really understand how that'd work, as AFAICT labels never interact with RelocRegistry and LabelRegistry doesn't allocate a Vec per label. So if you want me to help implement it I'd need a bit more detail; if not, I don't need to understand for now and can read your code later to understand :)
@Ekleog-NEAR with the approach I'm working on right now Smallvec shouldn't even be needed to begin with. I'm testing it myself right now, and mostly just need to write a few extra tests to confirm behaviour didn't change. I'll push a commit to dev so you can also try it after that.
This ended up being a small refactor that allowed some of the runtime to be simplified, with a common code path for both local and global labels and no more need to frequently interpolate local labels before commit time. There are now only 3 Vecs and 2 hashmaps with POD values that get appended to when labels are referenced, so hopefully this'll recover any remaining performance.
Also I just realized, the best way to benchmark the cost of hashing would just be to compare it with using dynamic labels for everything. Those internally are just indexes into an array. @mkeeter maybe something for you to try out already? It should show what the cost is of the interpolation itself (compared to compile-time pre-calculation).
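For reference, dynamic labels are created up front and then referenced with => in the assembly; a rough sketch, assuming an existing x64 assembler ops:

```rust
// Sketch only: a forward jump through a dynamic label, which internally is
// just an index into an array rather than a hashed name.
let skip = ops.new_dynamic_label();
dynasm!(ops
    ; .arch x64
    ; test rbx, rbx
    ; jz =>skip
    ; idiv rbx
    ; =>skip
);
```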
Sounds great to me! I’ll be happy to test with our wasmer as soon as you push the commit to dev; just let me know when you do so :)
As for the cost of hashing, actually what you just said explains why we didn’t see any impact of hashing changes: we’re basically only using dynamic labels in wasmer — though I can’t assert this is true for all the code I can definitely say that all the labels I have seen were dynamic.
It's been pushed @ 181bc13. It now also allows you to tune the size allocated to each internal buffer. Note that this one does have some breaking changes exposed so it'll have to be a major release when it goes.
A big thing these changes ensure is that defining a local label somewhere and referencing them is now much cheaper. This really affects sequences like the following.
test rbx, rbx
jz >divzero
idiv rbx
divzero:
How this used to work is that at the initial encounter of the divzero label, the runtime would create a new vec of unresolved references to the divzero label, and force an allocation of that by putting the relocation for the jump there. Then, when the label was actually encountered, it'd remove that entire vec from the unresolved-reference hashmap again and process those references immediately. It was a very simple strategy for providing local labels, but it did ensure that literally any forward local label caused an allocation there. As this is the most common use case for these labels, this was probably the most significant contributor to allocator thrashing.
How it now works does cause some of the running buffers to grow a bit larger between commits (as every single relocation is now only processed at commit time), but the buffers can be sized appropriately to make that a non-issue.
If you really aren't using any local labels as well then I don't think there's any more performance to be gained from my side. dynamic labels are already just plain indices to an array of to-be-resolved targets. Either way, I await your benchmarks fondly.
Sounds great to me! I'm now getting a 21% performance gain compared to the last release (16.4 -> 12.9), or a 31% gain (16.4 -> 11.3) with the additional hack to allow pre-reserving for ops after a .take() that resets the code vec to 0.
With just the addition of this function, which would allow removing that hack from my code, this'd be good to go for me!
Thank you for your help in all this :)
Actually, never mind what I said about the addition of reserve_ops: with this change I've checked and .drain() does not seem to reduce the capacity of the Vec, so the backing allocation can be reused without having to re-reserve.
I'd still like to have it if possible, because I'm expecting further performance work will need it, but even as-is this code is good enough to bring significant performance improvements to our wasmer, thank you!