WIP: BatchIt
As part of the assessment of #634, but also perhaps more generally useful. Opinions welcome.
Looks like it is leaking in some cases: https://github.com/microsoft/snmalloc/actions/runs/6279469686/job/17055222812?pr=637#step:7:149
Whoops; I had the loop termination conditions wrong. They're fixed now, I think. Let's see if CI agrees.
After discussions with @mjp41 yesterday, I've introduced a notion of "tweakable obfuscation" and have made all the intra-slab free lists' backwards signatures use the address of the slab metadata as the "tweak". The next step would be to remove the per-thread keys and have everyone use a common global key (probably not RemoteAllocator::key_global!) and apply the same tweaking. This opens the door to sending threads being able to build up segments of slab free lists that can be spliced in by the recipient in O(1) rather than O(n).
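To make the O(1) splice idea concrete, here is a minimal sketch, not the code in this PR. It assumes a shared key, a per-slab tweak (standing in for the slab metadata address), and a multiplicative signature; `Key`, `Node`, `Segment`, `sign`, `segment_push`, `splice_front` and `verify` are all hypothetical names. The sender signs every edge inside its segment, so the recipient only has to sign the single edge that joins the segment to its existing list.

```cpp
#include <cstdint>

// Hypothetical types for illustration; not snmalloc's actual free-list code.
struct Key
{
  std::uintptr_t k1, k2; // assumed to be one global key shared by all threads
};

struct Node
{
  Node* next = nullptr;
  std::uintptr_t prev_sig = 0; // obfuscated "backwards" signature of the incoming edge
};

// Sign the edge prev -> curr, mixing in a per-slab tweak.
inline std::uintptr_t
sign(const Node* prev, const Node* curr, const Key& key, std::uintptr_t tweak)
{
  auto p = reinterpret_cast<std::uintptr_t>(prev);
  auto c = reinterpret_cast<std::uintptr_t>(curr);
  return (p + key.k1) * (c + (key.k2 ^ tweak));
}

// A segment of one slab's free list, built and signed by the sending thread.
struct Segment
{
  Node* head = nullptr;
  Node* tail = nullptr;
};

inline void
segment_push(Segment& s, Node* n, const Key& key, std::uintptr_t tweak)
{
  n->next = nullptr;
  if (s.head == nullptr)
  {
    s.head = n;
  }
  else
  {
    s.tail->next = n;
    n->prev_sig = sign(s.tail, n, key, tweak);
  }
  s.tail = n;
}

// Recipient side: O(1) splice. Only the joining edge needs a new signature.
inline Node* splice_front(
  Node* list_head, Segment seg, const Key& key, std::uintptr_t tweak)
{
  if (seg.head == nullptr)
    return list_head;
  seg.tail->next = list_head;
  if (list_head != nullptr)
    list_head->prev_sig = sign(seg.tail, list_head, key, tweak);
  return seg.head;
}

// Walk the list and check every edge's signature against key and tweak.
inline bool verify(const Node* head, const Key& key, std::uintptr_t tweak)
{
  for (const Node* p = head; p != nullptr && p->next != nullptr; p = p->next)
  {
    if (p->next->prev_sig != sign(p, p->next, key, tweak))
      return false;
  }
  return true;
}
```

Because the tweak is part of every signature, a segment signed for one slab should fail verification if it is spliced into another slab's list.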
I've (at long last) got things flying end to end with a very simple "cache" on the sending side -- a single open ring -- but I think some review and investigation is a good idea. Here's what mimalloc-bench makes of the current state of things in terms of time
and memory
We should:
- figure out how to make the caching layer optional, which probably just means some more std::conditional_t use.
- get someone with chops to assess the "tweaked obfuscation" changes.
- offer to randomize deallocator cache construction order (Matt writes: "as we are building a ring, we can add to the start or the end, so perhaps we could at least build an unpredictable order in the ring")
- offer probabilistic premature eviction from the deallocator caches to further thwart attempts to control free-list order
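The last two items could look roughly like the sketch below. This is not code from the PR: `RandomisedRing`, `XorShift64` and `evict_bits` are hypothetical names, the xorshift generator merely stands in for whatever entropy source the allocator would really use, and a `std::deque` stands in for the in-band ring.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Tiny xorshift64 generator; a placeholder for the allocator's real entropy.
// The state must be seeded with a non-zero value.
struct XorShift64
{
  std::uint64_t state;

  std::uint64_t next()
  {
    state ^= state << 13;
    state ^= state >> 7;
    state ^= state << 17;
    return state;
  }
};

template<typename T>
struct RandomisedRing
{
  std::deque<T> items;
  XorShift64 rng;
  unsigned evict_bits; // evict early with probability 2^-evict_bits (must be < 64)

  // Add an item at a randomly chosen end of the ring.
  // Returns true when the caller should flush the ring prematurely.
  bool add(T item)
  {
    std::uint64_t r = rng.next();
    if (r & 1)
      items.push_front(item);
    else
      items.push_back(item);

    std::uint64_t mask = (std::uint64_t{1} << evict_bits) - 1;
    return ((r >> 1) & mask) == 0;
  }
};
```

One random word serves both decisions here: the low bit picks the end of the ring, and the next `evict_bits` bits decide whether to evict early.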
Two things to address:
- Randomisation: this might break some of the existing randomisation. Can we make use of the ways to build multiple queues for the same slab?
- Can we disable this feature via a feature flag and constexpr/conditional_t, so we can analyse performance more in the future?
Just rebasing after #659 landed. Todo-s remain to be addressed.
And the novelty of [[no_unique_address]] continues to sting. Hm.
Start addressing to-dos, specifically being able to turn BatchIt off: rewrite history to have always had a RemoteDeallocCacheBatching structure that encapsulates the client-side logic. We can pair this with the current RemoteMessage structure, then add parallel non-batching implementations of the client-side logic and the RemoteMessage internals.
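As a rough illustration of that shape (not the actual patch), the client-side type can be selected at compile time with std::conditional_t. Only the name `RemoteDeallocCacheBatching` comes from the description above; `RemoteDeallocCacheNoBatching`, `RemoteDeallocCacheFor`, `ring_limit` and the counters are hypothetical, and the bodies simply count messages instead of sending anything.

```cpp
#include <cstddef>
#include <type_traits>

// Stand-in for the batching client-side logic: collects objects into an
// open ring and sends one message per full ring.
struct RemoteDeallocCacheBatching
{
  static constexpr std::size_t ring_limit = 4; // hypothetical batch size
  std::size_t pending = 0;
  std::size_t messages_sent = 0;

  void dealloc(void*)
  {
    if (++pending == ring_limit)
      flush();
  }

  void flush()
  {
    if (pending != 0)
    {
      messages_sent++;
      pending = 0;
    }
  }
};

// Hypothetical non-batching counterpart: one message per object.
struct RemoteDeallocCacheNoBatching
{
  std::size_t messages_sent = 0;

  void dealloc(void*)
  {
    messages_sent++;
  }

  void flush() {}
};

// The feature flag picks one implementation; the unused one costs nothing.
template<bool Batch>
using RemoteDeallocCacheFor = std::conditional_t<
  Batch,
  RemoteDeallocCacheBatching,
  RemoteDeallocCacheNoBatching>;

static_assert(
  std::is_same<RemoteDeallocCacheFor<false>, RemoteDeallocCacheNoBatching>::value,
  "flag off selects the non-batching implementation");
```

Keeping both implementations behind the same `dealloc`/`flush` interface means callers do not need to know which one the flag selected, which should make later performance comparisons easier.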
OK, well, apparently MAX_CAPACITY_BITS needs to be at least 17 - 4 = 13, were it to be universal: 64-bit cross-builds under qemu use a MIN_CHUNK_SIZE of 17 and a smallest sizeclass of 16 bytes.
But that breaks all the 32-bit builds, because MAX_SMALL_SIZECLASS_BITS is 16 and MIN_OBJECT_COUNT is sometimes 13, and 16 + ceil(log_2(13)) + 13 is 33.
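The arithmetic in the last two comments can be checked at compile time. This sketch only reproduces the numbers quoted above; it assumes the "17" is a bit count (a chunk of 2^17 bytes), and the lower-case constant names are illustrative stand-ins rather than snmalloc's definitions.

```cpp
#include <cstddef>

// Smallest b such that 2^b >= n.
constexpr std::size_t ceil_log2(std::size_t n)
{
  std::size_t bits = 0;
  while ((std::size_t{1} << bits) < n)
    ++bits;
  return bits;
}

// 64-bit qemu cross-builds: chunk of 2^17 bytes (assumed reading of "17"),
// smallest sizeclass of 16 = 2^4 bytes.
constexpr std::size_t min_chunk_bits = 17;
constexpr std::size_t smallest_sizeclass_bits = ceil_log2(16);
constexpr std::size_t capacity_bits = min_chunk_bits - smallest_sizeclass_bits;
static_assert(capacity_bits == 13, "17 - 4 = 13");

// 32-bit builds: the three fields no longer fit in a 32-bit word.
constexpr std::size_t max_small_sizeclass_bits = 16;
constexpr std::size_t min_object_count = 13;
constexpr std::size_t total_bits =
  max_small_sizeclass_bits + ceil_log2(min_object_count) + capacity_bits;
static_assert(total_bits == 33, "16 + 4 + 13 = 33");
static_assert(total_bits > 32, "one bit too many for a 32-bit word");
```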
I'm not sure why some other Windows builds are still broken.