Changes BytesMut not to rely on Vec + Arc
This changes the internal BytesMut representation to allocate memory on its own, and not to use Arc + Vec under the hood.
Doing this allows having a single representation which is shareable (by being able to modify an embedded refcount) at zero cost right from the start. No representation changes have to be performed over the lifetime of BytesMut. Therefore BytesMut::freeze() is now also free.
The change also opens the door for adding support for buffer pools or custom allocators, since an allocator is directly called. Adding this support would mostly mean having to pass an allocator to BytesMut, which is then called instead of a fixed function. That would however be a separate change.
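To illustrate the single-representation idea, here is a hypothetical sketch of such a layout — not the crate's actual internals: one allocation holds a header with an embedded refcount, followed directly by the data bytes, so sharing (and `freeze()`) only has to bump the refcount.

```rust
use std::mem::size_of;
use std::sync::atomic::AtomicUsize;

// Hypothetical layout sketch, not the crate's real internals: a single
// allocation holds this header, with the data bytes placed right after it.
#[allow(dead_code)]
#[repr(C)]
struct Shared {
    ref_count: AtomicUsize, // embedded refcount: sharing needs no realloc
    capacity: usize,        // size of the data region trailing the header
    // data bytes follow the header in the same allocation
}

// Total bytes requested from the allocator for `capacity` bytes of data.
fn allocation_size(capacity: usize) -> usize {
    size_of::<Shared>() + capacity
}

fn main() {
    println!("{}", allocation_size(4096));
}
```

This layout is what makes the "odd allocation patterns" question below relevant: the header is inlined in front of the data, so the total allocation is always header size plus capacity.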
Fixes #401
cc/ @seanmonstar who did most of this work most recently.
I'm now rather happy with this, so I removed the WIP.
I added some more documentation around the memory allocation strategy - based on my understanding of previous versions - and questions on what we should do there.
I'm somewhat inclined to drop the behavior of taking the capacity of an old backing storage into account for reservations - it just seems like a footgun leading to high memory usage in case people split BytesMut instances and work on them individually later on.
Not sure what is going on with Loom. I think it doesn't like my use of `alloc::alloc::alloc` (direct use of the global allocator), but I don't know how to fix it.
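For context, the pattern in question looks roughly like this (a minimal sketch, not the PR's actual code): going straight to the global allocator with `alloc`/`dealloc`, which Loom's instrumented mock of std has no model for.

```rust
use std::alloc::{alloc, dealloc, Layout};

// Minimal sketch of direct global-allocator use: allocate a raw buffer,
// write a byte, read it back, and free the buffer again.
fn roundtrip_byte(value: u8) -> u8 {
    let layout = Layout::from_size_align(64, 8).expect("valid layout");
    unsafe {
        let ptr = alloc(layout);
        assert!(!ptr.is_null(), "allocation failed");
        ptr.write(value);
        let read_back = *ptr;
        dealloc(ptr, layout);
        read_back
    }
}

fn main() {
    println!("{}", roundtrip_byte(42));
}
```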
cargo bench comparison:
| name | old ns/iter | new ns/iter | diff ns/iter | diff % | speedup |
|------|-------------|-------------|--------------|--------|---------|
| alloc_big | 52 | 53 | 1 | 1.92% | x 0.98 |
| alloc_mid | 53 | 51 | -2 | -3.77% | x 1.04 |
| alloc_small | 53,018 | 51,740 | -1,278 | -2.41% | x 1.02 |
| alloc_write_split_to_mid | 104 | 67 | -37 | -35.58% | x 1.55 |
| bytes_mut_extend | 568 (225 MB/s) | 538 (237 MB/s) | -30 | -5.28% | x 1.06 |
| clone_frozen | 9,711 | 9,785 | 74 | 0.76% | x 0.99 |
| deref_shared | 242 | 241 | -1 | -0.41% | x 1.00 |
| deref_two | 237 | 235 | -2 | -0.84% | x 1.01 |
| deref_unique | 239 | 240 | 1 | 0.42% | x 1.00 |
| deref_unique_unroll | 257 | 258 | 1 | 0.39% | x 1.00 |
| drain_write_drain | 298 | 228 | -70 | -23.49% | x 1.31 |
| fmt_write | 14 (2642 MB/s) | 14 (2642 MB/s) | 0 | 0.00% | x 1.00 |
| put_slice_bytes_mut | 15 (8533 MB/s) | 14 (9142 MB/s) | -1 | -6.67% | x 1.07 |
| put_slice_vec | 16 (8000 MB/s) | 15 (8533 MB/s) | -1 | -6.25% | x 1.07 |
| put_slice_vec_extend | 10 (12800 MB/s) | 10 (12800 MB/s) | 0 | 0.00% | x 1.00 |
| put_u8_bytes_mut | 508 (251 MB/s) | 454 (281 MB/s) | -54 | -10.63% | x 1.12 |
| put_u8_vec | 538 (237 MB/s) | 513 (249 MB/s) | -25 | -4.65% | x 1.05 |
| put_u8_vec_push | 128 (1000 MB/s) | 127 (1007 MB/s) | -1 | -0.78% | x 1.01 |
It looks like you are inlining the capacity with the data? Given that buffers are often allocated to be page length, would this not result in odd allocation patterns? The size would often be `size_of::<Shared>() + PAGE`.
Looking through this, I believe this should be able to land w/o a breaking change. I don't think we need to block 0.6 on this.
edit: fixed the version number.
> It looks like you are inlining the capacity with the data? Given that buffers are often allocated to be page length, would this not result in odd allocation patterns? The size would often be `size_of::<Shared>() + PAGE`.
Are you aware of any applications which do that? I don't think you can currently rely on that, since your system memory allocator might also not do the same - e.g. it might also need to store its metadata inside the allocation header and therefore take size away from it. But it's true that buffer sizes which are multiples of 4k are often programmers' favorites.
What we could do is maybe round up allocations to page size, so that the capacity isn't lost? But that might require adding a with_exact_capacity method for use-cases which don't want that.
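The round-up idea could be sketched as size arithmetic like the following; `PAGE` and `HEADER` are illustrative stand-in constants, not the crate's actual values.

```rust
// Illustrative constants, not the crate's real values.
const PAGE: usize = 4096;
const HEADER: usize = 16; // stand-in for size_of::<Shared>()

// Round the total allocation (header + data) up to a page multiple and
// hand the whole remainder back to the caller as usable capacity.
fn rounded_capacity(requested: usize) -> usize {
    let total = HEADER + requested;
    let rounded = (total + PAGE - 1) / PAGE * PAGE;
    rounded - HEADER
}

fn main() {
    // a 4 KiB request spills into a second page, so the caller gets the
    // rest of that page as extra capacity instead of it being wasted
    println!("{}", rounded_capacity(4096));
}
```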
Regarding:

> It looks like you are inlining the capacity with the data? Given that buffers are often allocated to be page length, would this not result in odd allocation patterns? The size would often be `size_of::<Shared>() + PAGE`.
I confirmed that e.g. mimalloc will optimize fixed-size chunks, and adding `Shared` might mean getting pushed to the next chunk size.
Based on this I see 2 ways forward for this:
- Don't care about it in the normal case, and add more optimized APIs for people that care about the differences. E.g. `BytesMut::with_exact_capacity(size)` would return exactly what the user requires (even if it is not optimal), while `BytesMut::with_optimized_capacity(size)` could either round up (to the next multiple of 2) or down (subtract `size_of::<Shared>()`).
- Add another pointer to `Bytes` itself which tracks the shared state separately from the data. The benefit of this is that it could also eliminate the special casing for `Vec` in `Bytes`, since the field can be used to store the capacity. The downsides are however:
  - 2 allocations for `BytesMut` instead of 1.
  - 2 times the pointer chasing on resizes/frees/etc.
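The size arithmetic behind option 1's two hypothetical constructors might look like this; the names come from the proposal above and the constants are illustrative, none of this is an existing API.

```rust
// Illustrative constant, not the crate's real value.
const HEADER: usize = 16; // stand-in for size_of::<Shared>()

// with_exact_capacity(size): allocate HEADER + size, even if that total
// then straddles an allocator size class.
fn exact_allocation(size: usize) -> usize {
    HEADER + size
}

// with_optimized_capacity(size), round-down variant: keep the total
// allocation at exactly the requested chunk size...
fn optimized_allocation(size: usize) -> usize {
    size
}

// ...which means the usable capacity shrinks by the header size.
fn optimized_capacity(size: usize) -> usize {
    size.saturating_sub(HEADER)
}

fn main() {
    // for a 4 KiB request: exact needs 4112 bytes total, optimized stays
    // within 4096 but only exposes 4080 bytes of capacity
    println!("{} {}", exact_allocation(4096), optimized_capacity(4096));
}
```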
I guess for the use-case of "have a small serialization buffer and continuously add stuff", approach 2) will end up slower than 1) (and this CR). However, it's more deterministic for managing large chunks of data.
Do we know why we can't entirely hide the Arc / Vec implementation behind the vtable? IIRC it is due to wanting to avoid an allocation when converting Vec, and needing to track the pointer to the head of the memory as well as the original capacity? Is there another reason that I am forgetting?
Would this also allow for free Bytes::try_mut()? #368