
Changes BytesMut not to rely on Vec + Arc

Open Matthias247 opened this issue 5 years ago • 10 comments

This changes the internal BytesMut representation to allocate memory on its own, and not to use Arc + Vec under the hood.

Doing this makes it possible to have a single representation which is shareable (by modifying an embedded refcount) with zero cost right from the start. No representation changes have to be performed over the lifetime of BytesMut; therefore BytesMut::freeze() is now also free.
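A minimal sketch of the idea (the names and layout here are hypothetical, not the actual bytes internals): the refcount lives in a header allocated together with the data, so a unique BytesMut and a shared Bytes use the same layout, and freezing needs no reallocation.

```rust
use std::mem::size_of;
use std::sync::atomic::AtomicUsize;

// Hypothetical header allocated in front of the data bytes. Because the
// refcount is embedded from the start, converting a unique buffer into a
// shared one only touches this counter; no representation change is needed.
#[repr(C)]
struct Shared {
    refcount: AtomicUsize,
    capacity: usize,
    // data bytes follow the header in the same allocation
}

// Total allocation size for a buffer with `requested` usable bytes.
fn allocation_size(requested: usize) -> usize {
    size_of::<Shared>() + requested
}

fn main() {
    // On a 64-bit target the header adds two words (16 bytes).
    println!("{}", allocation_size(4096));
}
```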

The change also opens the door for adding support for buffer pools or custom allocators, since an allocator is now called directly. Adding this support would mostly mean passing an allocator to BytesMut, which is then called instead of a fixed function. That would, however, be a separate change.
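As a rough sketch of what that follow-up could look like (all names here are hypothetical; none of them exist in the bytes crate), the allocator becomes an injected dependency with the global allocator as the default:

```rust
// Hypothetical sketch of "pass an allocator to BytesMut"; simplified to
// Vec-backed storage instead of raw memory for brevity.
trait BufAllocator {
    fn allocate(&self, size: usize) -> Vec<u8>;
}

// Default implementation: just use the global allocator via Vec.
struct Global;

impl BufAllocator for Global {
    fn allocate(&self, size: usize) -> Vec<u8> {
        Vec::with_capacity(size)
    }
}

// A BytesMut-like type that defers allocation to the injected allocator,
// e.g. a buffer pool could implement BufAllocator instead of Global.
struct PooledBytesMut<A: BufAllocator> {
    alloc: A,
    buf: Vec<u8>,
}

impl<A: BufAllocator> PooledBytesMut<A> {
    fn with_capacity(alloc: A, size: usize) -> Self {
        let buf = alloc.allocate(size);
        PooledBytesMut { alloc, buf }
    }
}

fn main() {
    let b = PooledBytesMut::with_capacity(Global, 128);
    println!("{}", b.buf.capacity());
}
```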

Fixes #401

Matthias247 avatar Oct 18 '20 04:10 Matthias247

cc/ @seanmonstar who did most of this work most recently.

carllerche avatar Oct 18 '20 16:10 carllerche

I'm now rather happy with this, so I removed the WIP.

I added some more documentation around the memory allocation strategy - based on my understanding of previous versions - and questions on what we should do there. I'm somewhat inclined to drop the behavior of taking the capacity of an old backing storage into account for reservations - it just seems like a footgun leading to high memory usage in case people split BytesMut instances and work on them individually later.

Matthias247 avatar Oct 18 '20 18:10 Matthias247

Not sure what is going on with Loom. I think it doesn't like my use of alloc::alloc::alloc (direct use of the global allocator), but I don't know how to fix it.
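One common workaround (sketched here with hypothetical helper names; this is not a confirmed fix for the failure above) is to route raw allocation through a single cfg-gated function, so loom builds can substitute a shim instead of the shared code calling the global allocator directly:

```rust
use std::alloc::Layout;

// Funnel all raw allocation through one helper. A `#[cfg(loom)]` twin of
// these functions could then redirect to whatever loom's model can track,
// while normal builds hit the global allocator as before.
#[cfg(not(loom))]
fn buf_alloc(layout: Layout) -> *mut u8 {
    // Safety: callers must pass a non-zero-sized layout.
    unsafe { std::alloc::alloc(layout) }
}

#[cfg(not(loom))]
fn buf_dealloc(ptr: *mut u8, layout: Layout) {
    // Safety: `ptr` must come from `buf_alloc` with the same layout.
    unsafe { std::alloc::dealloc(ptr, layout) }
}

fn main() {
    let layout = Layout::from_size_align(64, 8).unwrap();
    let ptr = buf_alloc(layout);
    assert!(!ptr.is_null());
    buf_dealloc(ptr, layout);
}
```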

Matthias247 avatar Oct 18 '20 19:10 Matthias247

cargo bench comparison:

name                      old ns/iter      new ns/iter      diff ns/iter   diff %  speedup 
 alloc_big                 52               53                          1    1.92%   x 0.98
 alloc_mid                 53               51                         -2   -3.77%   x 1.04
 alloc_small               53,018           51,740                 -1,278   -2.41%   x 1.02
 alloc_write_split_to_mid  104              67                        -37  -35.58%   x 1.55
 bytes_mut_extend          568 (225 MB/s)   538 (237 MB/s)            -30   -5.28%   x 1.06 
 clone_frozen              9,711            9,785                      74    0.76%   x 0.99
 deref_shared              242              241                        -1   -0.41%   x 1.00
 deref_two                 237              235                        -2   -0.84%   x 1.01
 deref_unique              239              240                         1    0.42%   x 1.00
 deref_unique_unroll       257              258                         1    0.39%   x 1.00
 drain_write_drain         298              228                       -70  -23.49%   x 1.31 
 fmt_write                 14 (2642 MB/s)   14 (2642 MB/s)              0    0.00%   x 1.00
 put_slice_bytes_mut       15 (8533 MB/s)   14 (9142 MB/s)             -1   -6.67%   x 1.07
 put_slice_vec             16 (8000 MB/s)   15 (8533 MB/s)             -1   -6.25%   x 1.07
 put_slice_vec_extend      10 (12800 MB/s)  10 (12800 MB/s)             0    0.00%   x 1.00
 put_u8_bytes_mut          508 (251 MB/s)   454 (281 MB/s)            -54  -10.63%   x 1.12
 put_u8_vec                538 (237 MB/s)   513 (249 MB/s)            -25   -4.65%   x 1.05
 put_u8_vec_push           128 (1000 MB/s)  127 (1007 MB/s)            -1   -0.78%   x 1.01

Matthias247 avatar Oct 18 '20 21:10 Matthias247

It looks like you are inlining the capacity with the data? Given that buffers are often allocated to be page length, would this not result in odd allocation patterns? The size would often be size_of::<Shared>() + PAGE.

carllerche avatar Oct 20 '20 16:10 carllerche

Looking through this, I believe this should be able to land w/o a breaking change. I don't think we need to block 0.6 on this.

edit: fixed the version number.

carllerche avatar Oct 20 '20 16:10 carllerche

It looks like you are inlining the capacity with the data? Given that buffers are often allocated to be page length, would this not result in odd allocation patterns? The size would often be size_of::<Shared>() + PAGE.

Are you aware of any applications which do that? I don't think you can currently rely on that, since your system memory allocator might also not cooperate - e.g. it might need to store its metadata inside an allocation header and thereby take size away from the usable region. But it's true that buffer sizes which are multiples of 4k are often programmers' favorites.

What we could do is round allocations up to page size, so that the capacity isn't lost. But that might require adding a with_exact_capacity method for use-cases which don't want that.
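The rounding arithmetic would look something like this (a sketch only; PAGE and the header size are assumptions, and the real size_of::<Shared>() depends on the final layout):

```rust
const PAGE: usize = 4096;
// Assumed header size: two words on a 64-bit target.
const HEADER: usize = 16;

// Round the whole allocation (header + data) up to a page multiple and
// return the usable capacity. The slack becomes extra capacity for the
// caller instead of being wasted inside the allocator's size class.
fn rounded_capacity(requested: usize) -> usize {
    let total = HEADER + requested;
    let rounded = (total + PAGE - 1) / PAGE * PAGE;
    rounded - HEADER
}

fn main() {
    // Asking for exactly one page spills into a second page because of the
    // header, so the caller gets the remainder of that page as capacity.
    println!("{}", rounded_capacity(4096));
}
```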

Matthias247 avatar Oct 20 '20 22:10 Matthias247

regarding

It looks like you are inlining the capacity with the data? Given that buffers are often allocated to be page length, would this not result in odd allocation patterns? The size would often be size_of::<Shared>() + PAGE.

I confirmed that e.g. mimalloc optimizes for fixed-size chunks, and adding Shared might mean getting pushed into the next chunk size.

Based on this I see 2 ways forward for this:

  1. Don't care about it in the normal case, and add more optimized APIs for people that care about the difference. E.g. BytesMut::with_exact_capacity(size) would return exactly what the user requires (even if it is not optimal), while BytesMut::with_optimized_capacity(size) could either round up (to the next multiple of 2) or down (subtract size_of::<Shared>()).
  2. Add another pointer to Bytes itself which tracks the shared state separately from the data. The benefit is that this could also eliminate the special casing for Vec in Bytes, since the field can be used to store the capacity. The downsides, however, are:
  • 2 allocations for BytesMut instead of 1.
  • 2 times the pointer chasing on resizes/frees/etc.

I guess for the use-case of "have a small serialization buffer and continuously add stuff", approach 2) will end up slower than 1) (and than this CR). However, it is more deterministic for managing large chunks of data.
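The two layouts can be contrasted in a small sketch (both structs are hypothetical illustrations, not the actual bytes types):

```rust
use std::mem::size_of;
use std::sync::atomic::AtomicUsize;

// Option 1 (this PR): refcount and capacity inlined in front of the data;
// one allocation total, but the allocation grows by the header size.
#[repr(C)]
struct InlineShared {
    refcount: AtomicUsize,
    capacity: usize,
    // data follows in the same allocation
}

// Option 2 (hypothetical): shared state lives in its own allocation and
// points at a separately allocated buffer, so the data allocation keeps
// exactly the requested size, at the cost of a second allocation and an
// extra pointer hop on clone/resize/free.
struct SeparateShared {
    refcount: AtomicUsize,
    buf: Vec<u8>, // stands in for (ptr, capacity) of the data allocation
}

fn main() {
    println!("{} {}", size_of::<InlineShared>(), size_of::<SeparateShared>());
}
```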

Matthias247 avatar Jan 04 '21 19:01 Matthias247

Do we know why we can't entirely hide the Arc / Vec implementation behind the vtable? IIRC it is due to wanting to avoid an allocation when converting from Vec, and needing to track the pointer to the head of the memory as well as the original capacity? Is there another reason that I am forgetting?

carllerche avatar Jan 04 '21 21:01 carllerche

Would this also allow for free Bytes::try_mut()? #368

qm3ster avatar Feb 01 '22 18:02 qm3ster