Increase copy speed by orders of magnitude
Abstract
When T implements Copy, we can use the std/core method copy_from_slice to offload the data transfer to highly optimized, potentially platform-specific routines.
Backstory
I was troubleshooting some code that deals with a huge ring buffer (1 million f32s), where the most common operation is copying out the last 2,000 elements.
After some profiling, I found that the slowest operation was just skipping and iterating over the elements that I needed to copy out of the buffer.
I experimented with the built-in copy_from_slice, which under the hood calls memcpy, and I got these results:
| | baseline | memcpy |
|---|---|---|
| debug | ~30ms | ~5μs |
| release | ~1ms | ~2μs |
The baseline consists of using this:
```rust
let mut out = Vec::with_capacity(2000);
out.extend(ringbuffer.iter().skip(tons_of_items_to_skip).take(2000).copied());
```
The code changes in this PR allow doing this instead:
```rust
let mut out = vec![0f32; 2000];
ringbuffer.copy_to_slice(tons_of_items_to_skip, &mut out); // one or two memcpy calls depending on the readptr position
```
The results are less impressive when working on the entire buffer, but still noticeable (benchmarks below).
Proposed solution
I've added two methods, copy_from_slice and copy_to_slice, to the RingBuffer trait.
How it works
For the ConstGeneric and Alloc buffers, copy_from_slice works by taking a pointer to the first relevant byte of the ring buffer. It then checks whether the source slice fits in a contiguous region of memory: if it does, a single copy operation is performed; if it doesn't, the copy is split into two halves.
copy_to_slice works the same way, but with the source and destination slices swapped.
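A minimal, safe sketch of the one-or-two-copies idea (the names buf, readptr, and skip are illustrative, and the actual implementation works on raw pointers rather than safe slices):

```rust
// Illustrative sketch only: assumes the backing storage is full and indices
// are in range. The real code uses unsafe pointer copies instead.
fn copy_to_slice_sketch<T: Copy>(buf: &[T], readptr: usize, skip: usize, out: &mut [T]) {
    let capacity = buf.len();
    let start = (readptr + skip) % capacity;
    let until_end = capacity - start;
    let total = out.len();
    if total <= until_end {
        // The requested range is contiguous in memory: one copy suffices.
        out.copy_from_slice(&buf[start..start + total]);
    } else {
        // The range wraps around the end of the backing storage:
        // copy the tail first, then the wrapped-around head.
        let (first, second) = out.split_at_mut(until_end);
        first.copy_from_slice(&buf[start..]);
        second.copy_from_slice(&buf[..total - until_end]);
    }
}
```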
VecDeque has a simpler (and safe) implementation based on the built-in methods as_slices()/as_mut_slices().
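Roughly, the VecDeque-backed version boils down to something like this minimal sketch (a free function with made-up names, not the actual trait method):

```rust
use std::collections::VecDeque;

// Sketch of the safe approach: as_slices() already exposes the (possibly
// wrapped) contents as two contiguous slices, so no pointer math is needed.
fn copy_to_slice_vecdeque<T: Copy>(deque: &VecDeque<T>, skip: usize, out: &mut [T]) {
    let (front, back) = deque.as_slices();
    let total = out.len();
    if skip + total <= front.len() {
        // Everything we need is in the first slice.
        out.copy_from_slice(&front[skip..skip + total]);
    } else if skip >= front.len() {
        // Everything we need is in the second slice.
        let start = skip - front.len();
        out.copy_from_slice(&back[start..start + total]);
    } else {
        // The requested range straddles both slices.
        let from_front = front.len() - skip;
        out[..from_front].copy_from_slice(&front[skip..]);
        out[from_front..].copy_from_slice(&back[..total - from_front]);
    }
}
```

copy_from_slice goes the other way and uses as_mut_slices() on the destination side.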
Benchmark
I've added some tests and ran them with Criterion. Here are some relevant results:
copy_to_slice vs extend on a pre-allocated Vec with 1_000_000 elements
copy_to_slice vs extend on a pre-allocated Vec with 16 elements
I made sure to pre-allocate everything and, assuming I did it correctly, the speed-up looks quite substantial!
On this note, I added an unsafe set_len method to ConstGeneric and Alloc ring buffers that mimics what Vec::set_len does. It provides a nice way to "empty" a buffer of primitives by simply moving the buffer writeptr, without incurring the penalty of iterating over all the elements to call Drop::drop. Just like Vec::set_len this method can leak, as stated in the doc comment.
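A minimal usage sketch, assuming an AllocRingBuffer and a Vec-like set_len(usize) signature:

```rust
use ringbuffer::AllocRingBuffer;

fn clear_without_dropping(rb: &mut AllocRingBuffer<f32>) {
    // SAFETY: f32 is a primitive with no Drop impl, so skipping the
    // per-element drops is fine. Like Vec::set_len, this only moves the
    // write pointer; types with a non-trivial Drop would be leaked instead.
    unsafe { rb.set_len(0) };
}
```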
Seems like reasonable changes, also well tested, and I like the perf bonus. Lemme quickly approve CI and see if Miri still passes, but then it's OK by me. @NULLx76?
@cdellacqua I'm afraid that, however nice, it fails Miri. Want to take a look at it?
Coverage is no problem for now; that's just a deprecated line in the CI workflow definition.
Good catch!
I previously used the first element in the buffer (accessible via get_unchecked) to retrieve a pointer to its base. With the latest commit I added a dedicated function that doesn't index into the const generic buffer, avoiding the retagging issue.
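For context, the same provenance issue reproduced on a plain array (a minimal standalone sketch, not the crate's actual code):

```rust
// Illustration of the Stacked Borrows problem Miri flagged, not the crate's code.
fn last_two(buf: &[f32; 8]) -> (f32, f32) {
    unsafe {
        // Problematic (the old pattern, in spirit): a pointer derived from a
        // reference to element 0 is retagged with provenance for that single
        // element, so offsetting it to other elements is rejected by Miri.
        // let base = buf.get_unchecked(0) as *const f32;

        // Fine: a pointer obtained from the array/slice itself keeps
        // provenance over the whole allocation.
        let base = buf.as_ptr();
        (*base.add(6), *base.add(7))
    }
}
```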
I've been experimenting with an extend_from_slice and a drain_to_slice method. They're based on the same concept as copy_from_slice and copy_to_slice, except they also move the read and write pointers accordingly. Benchmarks show similar improvements.
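Roughly the intended usage, as a minimal sketch (assuming an AllocRingBuffer and plain slice arguments; names and signatures are not final):

```rust
use ringbuffer::AllocRingBuffer;

// Hypothetical usage of the experimental methods described above.
fn roundtrip(rb: &mut AllocRingBuffer<f32>, samples: &[f32], out: &mut [f32]) {
    rb.extend_from_slice(samples); // bulk write: advances the write pointer
    rb.drain_to_slice(out);        // bulk read: advances the read pointer
}
```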
Would you be interested in this feature too? In that case, let me know if you'd like to slightly expand the scope of this PR or if you'd rather have a separate PR.