ringbuffer icon indicating copy to clipboard operation
ringbuffer copied to clipboard

Increse copy speed by orders of magnitude

Open cdellacqua opened this issue 10 months ago • 5 comments

Abstract

When T implements Copy, we can use the std/core method copy_from_slice to offload the data transfer to very optimized and potentially platform-specific functions.

Backstory

I was troubleshooting some code that deals with a huge ringbuffer (1mln f32s), where the most common operation is copying the last 2 thousand elements.

After some profiling, I found that the slowest operation was just skipping and iterating over the elements that I needed to copy out of the buffer.

I experimented with the built-in copy_from_slice, which under the hood calls memcpy, and I got these results:

baseline memcpy
debug ~30ms ~5μs
release ~1ms ~2μs

The baseline consists of using this:

let mut out = Vec::with_capacity(2000);
out.extend(ringbuffer.iter().skip(tons_of_items_to_skip).take(2000).copied());

While the code changes in this PR allow doing this:

let mut out = vec[0; 2000];
ringbuffer.copy_to_slice(tons_of_items_to_skip, &mut out); // one or two memcpy depending on the readptr position

The results are less impressive when working on the entire buffer, but still noticeable (benchmarks below).

Proposed solution

I've added two methods: copy_from_slice and copy_to_slice to the RingBuffer trait.

How it works

For ConstGeneric and Alloc buffers, copy_from_slice works by taking the pointer to the first relevant byte of the ringbuffer. It then checks whether the &slice fits a contiguous region of memory. If it does, then a single copy operation is performed. If it doesn't, the copy is split into the two halves.

copy_to_slice works the same way but inverting the destination and source slices.

VecDequeue has a simpler (and safe) implementation based on the built-in methods as_slices()/as_slices_mut().

Benchmark

I've added some tests and run them with criterion. Here are some relevant results:

copy_to_slice vs extend on a pre-allocated Vec with 1_000_000 elements

image image

copy_to_slice vs extend on a pre-allocated Vec with 16 elements

image image

I made sure to pre-allocate everything and, assuming I did it correctly, the speed-up looks quite substantial!

On this note, I added an unsafe set_len method to ConstGeneric and Alloc ring buffers that mimics what Vec::set_len does. It provides a nice way to "empty" a buffer of primitives by simply moving the buffer writeptr, without incurring the penalty of iterating over all the elements to call Drop::drop. Just like Vec::set_len this method can leak, as stated in the doc comment.

cdellacqua avatar Jan 31 '25 16:01 cdellacqua

Seems like reasonable changes, also well tested and I like the perf bonus. lemme quickly approve CI and see if miri still passes but then it's ok by me. @NULLx76 ?

jdonszelmann avatar Feb 03 '25 17:02 jdonszelmann

@cdellacqua I'm afraid, however nice, it fails miri. Want to take a look at it?

jdonszelmann avatar Feb 03 '25 17:02 jdonszelmann

coverage is no prob for now, that's just a deprecated line in the ci workflow definition

jdonszelmann avatar Feb 03 '25 17:02 jdonszelmann

Good catch!

I previously used the first element in the buffer (accessible by get_unchecked) to retrieve a pointer to the base. With the latest commit I added a dedicated function that doesn't index into the const generic buffer, avoiding the retagging issue

cdellacqua avatar Feb 03 '25 18:02 cdellacqua

I've been experimenting with an extend_from_slice and a drain_to_slice method. They're based on the same concept of copy_from_slice and copy_to_slice, except they move the read and write pointers accordingly. Benchmarks show similar improvements.

Would you be interested in this feature too? In that case, let me know if you'd like to slightly expand the scope of this PR or if you'd rather have a separate PR

cdellacqua avatar Feb 07 '25 15:02 cdellacqua