
aya: Implement RingBuf

ishitatsuyuki opened this pull request 3 years ago • 23 comments

Based on @willfindlay's branch, fixed the bugs I found and it appears to be working fine for my setup.

The original branch took a callback as a part of the struct, but since this requires 'static lifetime it could be annoying. Instead, take it as the process_ring argument so it only needs to outlive the process call.

Branch based on #290.

Closes: #12 Closes: #201

ishitatsuyuki avatar Jun 01 '22 09:06 ishitatsuyuki

Deploy Preview for aya-rs-docs ready!

Built without sensitive environment variables

Name Link
Latest commit 52b4f44e3e8a1c24ff2e4f6345469c1992e16192
Latest deploy log https://app.netlify.com/sites/aya-rs-docs/deploys/6366735d9368030008590b3e
Deploy Preview https://deploy-preview-294--aya-rs-docs.netlify.app

netlify[bot] avatar Jun 01 '22 09:06 netlify[bot]

Doing async with tokio is a little bit more verbose, but I'm doing it like this:

    // `ring` here is assumed to be a tokio AsyncFd wrapping the ring buffer's fd
    loop {
        let mut guard = ring.readable_mut().await?;
        guard.get_inner_mut().process_ring(&mut |e| {
            // Do things with the event `e`
            Ok(())
        })?;
        // Clear readiness so the next readable_mut() actually polls epoll again
        guard.clear_ready();
    }

ishitatsuyuki avatar Jun 01 '22 09:06 ishitatsuyuki

Looks like this can get stuck with epoll (miss the notification somehow). I'll investigate.

ishitatsuyuki avatar Jun 01 '22 13:06 ishitatsuyuki

Throwing SeqCst at the consumer pos store fixes this.

unsafe { (*self.consumer_pos_ptr).store(consumer_pos, Ordering::SeqCst) };

I'm not very good at reasoning about concurrency, but it's possible that the libbpf implementation has bogus synchronization. Cilium (Go) defaults to SeqCst, so it won't see the same issue.

EDIT: it might also be due to tokio migrating the read function across threads, while libbpf essentially assumes single-threaded polling.

ishitatsuyuki avatar Jun 01 '22 14:06 ishitatsuyuki

Thanks for your review.

I think that in the async version of this code we should, instead of using a callback, use the tx side of an mpsc channel.

Using mpsc necessitates a copy — ring processing speed isn't really the concern here, as the application has the freedom to size the ring buffer. (Also mpsc means double ring buffering — not sure that's on point.)

The only other zero-copy option would be returning a guard object that updates consumer pointer on Drop — but that's not much better I'd say.

ishitatsuyuki avatar Jun 01 '22 14:06 ishitatsuyuki

Thanks for taking this over!

willfindlay avatar Jun 01 '22 23:06 willfindlay

Re: the "stuck" polling issue, it's an actual upstream flaw. The ringbuf claims to be designed around acquire-release ordering, but it actually relies implicitly on sequential consistency on the kernel side, while the userspace counterpart is simply busted. Consider the following simplified example, where the "reserve" and "commit" operations are fused and simply increment the producer counter. A notification is sent if and only if the incremented producer counter == consumer counter + 1.

    kernel: write p 2 -> read c X -> write p 3 -> read c 1 (X doesn't matter)
    user  : write c 2 -> read p 2

With sequential consistency it's guaranteed that once p is updated from 2 to 3, c == 1 cannot be observed: the write of c == 2 precedes the read of p == 2, which must happen before the write of p == 3, so the kernel's later read of c must see 2. With acq-rel, however, there's no meaningful causal relation here at all: such relations are only established on reads, and neither the read of c == 1 nor the read of p == 2 establishes any ordering beyond the operations that happened before the first operation in the diagram above.

And what this means is that we need a SeqCst fence after the write on both sides. For the userspace, we use a SeqCst store; the kernel side might seem doomed, but luckily it actually does an xchg operation to update the BUSY flag, which implicitly serves as a full SeqCst barrier. Therefore the kernel side doesn't actually need to be changed, though that is rather unintentional, since the comments only suggest an acq-rel relationship, which is insufficient.

ishitatsuyuki avatar Jun 02 '22 03:06 ishitatsuyuki

A bunch of updates: documentation for aya-bpf, a reserve/commit API that is actually safe, fixes addressing the review comments about process_ring error handling, plus the fix (with a small explanation) for the awful coherence issue that I spent days trying to find the right solution for. I'll open a separate PR for the codegen soon.

ishitatsuyuki avatar Jun 03 '22 09:06 ishitatsuyuki

Updated RingBufEntry to consume self on discard/submit.

ishitatsuyuki avatar Jun 03 '22 10:06 ishitatsuyuki

Suppressed a clippy lint, see code for details.

ishitatsuyuki avatar Jun 03 '22 10:06 ishitatsuyuki

@ishitatsuyuki codegen changes are in main so feel free to rebase to pick those up when you're ready. will take a look at the updates and see if I've got any review comments to add, but I think we're waiting on @alessandrod to take a look (at this and #290) before this is ok to merge.

dave-tucker avatar Jun 03 '22 10:06 dave-tucker

Using mpsc necessitates a copy — ring processing speed isn't really the concern here, as the application has the freedom to size the ring buffer. (Also mpsc means double ring buffering — not sure that's on point.)

The only other zero-copy option would be returning a guard object that updates consumer pointer on Drop — but that's not much better I'd say.

Having thought on this more, I agree - you're right. Different applications are going to have different requirements, some of which can be met by sizing the ringbuf, data notification control, etc. Either way, the API as it stands seems flexible enough that I could do whatever I want.

dave-tucker avatar Jun 03 '22 10:06 dave-tucker

Rebased now that the &self PR is merged.

ishitatsuyuki avatar Jun 09 '22 16:06 ishitatsuyuki

Also forgot to say in the review: I think we should implement the async bits of this, like we do for perf buffers

alessandrod avatar Jun 13 '22 20:06 alessandrod

On the async impl: with the callback model there's not much we can do inside aya to support an async model (see https://github.com/aya-rs/aya/pull/294#issuecomment-1143375108); if we move to an iterator-like model, though, then it probably makes sense to do that.

ishitatsuyuki avatar Jun 18 '22 02:06 ishitatsuyuki

@ishitatsuyuki Hi! Do you have time to address the remaining comments? If not, I would be happy to take over.

vadorovsky avatar Jul 25 '22 20:07 vadorovsky

I'll try to get a revision on this later today.

ishitatsuyuki avatar Jul 26 '22 01:07 ishitatsuyuki

Sorry, ran out of time today but will prioritize this tomorrow.

ishitatsuyuki avatar Jul 26 '22 15:07 ishitatsuyuki

I'm getting really busy with my internship kicking in in August, so I'd prefer someone else to take this over. Sorry for the delay and for missing the promise.

I investigated implementing an iterator API, but doing it naively involves more overhead, since it doesn't allow caching the producer index and results in a redundant cache miss due to how cache coherency works. Caching it is probably not hard, but I haven't tried that yet.

ishitatsuyuki avatar Aug 06 '22 14:08 ishitatsuyuki

No worries, I will try to take this over.

vadorovsky avatar Aug 06 '22 18:08 vadorovsky

I still don't have time to commit to this but I've made some corrections to the memory ordering used on the host side, which I've found while doing research on the correctness of SPSC queues. The justification can be found in the comments.

No change is expected for compiled code on x86 since loads are always compiled to ordinary loads.

ishitatsuyuki avatar Aug 15 '22 11:08 ishitatsuyuki

Just did a rebase to fix build errors. Since I keep getting nerdsniped by the atomics stuff (just figured out that SeqCst is only required for the test right before you sleep/return None), I'm thinking of finishing this incrementally in my free time. @vadorovsky I hope you don't mind?

I think the review comments that we definitely should address are:

  • Have an iterator-like interface. (It looks like the overhead will be negligible if you cache the indices; machine-code-wise this is going to be store-forwarded or an L1 cache hit. https://github.com/rigtorp/SPSCQueue is a good reference point.)
  • Have an async iterator-like interface. I think this should come in a follow-up PR, mostly because this PR is already getting old, but also because there might be additional design decisions to make. One of them is batching during polling: I find that sleeping for a small amount of time after getting the epoll notification greatly improves efficiency. But I'm not sure what default behavior and what configuration knobs should be implemented.

ishitatsuyuki avatar Aug 24 '22 16:08 ishitatsuyuki

I think the review comments that we definitely should address are:

* Have an iterator-like interface. (It looks like the overhead will be negligible if you cache the indices; machine-code-wise this is going to be store-forwarded or an L1 cache hit. https://github.com/rigtorp/SPSCQueue is a good reference point.)

Ah thanks for the link looks interesting

* Have an async iterator-like interface. I think this should come in a follow-up PR, mostly because this PR is already getting old, but also because there might be additional design decisions to make. One of them is batching during polling: I find that sleeping for a small amount of time after getting the epoll notification greatly improves efficiency. But I'm not sure what default behavior and what configuration knobs should be implemented.

Ack I'm perfectly ok with the async bit being done separately. We need to implement some kind of batching for perf event arrays too, I didn't do it at the time since I wasn't sure what the API to opt-in to batching would look like.

alessandrod avatar Aug 24 '22 17:08 alessandrod

Sorry, after a long period of silence, here it is. I replaced process_ring() completely with an iterator-like API. A usage example is in the doc.

I didn't go as far as implementing `Iterator`, though, since it has to be a streaming (lending) iterator, although that might be feasible now that GATs are stabilized.

I did test this a while ago, so it should probably work. But it looks like right now I'd need to downgrade my kernel due to BTF changes, which I don't have time to do.

ishitatsuyuki avatar Nov 05 '22 13:11 ishitatsuyuki

Just wanted to say that this is really great and highly anticipated work 🚀 What can be done to see this merged and released?

Keep it up!

brevilo avatar Nov 29 '22 20:11 brevilo

Great to see this being taken up again! It will be an important and highly anticipated addition to Aya's feature set!

For reference: here is the upstream author's take on the ringbuf goals, its similarities to the perfbuf implementation, and its advantages over it. I think this should guide how things are fleshed out in Aya as well.

Thanks!

brevilo avatar Jan 23 '23 08:01 brevilo

I'm interested in helping get this over the finish line. What's needed to complete this work?

ajwerner avatar May 15 '23 23:05 ajwerner

I'm interested in helping get this over the finish line. What's needed to complete this work?

It looks like the sync API is in place but the async API still needs work

alessandrod avatar May 16 '23 01:05 alessandrod

I put up https://github.com/aya-rs/aya/pull/629 as a rebase of this change and an integration test. It does not do anything towards an async API, but I agree that we should leave that for a subsequent PR.

ajwerner avatar Jun 19 '23 20:06 ajwerner

@ishitatsuyuki did you ever have an example of a test program that demonstrated the epoll hang? I couldn't follow your argument well enough to justify the fences in code, so I went with memory accesses like in libbpf for https://github.com/aya-rs/aya/pull/629. I'd love to understand when that fails and how.

ajwerner avatar Jul 13 '23 02:07 ajwerner