aya: Implement RingBuf
Based on @willfindlay's branch; I fixed the bugs I found and it appears to be working fine in my setup.
The original branch took the callback as part of the struct, but since that requires a 'static lifetime it could be annoying. Instead, take it as an argument to process_ring so it only needs to outlive that call.
Branch based on #290.
Closes: #12 Closes: #201
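For illustration, a minimal sketch of the callback-as-argument design described above; the `RingBuf` internals and the error type here are stand-ins, not the PR's exact definitions:

use std::io;

pub struct RingBuf { /* mapped consumer/producer pages, data area, ... */ }

impl RingBuf {
    // Taking the callback per call means it only has to outlive this
    // invocation, so it can borrow from the caller's stack. Storing it
    // in the struct would instead force an `FnMut(..) + 'static` bound.
    pub fn process_ring<F>(&mut self, callback: &mut F) -> io::Result<()>
    where
        F: FnMut(&[u8]) -> io::Result<()>,
    {
        // ... drain all available samples, invoking `callback` on each
        Ok(())
    }
}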
Doing async with tokio is a little bit more verbose, but I'm doing it like this:
loop {
    let mut guard = ring.readable_mut().await?;
    guard.get_inner_mut().process_ring(&mut |e| {
        // Do things
        Ok(())
    })?;
    guard.clear_ready();
}
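(For context, the loop above assumes the ring's pollable fd has been wrapped in tokio's AsyncFd, roughly like the following; that `ring_buf` exposes its fd via AsRawFd is my assumption:)

use tokio::io::unix::AsyncFd;

// Assumed setup: the RingBuf map handle implements AsRawFd, so its
// epoll-able fd can be registered with tokio's reactor.
let mut ring = AsyncFd::new(ring_buf)?;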
Looks like this can get stuck with epoll (miss the notification somehow). I'll investigate.
Throwing SeqCst at the consumer pos store fixes this.
unsafe { (*self.consumer_pos_ptr).store(consumer_pos, Ordering::SeqCst) };
I'm not very good at reasoning about concurrency, but it's possible that the libbpf implementation has bogus synchronization. Cilium (Go) defaults to SeqCst, so it won't see the same issue.
EDIT: it might also be because tokio migrates the read function across threads, while libbpf essentially assumes single-threaded polling.
Thanks for your review.
I think that in the async version of this code we should, instead of using a callback, use the tx side of an mpsc channel.
Using mpsc necessitates a copy — ring processing speed isn't really the concern here, as the application has the freedom to size the ring buffer. (Also mpsc means double ring buffering — not sure that's on point.)
The only other zero-copy option would be returning a guard object that updates consumer pointer on Drop — but that's not much better I'd say.
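(To make that concrete, a hedged sketch of the guard idea; `RingBuf` and `advance_consumer` are hypothetical names, not anything in this PR:)

// Zero-copy alternative: the sample stays borrowed from the ring until
// the guard is dropped, and only then does the consumer position advance.
struct SampleGuard<'a> {
    ring: &'a mut RingBuf,
    sample_len: usize,
}

impl Drop for SampleGuard<'_> {
    fn drop(&mut self) {
        self.ring.advance_consumer(self.sample_len); // hypothetical helper
    }
}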
Thanks for taking this over!
Re: the "stuck" polling issue, it's an actual upstream flaw. The ringbuf claims to be designed around acquire-release ordering but it's actually relying on sequential consistency implicitly in the kernel space, while the userspace counterpart is simply busted. Consider the following simplified example, where the "reserve" and "commit" operation is fused and simply increments the producer counter. A notification is sent if and only if the incremented producer counter == consumer counter + 1.
kernel: write p 2 -> read c X -> write p 3 -> read c 1 (X doesn't matter)
user : write c 2 -> read p 2
With sequential consistency it's guaranteed that once p is updated from 2 to 3, c == 1 cannot be observed, since the write of c == 2 precedes the read of p == 2, which must happen before the write of p == 3. However, with acq-rel there's no meaningful causal relation here at all: synchronization is only established when a read observes a write, and neither the read of c == 1 nor the read of p == 2 builds a causal relationship with anything beyond the operations that happened before the first operation in the diagram above.
What this means is that we need a SeqCst fence after the write on both sides. For userspace, we use a SeqCst store; the kernel side might seem doomed, but luckily it already does an xchg operation to update the BUSY flag, which implicitly serves as a full SeqCst barrier. So the kernel side doesn't actually need to change, though only by accident: its comments only claim an acq-rel relationship, which is insufficient.
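To make the failure shape concrete, here's a minimal sketch of the userspace half of that store-buffering pattern (the field and method names are mine, not the PR's): with only a Release store and an Acquire load, both sides can read the other's stale counter, userspace goes to sleep, the kernel skips the notification, and the consumer hangs.

use std::sync::atomic::{AtomicU64, Ordering};

struct RingState {
    producer_pos: AtomicU64, // written by the kernel, read by userspace
    consumer_pos: AtomicU64, // written by userspace, read by the kernel
}

impl RingState {
    // Returns true if it's safe to sleep on epoll: we published our new
    // consumer position and then observed no unconsumed data.
    fn advance_and_check_empty(&self, new_consumer_pos: u64) -> bool {
        // SeqCst store: together with the kernel's full barrier (the xchg
        // on the BUSY flag), at least one side must observe the other's
        // write, so the wakeup cannot be lost.
        self.consumer_pos.store(new_consumer_pos, Ordering::SeqCst);
        self.producer_pos.load(Ordering::Acquire) == new_consumer_pos
    }
}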
A bunch of updates: documentation for aya-bpf, a reserve/commit API that is actually safe, fixes for the review comments about process_ring error handling, plus the fix (with a short explanation) for the awful coherence issue whose right solution took me days to figure out. I'll open a separate PR for the codegen soon.
Updated RingBufEntry to consume self on discard/submit.
Suppressed a clippy lint, see code for details.
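For illustration, the consuming reserve/submit flow from the BPF side looks roughly like this (a sketch; the exact signatures and the `#[map]` plumbing are assumptions, not copied from the PR):

// In an aya-bpf program: reserve space, write the payload, then submit.
// Because submit/discard take `self` by value, a reserved entry cannot
// be used after it has been submitted or discarded.
#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(4096, 0);

fn emit(value: u64) {
    match EVENTS.reserve::<u64>(0) {
        Some(mut entry) => {
            entry.write(value);
            entry.submit(0); // consumes the entry
        }
        None => {
            // The ring is full; the event is dropped.
        }
    }
}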
@ishitatsuyuki codegen changes are in main so feel free to rebase to pick those up when you're ready.
Will take a look at the updates and see if I've got any review comments to add, but I think we're waiting on @alessandrod to take a look (at this and #290) before this is OK to merge.
> Using mpsc necessitates a copy — ring processing speed isn't really the concern here, as the application has the freedom to size the ring buffer. (Also mpsc means double ring buffering — not sure that's on point.)
> The only other zero-copy option would be returning a guard object that updates consumer pointer on Drop — but that's not much better I'd say.
Having thought on this more, I agree, you're right. Different applications are going to have different requirements, some of which can be met by sizing the ringbuf, data notification control, etc. Either way, the API as it stands seems flexible enough that I could do whatever I want.
Rebased now that the &self PR is merged.
Also forgot to say in the review: I think we should implement the async bits of this, like we do for perf buffers.
On the async impl: for the callback model there's not much we can do inside aya to support an async model (see https://github.com/aya-rs/aya/pull/294#issuecomment-1143375108); though if we move to an iterator-like model, it probably makes sense to do that.
@ishitatsuyuki Hi! Do you have time to address the remaining comments? If not, I would be happy to take over.
I'll try to get a revision on this later today.
Sorry, ran out of time today but will prioritize this tomorrow.
I'm getting really busy with my internship kicking in in August, so I'd prefer someone to take this over. Sorry for the delay and for missing the promise.
I investigated implementing an iterator API, but doing it naively involves more overhead, since it doesn't allow caching the producer index and results in a redundant cache miss due to how cache coherency works. It's probably not hard to cache it, though; I just haven't tried yet.
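For reference, the caching would look roughly like this (hypothetical names; the real variable-length sample layout with its BUSY/DISCARD bits is elided):

use std::sync::atomic::{AtomicU64, Ordering};

struct Ring {
    producer_pos: AtomicU64,
    // ... consumer page, data pages, etc.
}

struct Samples<'a> {
    ring: &'a Ring,
    consumer_pos: u64,        // ours alone, never needs reloading
    cached_producer_pos: u64, // snapshot of the shared counter
}

impl Samples<'_> {
    fn next_offset(&mut self) -> Option<u64> {
        if self.consumer_pos == self.cached_producer_pos {
            // Touch the shared cache line only once the snapshot is
            // drained, instead of once per sample.
            self.cached_producer_pos =
                self.ring.producer_pos.load(Ordering::Acquire);
            if self.consumer_pos == self.cached_producer_pos {
                return None; // ring is empty
            }
        }
        let offset = self.consumer_pos;
        self.consumer_pos += 8; // pretend fixed-size 8-byte samples
        Some(offset)
    }
}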
No worries, I will try to take this over.
I still don't have time to commit to this, but I've made some corrections to the memory ordering used on the host side, which I found while researching the correctness of SPSC queues. The justification can be found in the comments.
No change is expected in the compiled code on x86, since atomic loads always compile to ordinary loads there.
Just did a rebase to fix build errors. Since I keep getting nerdsniped by the atomics stuff (I just figured out that SeqCst is only required for the check right before you sleep/return None), I'm thinking of finishing this incrementally in my free time. @vadorovsky I hope you don't mind?
I think the review comments that we definitely should address are:
- Have an iterator-like interface. (It looks like the overhead will be negligible if you cache the indices; machine-code wise this is going to be store-forwarded or an L1 cache hit. https://github.com/rigtorp/SPSCQueue is a good reference point.)
- Have an async iterator-like interface. I think this should come in a follow-up PR, mostly because this PR is already getting too aged, but also because there might be additional design decisions to be made. One of them is batching during polling: I find that sleeping for a small amount of time after getting the epoll notification greatly improves efficiency (see the sketch after this list). But I'm not sure what default behavior and what configuration options should be implemented.
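To sketch the batching idea from the second bullet, building on the AsyncFd loop from earlier (the sleep duration is an arbitrary placeholder and `handle` is a hypothetical per-sample function):

use std::time::Duration;

loop {
    let mut guard = ring.readable_mut().await?;
    // Let more samples accumulate before draining, trading a little
    // latency for far fewer wakeups; this would need to be configurable.
    tokio::time::sleep(Duration::from_micros(100)).await;
    while let Some(sample) = guard.get_inner_mut().next() {
        handle(sample)?;
    }
    guard.clear_ready();
}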
> I think the review comments that we definitely should address are:
> * Have an iterator-like interface. (It looks like the overhead will be negligible if you cache the indices; machine-code wise this is going to be store-forwarded or an L1 cache hit. https://github.com/rigtorp/SPSCQueue is a good reference point.)
Ah thanks for the link, looks interesting.
> * Have an async iterator-like interface. I think this should come in a follow-up PR, mostly because this PR is already getting too aged, but also because there might be additional design decisions to be made. One of them is batching during polling: I find that sleeping for a small amount of time after getting the epoll notification greatly improves efficiency. But I'm not sure what default behavior and what configuration options should be implemented.
Ack, I'm perfectly OK with the async bit being done separately. We need to implement some kind of batching for perf event arrays too; I didn't do it at the time since I wasn't sure what the API to opt in to batching would look like.
Sorry, after a long period of silence, here it is. I replaced process_ring() completely with an iterator-like API. A usage example is in the docs.
I didn't go as far as implementing Iterator, though, since it has to be a streaming iterator, although that might be feasible now that GATs are stabilized.
I did test this a while ago, so it should probably work. But it looks like I'd currently need to downgrade my kernel due to BTF changes, which I don't have time to do.
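Until the rendered docs are up, a hedged sketch of the usage (names follow the PR; exact signatures may differ):

// Userspace: drain everything currently in the ring, then poll again.
// `bpf.map_mut("EVENTS")` is assumed to yield the ring buffer map.
let mut ring = RingBuf::try_from(bpf.map_mut("EVENTS").unwrap())?;
while let Some(sample) = ring.next() {
    // `sample` borrows from the ring (zero-copy) and must be dropped
    // before the next call, which is why this can't implement
    // std::iter::Iterator without GATs.
    process(&sample);
}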
Just wanted to say that this is really great and highly anticipated work 🚀 What can be done to see this merged and released?
Keep it up!
Great to see this being taken up again! It will be an important and highly anticipated addition to Aya's feature set!
For reference: here is the upstream author's take on the ringbuf's goals, its similarities to the perfbuf implementation, and its advantages over it. I think this should guide how things are fleshed out in Aya as well.
Thanks!
I'm interested in helping get this over the finish line. What's needed to complete this work?
> I'm interested in helping get this over the finish line. What's needed to complete this work?
It looks like the sync API is in place but the async API still needs work.
I put up https://github.com/aya-rs/aya/pull/629 as a rebase of this change and an integration test. It does not do anything towards an async API, but I agree that we should leave that for a subsequent PR.
@ishitatsuyuki did you ever have an example of a test program that demonstrated the epoll hang? I couldn't follow your argument well enough to justify the fences in code, so I went with memory accesses like in libbpf for https://github.com/aya-rs/aya/pull/629. I'd love to understand when that fails and how.