Extend ebpf perf functions
We currently have a requirement to retrieve all available data from multiple kernel perf ring buffers in a single operation, so we would like to extend the perf module to support this by adding the following functions:
// ReadAllRings iterates through all ready rings and reads events,
// similar to reader_event_read.
ReadAllRings()
// EpollWait wraps the epoll waiting logic and allows specifying the timeout as needed.
EpollWait(d time.Duration)
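For illustration, the intended consumption pattern with these proposed functions might look like the sketch below. Everything here is hypothetical: EpollWait and ReadAllRings are the additions proposed above (ReadAllRings is assumed to return the drained records), and handleBatch stands in for application code.

```go
// Hypothetical sketch of the proposed batch-processing loop. Neither
// EpollWait nor ReadAllRings exists in the perf package today; the
// ReadAllRings return type and handleBatch are assumed for illustration.
for {
	// Block until at least one per-CPU ring is ready, or the
	// caller-chosen timeout expires.
	rd.EpollWait(100 * time.Millisecond)

	// Drain every ready ring in a single pass and hand the records to
	// the application as one batch.
	records := rd.ReadAllRings()
	handleBatch(records)
}
```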
Thanks for your contribution. Some questions:
// ReadAllRings iterates through all ready rings and reads events,
// similar to reader_event_read.
ReadAllRings()
Can you elaborate on why using (*Reader) Read() in a loop does not work for you?
// EpollWait wraps the epoll waiting logic and allows specifying the timeout as needed.
EpollWait(d time.Duration)
Can you also elaborate on this suggestion? Can you share how you intend to use the perf buffer, and why the current API limits you?
I need an externally controlled batch-processing model:
call EpollWait(d) with my own timeout, then read all ready rings at once and process them in batches. This reduces system calls, minimizes wakeups, and lets me fully control the scheduling logic.
ReadInto, however, is an internally driven, record-by-record model. It automatically performs Wait, manages its own state machine (pendingErr, epollRings), and prevents me from controlling the waiting strategy or batch-processing flow. Therefore, it doesn’t fit my high-throughput continuous sampling scenario.
If there's anything wrong with the way I implemented this, please feel free to let me know at any time :)
@liuchangyan Have you considered using a bpf ringbuf instead of a perf event map? It's much less complicated to consume from user space since there's only 1 ring, which also makes it scale better on nodes with large amounts of CPUs.
It has a built-in wakeup scheduler/coalescing algorithm that should fit 99% of use cases, and should address your batch processing concern.
Also, if you use bpf_ringbuf_reserve + bpf_ringbuf_commit, there's less copying needed on the bpf side and less stack pressure, which should all improve the performance and reduce CPU usage of your program.
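To make the suggestion concrete, a minimal user-space consumer built on the ringbuf package might look roughly like this sketch; objs.Events (a loaded BPF_MAP_TYPE_RINGBUF map), process, and the error handling are placeholders, not code from this PR:

```go
// Sketch: consume a single BPF ring buffer with github.com/cilium/ebpf/ringbuf.
// objs.Events and process are placeholders for the caller's map and handler.
rd, err := ringbuf.NewReader(objs.Events)
if err != nil {
	log.Fatal(err)
}
defer rd.Close()

for {
	rec, err := rd.Read()
	if errors.Is(err, ringbuf.ErrClosed) {
		return
	}
	if err != nil {
		log.Println(err)
		continue
	}
	// One ring shared by all CPUs, so there is no per-CPU fan-in to manage.
	process(rec.RawSample)
}
```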
This feature is only available on Linux 5.x kernels, while most of our current use cases are on Linux 4.x kernels. cc @florianl
If I understand correctly the library already does the things you want.
- SetDeadline allows you to control the wait duration.
- ReadInto already opportunistically polls all rings: https://github.com/cilium/ebpf/blob/f150ced93791ac9bd374e520c273093f2f652b41/perf/reader.go#L380-L385
Please note that 4.x series kernels are not supported by this library anymore.
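As a rough sketch of that pattern with the existing API (events, handle, and the buffer size are placeholders, error handling abbreviated):

```go
// Caller-controlled wait duration with the current perf.Reader API:
// SetDeadline bounds how long Read blocks, and the loop drains whatever
// the per-CPU rings have buffered.
rd, err := perf.NewReader(events, os.Getpagesize())
if err != nil {
	log.Fatal(err)
}
defer rd.Close()

for {
	// Emulate a per-wait timeout by moving the deadline forward each iteration.
	rd.SetDeadline(time.Now().Add(100 * time.Millisecond))

	rec, err := rd.Read()
	if errors.Is(err, os.ErrDeadlineExceeded) {
		continue // nothing arrived within the chosen timeout
	}
	if err != nil {
		log.Fatal(err)
	}
	handle(rec)
}
```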
But it cannot return all events from all ring buffers in a single call for me to process, as I mentioned in my comment above: it doesn't fit my high-throughput continuous sampling scenario.
Subsequent calls to ReadInto will not Wait() again until the rings are empty. The only overhead I can think of from calling ReadInto multiple times is lock contention on the mutex, which seems unlikely to matter if you have a single reader.
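For reference, a drain loop along those lines, reusing a single Record to avoid per-sample allocations (rd and handle are placeholders, this is only a sketch):

```go
// Repeatedly calling ReadInto returns buffered samples without polling again;
// the reader only goes back to epoll once the rings are empty.
var rec perf.Record
for {
	if err := rd.ReadInto(&rec); err != nil {
		if errors.Is(err, perf.ErrClosed) {
			return
		}
		log.Println(err)
		continue
	}
	handle(rec)
}
```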
My point of view is that the proposed API is pretty intrusive because it exposes a lot of the internals. I also don't understand where the bottleneck is supposed to be. If there is a performance problem, it would be great to have a reproducer plus a pprof profile showing what needs fixing before changing the API.
@florianl if you agree I'd suggest you close the PR until we get something more concrete.
I agree with @lmb.