Netstacklat: Add filtering and grouping
Hi @tohojo and @netoptimizer. This is only draft PR for now to facilitate some discussion. For a real PR I imagine we break this down into some smaller chunks (this is 4 different branches stacked on top of each other). Each of those essentially deals with:
- Change the socket enqueue hook because the one I used had some issues
- Multiplex all histograms into a single hashmap (needed for the grouping support, and is generally more convenient)
- Additional filtering options (ifindex, network namespace and cgroup. TODO: add filter for non-empty socket rx queues)
- Groupby support (ifindex and cgroups)
So I do not expect you to review this in detail as of now (although if you have nothing better to do feel free to do so, do not have any major cleanup planned). I would however like some feedback on some design decisions and the limitations they bring. The design decisions/limitations that I would like you to consider are:
-
For multiplexing histograms:
- In userspace I maintain a sorted array of the histogram keys to allow relatively fast lookup via
bsearch. Insertion is not very efficient (O(n)), but should be manageable as long as we do not have more than few thousand histograms (will do some more performance tests). - If the BPF hashmap runs out of space (which may happen with the groupby options) it will simply not create any new entries. This may result in it keeping partial histograms, where some buckets are miss-reported as being empty. I do not attempt to evict any existing entry as I do not really have any good grounds to do that on. LRU could be an option, but Jesper mentioned they do not work too well with many CPUs.
- If userspace runs out of space for histogram keys (which may happen with groupby), it will warn that it's missing some histograms collected by the eBPF programs, but keep running.
- In userspace I maintain a sorted array of the histogram keys to allow relatively fast lookup via
-
For ifindex filtering:
- I do not support ifindex > 16384 (can be bumped up a bit, but not practical to support >> 1000000 (see commit message for b05cd3837 - "netstacklat: Add filtering for network interfaces")
- I only check the ifindex, I do not jointly consider ifindex - namespace combinations (only really intend you to track one namespace, see next point)
-
For network namespace filtering:
- By default I set netstacklat to filter for the same network namespace as you run the tool in (which will probably be the root network namespace)
- I only support filtering for a single network namespace (see message for eb291bdf - "netstacklat: Add filtering for network namespace"). However, you can disable the filtering to get data for all network namespaces.
-
For groupby options:
- You supply each groupby option as it's own command-line argument (i.e.
--groupby-interfaceand--groupby-cgroup). It would perhaps be nicer to have--groupby interface,cgroup, but I find that adds needless complexity as long as we only have 2 options. For now it's unlikely I'll add many more groupby options, as the amount histograms you get for all combinations quickly explode. - For
--groupby-interfacethe user space agent prints out the ifindex, not the interface name (see message for d42cdc20 - "netstacklat: Add option to groupby interface"). The provided ebpf-exporter config attempts to print out the interface names (but I assume that will not work as intended if the ifindex is from a different namespace). - For
--groupby-cgroupthe user space agent prints out the cgroup ID (inode), not the path (see message for cf7db6f593 - "netstacklat: Add option to groupby cgroup"). The provided ebpf-exporter config prints out the paths.
- You supply each groupby option as it's own command-line argument (i.e.
Further details can be found in each commit message.
@netoptimizer There's currently (AFAIK) no convenient way to provide the various values to filter for with ebpf-exporter. If PR 531 for ebpf-exporter is merged it seems like you would be able to provide a cgroup path regex in the YAML config, but that still leaves filtering on PIDs, interfaces and network namespace. The network namespace can be hardcoded in the config in the eBPF source code, but the rest need to be set in BPF maps (after their corresponding filter_xxx member in the config in the eBPF source has been set to true). That can be done externally with bpftool. Can hack up some shell script that does that if you want.
I've now added the filter for non-empty socket rx-queue that Jesper requested as well. See commit 7dc64ad19 - netstacklat: Add filtering for non-empty rxqueue.
I've rebased this one the updated main branch (after @tohojo changed the compilation process) and it still seems to compile correctly (it works on my machine =)).
Updated status from draft -> ready for review as @netoptimizer seemed willing to potentially review it in its current state (i.e. without breaking this up into a bunch of separate PRs).
The latest force push changed the value type in all filter maps from u8 to u64 to allow the cgroup filtering to work together with the cgroup_id_map introduced to ebpf-exporter in PR cloudflare/ebpf_exporter#531. Also updated the YAML config file to demonstrate how the cgroup filtering can be used with ebpf-exporter.
Hmm, I now also saw that there was a PR cloudflare/ebpf_exporter#565 that has also been merged. That PR adds support for using a cgroup storage map (BPF_MAP_TYPE_CGRP_STORAGE) instead of a hashmap to keep track of a set of desired cgroups.
So that way you can switch out the somewhat expensive hash map lookup to a (hopefully faster) cgroup storage lookup. So might be worthwhile to consider migrating the cgroup filtering support to use a cgroup storage map instead, although that will limit it to kernels >= v6.2.
Hi @simosund I've started to convert this into ebpf_exporter, and I've experimented with this on a prod experiment server. See my progress via https://github.com/netoptimizer/ebpf_exporter/pull/1
I notice that if we have disabled network_ns filtering, the code still does a lookup of the current name space via calling get_network_ns() as input to the filter.
See how I changed the code here to avoid this:
- https://github.com/netoptimizer/ebpf_exporter/pull/1/commits/b9c4570e74a5d4c
I noticed another bug.
Sockets also have a sk_backlog queue, so the check for non-empty filter_nonempty_sockqueue also need to check if there are any packets left on the backlog queue.
See how I changed the code here:
- https://github.com/netoptimizer/ebpf_exporter/pull/1/commits/780984dc5db38854f1e
@netoptimizer Thanks for reporting these more generally applicable fixes, will make sure to "backport" them here.
Sockets also have a sk_backlog queue, so the check for non-empty filter_nonempty_sockqueue also need to check if there are any packets left on the backlog queue.
In general I think it probably makes sense to just replace the filter_nonempty_socketqueue with the more general filter_min_queue_len you've implemented, as that gives more granular control (and setting filter_min_queue_len to 1 should be equivalent to enabling filter_nonempty_sockqueue).
Have addressed @tohojo's comments and fixed the bugs @netoptimizer pointed out here. Additionally, I completely replaced the filter for non-empty socket queues with @netoptimizer's more general socket queue length filter. The later provides more flexibility without being considerably more complicated, and I see no point in having both (as the later should offer a superset in functionality compared to the former).
As some of that code is heavily based on @netoptimizer's code I've added signed-by tags from him as well, but I'm not sure if that's the appropriate way to credit someone in the commits. So let me know what the proper git manners are.
Multiple signed-off-by signifies that you took a patch that Jesper wrote and simply applied it without making any changes (except possibly minor ones). If you instead wrote some code based on what Jesper wrote (but you still made changes yourself so it's no longer the same patch), the convention is to use a Co-developed-by tag instead...
Multiple signed-off-by signifies that you took a patch that Jesper wrote and simply applied it without making any changes (except possibly minor ones). If you instead wrote some code based on what Jesper wrote (but you still made changes yourself so it's no longer the same patch), the convention is to use a Co-developed-by tag instead...
Ok. So in the first instance I've taken the fix from @netoptimizer at https://github.com/netoptimizer/ebpf_exporter/commit/b9c4570e74a5d4c pretty much exactly as it is. But that is a fix to a version of my code that no longer exists here (as I rolled in the fix directly in the same commit introducing that functionality). And the fix only makes up a part of the overall code in that commit. So guess that's closer to Co-developed-by then.
In the second case, the commit that adds the queue length filter, the whole eBPF-side code is pretty much ripped from @netoptimizer (although not from a single commit, but rather the joint result of https://github.com/netoptimizer/ebpf_exporter/pull/1/commits/28c37132c25197930da39dccd7ea49956e481355, https://github.com/netoptimizer/ebpf_exporter/pull/1/commits/780984dc5db38854f1ef7c05bcad39f161216794 and https://github.com/netoptimizer/ebpf_exporter/pull/1/commits/f8671e35e40c394f75502d4c852321abd2dc1702) with just some minor changes to naming. I have however still added the (rather trivial in this case) user space code as needed for this stand-alone version of the tool. So still not just applying a patch from @netoptimizer, but his code makes up the key part of that commit.
Changed both tags to Co-developed-by in this case. Should maybe also wait for a thumbs up from @netoptimizer that this updated version of the PR suits his needs. The version before the review and associated changes can be found at https://github.com/simosund/bpf-examples/tree/netstacklat-groupby-backup20250829 for anyone who wants to do a diff.
Alright, cool. @netoptimizer PTAL, and feel free to merge when you're satisfied with the results :)
Looking through diff between branches (on cmdline) and things generally Looks Good To Me :-)
git diff netstacklat-groupby-backup20250829..netstacklat-groupby
I'm okay with renaming user_config from filter_min_queue_len to filter_min_sockqueue_len
- as the filter only works for sockets
- I'll adjust this in my conversion to ebpf_exporter
- You kept the function name as
filter_min_queue_len()was this on purpose?
I'm okay with renaming
user_configfromfilter_min_queue_lentofilter_min_sockqueue_len* as the filter only works for sockets * I'll adjust this in my conversion to ebpf_exporter * You kept the function name as `filter_min_queue_len()` was this on purpose?
I will admit that I haven't thought too deeply about this. Figured there might be some other queue (e.g. the hardware RX queue) or something that could be relevant, so figured being slightly more specific wouldn't hurt. Didn't think of renaming the function name for whatever reason, but should probably do that to be consistent.
Okay, to be consistent let us also rename the function :-)
Last force-push renamed the function filter_min_queue_len() -> filter_min_sockqueue_len()
Great I'm going to merge this! :-)