Opportunity: bpf iterators for massive speed-up?
On a linux system with around 250 Procs it took the procfs implemention 5.45ms vs 75.6us for bpf (bpf is ~72x faster). On a linux system with around 10,000 Procs it took the procfs implemention ~296ms ms vs 3ms for bpf (bpf is ~100x faster).
The performance difference comes from several key factors. Procfs reading requires multiple system calls per process. BPF iterators use just one system call regardless of process count. Each system call involves user/kernel context switches. BPF iterators eliminate this per-process overhead by running entirely in kernel space until completion. Also BPF programs access kernel data structures directly, this eliminates file system overhead and buffer copying.
https://www.grant.pizza/blog/bpf-iter/
What's the baseline kernel version required for this to work?
Can this be implemented without access to the kernel headers of the target system?
Seems bpf iterators were introduced in v5.7-rc2-1172-gde4e05cac46d .. and the blog test code seems compilable without extra headers..
Looking at the demo there is a vmlinux.h, which is generated with bpftool from the running kernel via a shell script that boils down to basically bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h. There is a raw mode that outputs some (machine-readable???) version of that information, but still you essentially have dynamic code that needs to be built tailored to the kernel on which you want to execute the BPF program.
Now with this resolved, we are facing two main issues:
- Compiling the BPF program in a safe manner while (potentially) running as root ourselves
- Loading the compiled BPF program into the kernel (which requires root AFAIR¹).
As htop needs to run without root permissions, this is DoA based on the second account right from the get-go.
On the other hand, ignoring the permission issue for now, let's look at generating the eBPF bytecode. The easiest way would be using GCC or Clang to compile some C code directly to eBPF, which is basically the way that the example takes (just not on-the-fly). OTOH if you really simplify the serializer logic to a bare minimum, it's effectively a glorified memcpy with side quests. with all the offsets available in the BTF/CO-RE data you can read from the /sys/kernel/btf/vmlinux endpoint the kernel MAY provide (AFAIR this is an optional feature for a kernel to have). So we could build a basic generator that was:
- Parsing the BTF information from the running kernel
- Do some pre-processing on the serialization patterns, thus creating an "action plan"
- Generate a list of glorified
memcpys using eBPF bytecode, keeping track of possible relocations - Generating the metadata and relocation information for the BPF loader
- Load the abomination of auto-generated kot² into the kernel (hoping it gets accepted by the verifier)
- …
- Profit
On the userland side, we'd then use the collected data from step 2 to de-serialize the generated data stream into some structure that htop understands. This means that to be compatible across kernel versions htop would need to track which fields in the task and vma structures map to its internal fields in order to look up the information in the de-serialized structures. This will incur a noticeable memory footprint as the full in-kernel structures will be replicated inefficiently for user-space plus additionally into htop's native structures.
So, to summarize:
- Possible in principle
- Will have a dependency on the running kernel
- Will either use external tools at compile-time or runtime AND will need to replicate non-trivial parsing logic
- Will need
CAP_BPFor otherwise elevated privileges (short: security nightmare) - Will still require a fallback to current platform support if a kernel does not support eBPF, the eBPF iterator fails to load, the user does not have sufficient privileges OR Hell has frozen over³
- Will cause a significant increase in htop's resource usage
I really like the basic premise (and actually already had a different idea regarding BPF watch-points/probes for tracking process creation/termination, cf. #266), but given the circumstances surrounding eBPF it's practically speaking a portability and security nightmare if you are trying to get it into production code.
That said: I'd be glad to be proven wrong by some PR containing a viable implementation of the idea. ;-)
¹actually CAP_BPF, but close enough …
²German false cognate
³I heard that sometimes happens in winter