WIP: Remove dependency on kallsyms with eBPF
I've been trying to improve perf tools' startup time to make working on fixing broken tests/features of perf bearable on NixOs. I've learned that processing /proc/kallsyms is a costly operation; on my ryzen5 system around 100ms just to read through it.
Aside from decreasing how many times kallsyms are read, I started to look into how to remove dependency on kallsyms. From kernel source code kernel/kallsyms.c and printk() documentation https://docs.kernel.org/core-api/printk-formats.html#symbols-function-pointers I learned about special pointer formatting flags in kernel.
Writing a custom kernel module just for the purpose of extracting symbol names didn't feel right to me, so I've tried to use eBPF for that purpose. eBPF programs have these helpers available that are promising:
- bpf_snprintf
- bpf_trace_printk
I've looked into how these two are implemented and found that both of them use bstr_printf underneath. But before any printing is done, the format string must first go through bpf_bprintf_prepare, which disallows certain flags.
Fortunately for us, thanks to Florent Revest https://lore.kernel.org/bpf/[email protected]/ %pB is accepted. We might consider adding that info to bpf-helpers man page?
Overview of the new approach:
- eBPF profiler does low-latency recording of stack frames as it used to;
- Go through all kernel ips recorded in stack traces and save them in bpf_hashmap(K=u64, V=char[MAX_SYM_LEN]) with an empty value for now;
- Call eBPF converter program, which will populate values in the hashmap with bpf_snprintf(%pB);
- Report recorded callstacks using the now-populated hashmap for ksyms.
In order to reliably trigger the converter program, I decided to use USDT.
Running time ./profile -F 2344 1 on a mostly idle system I got Before:
real 0m1,215s
user 0m0,058s
sys 0m0,157s
After: real 0m1,045s user 0m0,009s sys 0m0,026s
I ask the community here for your opinion, help and guidance to make this mergeable.
Using %pB slightly changes the format of a symbol name. Example: kmem_cache_alloc_noprof+0x2cf/0x300 It would be trivial to remove the suffix if it's necessary. Generating flamegraphs with Brendan Gregg's perl script still works.
For now, max_entries of the hashmap is hardcoded. Would you make it dynamic, like stack-storage-size or compute it by collecting all ips into a set and take this set's size as value for max_entries?
In Makefile: When adding bpf/usdt.bpf.h clang errored it couldn't find asm/errno.h so I added -v to cflags when V is set. As a quick fix, I hardcoded include path to x86_64 host header files. In contrast to clang, gcc corretly has this in it's default include path. Do you have any idea how to set this path, preferably so it doesn't only work on ubuntu?
Experimentally I added BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID to bpf_get_stackid. It doesn't seem to break the program and should make recording faster. If somebody knows why adding them is a bad idea please do tell.
If this gets a positive reaction, I will look into converting other tools here to use eBPF instead of /proc/kallsyms when possible.
Best, Krzysztof
I've watched now Evolution of stack trace captures with BPF - Andrii Nakryiko Originally posted by @ekyooo in https://github.com/iovisor/bcc/pull/5181#discussion_r1909752691
I think we should consider keeping BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID since for some users latency of profiling may be more important and if hash collision happens we are losing data anyway.
Perhaps creating an option --fast, where the possibility of overwriting data on hash colisions is mentioned in it's description. Perhaps make this the default behavior and make a flag to turn it off.