bpftrace [RFC] Probe expansion in codegen

While working on #2334, I realized that we'll need to significantly change the way we do probe expansion, so I'm opening an RFC to see what other people's opinions are before I start implementing it.

Let us have a simple wildcarded probe kfunc:vfs_* { ... }.

At the moment, we generate one LLVM function, LLVM generates one BPF program from it, then we perform the expansion (74 probes), load the BPF program 74 times (each time with a different BTF id of the probe), and attach each instance to a different probe.
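To illustrate the expansion step, here is a minimal sketch that counts how many attach targets a wildcard produces. It assumes POSIX `fnmatch` as a stand-in for bpftrace's own wildcard matcher, and `count_expanded_probes` is a hypothetical helper, not a function from the codebase.

```c
#include <fnmatch.h>
#include <stddef.h>

/* Hypothetical helper: count the candidate function names matched by a
 * wildcard such as "vfs_*". Each match is one attach target, which today
 * means one load of the same BPF program with a different BTF id. */
static size_t count_expanded_probes(const char *pattern,
                                    const char *const *candidates, size_t n)
{
    size_t matches = 0;
    for (size_t i = 0; i < n; i++) {
        /* fnmatch() returns 0 when the name matches the shell-style pattern */
        if (fnmatch(pattern, candidates[i], 0) == 0)
            matches++;
    }
    return matches;
}
```

With the real list of traceable functions as `candidates`, the result for `vfs_*` would be the 74 mentioned above.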

The problem is that if we delegate probe loading to libbpf, it will need to discover the probes from the ELF object and therefore we'll need to generate 74 copies of the same LLVM function (unless we somehow force LLVM to create multiple symbol table entries for the same BPF function). This will heavily enlarge the codegen output and the size of the ELF file.
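For a rough feeling of the size difference, here is a back-of-the-envelope sketch. It only counts instruction bytes and ELF64 symbol table entries and ignores string tables, BTF, and relocation sections, so the numbers are a lower bound. Both helper names are made up for this example.

```c
#include <stddef.h>

#define BPF_INSN_SIZE  8   /* each BPF instruction slot is 8 bytes */
#define ELF64_SYM_SIZE 24  /* size of one ELF64 symbol table entry */

/* Full expansion: every wildcard match gets its own copy of the
 * instructions plus its own symbol. */
static size_t full_expansion_bytes(size_t matches, size_t insns)
{
    return matches * (insns * BPF_INSN_SIZE + ELF64_SYM_SIZE);
}

/* Shared body: one copy of the instructions, one symbol per match. */
static size_t shared_body_bytes(size_t matches, size_t insns)
{
    return insns * BPF_INSN_SIZE + matches * ELF64_SYM_SIZE;
}
```

For 74 matches and a hypothetical 100-instruction program, this gives 60976 bytes for full expansion versus 2576 bytes for a shared body.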

My stance is that this is still worth it, as moving to libbpf will have several advantages:

less code (e.g. no need for custom relocations),
possibility to use libbpf's attachment in future (removing dependency on BCC),
a "standard" ELF produced by bpftrace, possibly usable by loaders other than libbpf,
access to other libbpf features.

The codegen/ELF size itself is a hidden technical detail and it'll probably only cause trouble for debugging. Also, for some probes (e.g. kprobe), we'll be ok with just a single LLVM function but for others (like k(ret)func), we'll always need to do the expansion.

Feb 14 '24 12:02 viktormalik

1. we generate one LLVM function
2. LLVM generates one BPF program from it
3. we perform the expansion (74 probes)
4. load the BPF program 74 times (each time with a different BTF id of the probe)
5. attach each instance to a different probe

What do you mean by "perform the expansion" in step 3? I'm not familiar with this code - are we copying the bytecode currently?

My concern with unnecessary probe expansion is the performance impact if we want to attach to 100k+ probes (e.g. fentry:*, fexit:* {}). Generating multiple symbols for the same function seems like something that should be possible to me.

Feb 14 '24 18:02 ajor

1. we generate one LLVM function
2. LLVM generates one BPF program from it
3. we perform the expansion (74 probes)
4. load the BPF program 74 times (each time with a different BTF id of the probe)
5. attach each instance to a different probe

What do you mean by "perform the expansion" in step 3? I'm not familiar with this code - are we copying the bytecode currently?

We're not copying it directly but we create one Probe (from types.h) object per expanded probe (see bpftrace::add_probe), then call bpf_prog_load for each, and the copy is done in the kernel upon loading.

My concern with unnecessary probe expansion is the performance impact if we want to attach to 100k+ probes (e.g. fentry:*, fexit:* {}).

In reality, attaching to such a large number of fentry probes is already terribly slow (it's caused by the kernel, not bpftrace):

# time src/bpftrace -e 'kfunc:vfs_* { @[func] = count() } i:ms:1 { exit() }'
Attaching 75 probes...
[...]
real	0m20.172s
user	0m0.644s
sys	0m0.982s

# time src/bpftrace -e 'kfunc:cpu* { @[func] = count() } i:ms:1 { exit() }'
Attaching 375 probes...
[...]
real	1m38.768s
user	0m2.197s
sys	0m2.964s

Also remember that bpftrace itself has a default limit of 512 probes (it can be raised by setting an environment variable).
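The limit check can be sketched roughly as follows. This is not the actual bpftrace code: the function names are hypothetical, and `BPFTRACE_MAX_PROBES` in the usage note is my assumption for the variable name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define DEFAULT_MAX_PROBES 512UL

/* Parse the override value (e.g. the result of getenv()); fall back to
 * the default when it is unset, empty, or not a plain decimal number. */
static unsigned long parse_max_probes(const char *env)
{
    if (env != NULL && *env != '\0') {
        char *end = NULL;
        unsigned long value = strtoul(env, &end, 10);
        if (*end == '\0')
            return value;
    }
    return DEFAULT_MAX_PROBES;
}

/* True when the number of expanded probes fits under the limit. */
static bool within_probe_limit(unsigned long expanded, const char *env)
{
    return expanded <= parse_max_probes(env);
}
```

A caller would pass something like `getenv("BPFTRACE_MAX_PROBES")` as `env`. With the default, the 375-probe example above passes, while a 100k-probe `fentry:*` would be rejected before any codegen happens.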

All in all, attaching to a huge number of kfuncs is not practical and it's not their main use-case in the first place. The only other probe types which could use such a large number of attach points are kprobes and uprobes, and here we could use kprobe_multi and uprobe_multi link types and generate just a single LLVM function.

Generating multiple symbols for the same function seems like something that should be possible to me.

I agree but we'd still rely on libbpf to do the program collection by iterating the symbol table. If that ever changes (IMHO it's very unlikely), we'd have to adapt. Also, we can always add this if we find that there are performance issues with the full expansion approach.

Feb 16 '24 13:02 viktormalik

Trying to digest this a bit. It seems (as per @viktormalik's point) that perhaps the only real concern here is around expansion of kfunc/kretfunc as we can use the "multi" variants for the kprobes/uprobes. If attaching to this many kfuncs is an anti-pattern, of sorts, I'm fine with the unoptimized approach (copies of the same LLVM function in the ELF file) if that's easier and (perhaps?) more future proof than messing around with the symbol table. We can also issue warnings to the user about both the size of the ELF file and the number of attached kfuncs (perhaps encouraging the use of kprobes in that situation). All that said, I don't feel strongly.

Feb 16 '24 16:02 jordalgo

Trying to digest this a bit. It seems (as per @viktormalik's point) that perhaps the only real concern here is around expansion of kfunc/kretfunc as we can use the "multi" variants for the kprobes/uprobes.

The only problem is that the "multi" variants are rather new and therefore won't be supported on older kernels. Still, the 512-probe limit would apply on those kernels, so we shouldn't get an ELF with thousands of copies of a BPF function.

If attaching to this many kfuncs is an anti-pattern, of sorts, I'm fine with the unoptimized approach (copies of the same LLVM function in the ELF file) if that's easier and (perhaps?) more future proof than messing around with the symbol table.

I agree, unless the compiler has a "standard" way to do that. I haven't found any, yet.

We can also issue warnings to the user about both the size of the ELF file and the number of attached kfuncs (perhaps encouraging the use of kprobes in that situation). All that said, I don't feel strongly.

Feb 19 '24 06:02 viktormalik

Attaching to a huge number of fentry probes isn't currently possible due to the kernel's performance, as you said. It is something that users want to do and should be able to do though, so we need to keep it in mind for whenever a kernel fix comes along. This is something that @tyroguru is interested in.

It's the probe detach which is slow rather than the attach, if that makes any difference (try with this script: fentry:vfs_* { } BEGIN { print("begin"); } END { print("end") }).

Duplicate Symbols

I can create duplicate symbols for functions with Clang, so there must be an interface for doing this in libLLVM:

void bar()  {}
asm("asdf:");
asm(".globl asdf");
void foo() {}

0000000000000000 T bar
0000000000000007 T asdf
0000000000000007 T foo

Maybe MCContext::getOrCreateSymbol? https://llvm.org/doxygen/classllvm_1_1MCContext.html#ac11eef690074972378846024abbe8722

libbpf

It looks like retsnoop is doing something special for mass attaching to fentries, but I suppose it has the requirement of being compiled ahead of time: https://github.com/anakryiko/retsnoop/blob/2d730d468719ed35d0f3bc2dbc958bd90f31342e/src/mass_attacher.c#L510-L528

Pinging @anakryiko for any input on using libbpf.

Feb 19 '24 15:02 ajor

It looks like retsnoop is doing something special for mass attaching to fentries, but I suppose it has the requirement of being compiled ahead of time: https://github.com/anakryiko/retsnoop/blob/2d730d468719ed35d0f3bc2dbc958bd90f31342e/src/mass_attacher.c#L510-L528

Pinging @anakryiko for any input on using libbpf.

There is nothing that retsnoop or libbpf can do to speed up attachment/detachment of fentry/fexit BPF programs, unfortunately. The kernel doesn't support single-shot multi-attachment for them (there were discussions, but it never got implemented). The piece you linked just prepares a few different copies of the programs, depending on the number of arguments. This is done to let libbpf perform relocations and all other adjustments, so that retsnoop can just grab the raw BPF instructions and clone them for each btf_id (see clone_prog(), https://github.com/anakryiko/retsnoop/blob/2d730d468719ed35d0f3bc2dbc958bd90f31342e/src/mass_attacher.c#L977). So fentry/fexit mode is supported by retsnoop, but it's slow, has its own limitations, and is definitely not the preferred mode. It does have advantages in some situations (fentry pollutes LBR entries much less compared to kprobes).

It's very different for kprobe/kretprobe. By default, retsnoop uses multi-kprobes and can attach to thousands of functions almost instantaneously, with just one program for entry and one for exit.

Feb 19 '24 19:02 anakryiko

There is nothing that retsnoop or libbpf can do to speed up attachment/detachment of fentry/fexit BPF programs, unfortunately. The kernel doesn't support single-shot multi-attachment for them (there were discussions, but it never got implemented). The piece you linked just prepares a few different copies of the programs, depending on the number of arguments. This is done to let libbpf perform relocations and all other adjustments, so that retsnoop can just grab the raw BPF instructions and clone them for each btf_id (see clone_prog(), https://github.com/anakryiko/retsnoop/blob/2d730d468719ed35d0f3bc2dbc958bd90f31342e/src/mass_attacher.c#L977). So fentry/fexit mode is supported by retsnoop, but it's slow, has its own limitations, and is definitely not the preferred mode. It does have advantages in some situations (fentry pollutes LBR entries much less compared to kprobes).

It's very different for kprobe/kretprobe. By default, retsnoop uses multi-kprobes and can attach to thousands of functions almost instantaneously, with just one program for entry and one for exit.

Thanks for the insights @anakryiko. The clone_prog part looks like what we're doing in bpftrace for every probe type now: we call bpf_prog_load for each attachment target. I'd like to get rid of this approach since it prevents us from using struct bpf_object to manipulate BPF programs (and all the features that come with it). The idea was to do the cloning at the LLVM level, but in the case of fentry/fexit programs (or kprobes when kprobe-multi is not available), it may lead to very large ELF objects, unless we're able to do the cloning efficiently (see below).

Duplicate Symbols

I can create duplicate symbols for functions with Clang, so there must be an interface for doing this in libLLVM: [...] Maybe MCContext::getOrCreateSymbol? https://llvm.org/doxygen/classllvm_1_1MCContext.html#ac11eef690074972378846024abbe8722

There's also symbol aliasing in LLVM which sounds like what we need. I'll have a look into it.
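At the C level, GCC and Clang expose aliasing on ELF targets through the `alias` attribute, which gives a small way to see the idea in action. This is only a sketch on a regular host target; I have not verified how it behaves with the BPF backend, and the function names are invented for the example.

```c
/* One function body... */
int probe_body(int x)
{
    return x + 1;
}

/* ...and a second symbol that resolves to the same address.
 * The alias target must be defined in the same translation unit. */
int probe_alias(int x) __attribute__((alias("probe_body")));
```

Running `nm` on the resulting object should list `probe_body` and `probe_alias` at the same offset, similar to `asdf` and `foo` in the output quoted above.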

Feb 20 '24 08:02 viktormalik

There's also symbol aliasing in LLVM which sounds like what we need. I'll have a look into it.

I did some investigation and experiments here and found that symbol aliases do indeed produce multiple symbols with the same address, and libbpf correctly discovers them as separate BPF programs (and makes a copy of the instructions for each). The problem is that libbpf relocations will not work because libbpf doesn't account for multiple programs sharing the same instructions in the ELF file.
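To make the mismatch concrete, here is a simplified model of what a symbol-table walk sees. It is not libbpf code: `struct func_sym` and `count_distinct_bodies` are hypothetical and only capture the name and address of each function symbol.

```c
#include <stddef.h>

/* Hypothetical, stripped-down view of a function symbol. */
struct func_sym {
    const char *name;
    unsigned long addr; /* offset of the function inside its section */
};

/* Count how many distinct instruction bodies back the given symbols.
 * A loader that creates one program per symbol sees n programs, while
 * the section only holds this many bodies. */
static size_t count_distinct_bodies(const struct func_sym *syms, size_t n)
{
    size_t distinct = 0;
    for (size_t i = 0; i < n; i++) {
        int seen = 0;
        for (size_t j = 0; j < i; j++) {
            if (syms[j].addr == syms[i].addr) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

With aliases, three symbols at the same offset give three programs but a single body. Any bookkeeping keyed by instruction offset within the section (which is my understanding of where the relocations go wrong) then has to cope with several programs owning the same offset.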

This should be possible to fix on libbpf side but it's a bigger change so I'd suggest going with full expansion (i.e. one LLVM function per wildcard match) for the first version of #2334.

May 06 '24 08:05 viktormalik